Overview

The evaluation module of align-anything supports benchmarks across a range of modalities, including Text→Text, Text+Image/Video/Audio→Text, and Text→Image/Video/Audio. For most modalities, we provide vLLM and DeepSpeed as generation backends, so users can choose the one that suits their devices and environments. Because diffusion-based generation models for the audio and video modalities primarily rely on different frameworks, we also provide an option to load local files directly for evaluation in Text→Image/Audio/Video tasks (development is complete and testing is underway).

Supported Benchmarks

Benchmark Table
| Benchmark | Modality | Supported Backend |
| --- | --- | --- |
| ARC | Text→Text | vLLM |
| BBH | Text→Text | vLLM |
| Belebele | Text→Text | vLLM |
| CMMLU | Text→Text | vLLM |
| GSM8K | Text→Text | vLLM |
| HumanEval | Text→Text | vLLM |
| MMLU | Text→Text | vLLM |
| MMLU-Pro | Text→Text | vLLM |
| MT-Bench | Text→Text | vLLM |
| PAWS-X | Text→Text | vLLM |
| RACE | Text→Text | vLLM |
| TruthfulQA | Text→Text | vLLM |
| A-OKVQA | Text+Image→Text | vLLM/DeepSpeed |
| LLaVA-Bench(COCO) | Text+Image→Text | vLLM |
| LLaVA-Bench(wild) | Text+Image→Text | vLLM |
| MathVista | Text+Image→Text | vLLM/DeepSpeed |
| MM-SafetyBench | Text+Image→Text | vLLM/DeepSpeed |
| MMBench | Text+Image→Text | vLLM/DeepSpeed |
| MME | Text+Image→Text | vLLM/DeepSpeed |
| MMMU | Text+Image→Text | vLLM |
| MMStar | Text+Image→Text | vLLM/DeepSpeed |
| MMVet | Text+Image→Text | vLLM/DeepSpeed |
| POPE | Text+Image→Text | vLLM/DeepSpeed |
| ScienceQA | Text+Image→Text | vLLM |
| SPA-VL | Text+Image→Text | vLLM/DeepSpeed |
| TextVQA | Text+Image→Text | vLLM |
| VizWizVQA | Text+Image→Text | vLLM/DeepSpeed |
| AIR-Bench | Text+Audio→Text | DeepSpeed |
| MVBench | Text+Video→Text | vLLM |
| Video-MME | Text+Video→Text | vLLM |
| COCO-val2014-30k | Text→Image | Accelerate |
| HPSv2 | Text→Image | Accelerate |
| ImageReward | Text→Image | Accelerate |
| AudioCaps | Text→Audio | Accelerate |
| ChronoMagic-Bench | Text→Video | Accelerate |
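
When scripting evaluation runs, it can help to check which backend a benchmark supports before launching it. The following is a minimal sketch, not part of align-anything: `SUPPORTED_BACKENDS` and `pick_backend` are hypothetical names, and the dictionary only transcribes a few rows of the table above by hand.

```python
# Hypothetical lookup transcribed from a few rows of the benchmark table;
# this is an illustration, not an align-anything API.
SUPPORTED_BACKENDS = {
    "GSM8K": ("Text→Text", ("vLLM",)),
    "MMBench": ("Text+Image→Text", ("vLLM", "DeepSpeed")),
    "AIR-Bench": ("Text+Audio→Text", ("DeepSpeed",)),
    "MVBench": ("Text+Video→Text", ("vLLM",)),
    "HPSv2": ("Text→Image", ("Accelerate",)),
}


def pick_backend(benchmark: str, preferred: str = "vLLM") -> str:
    """Return `preferred` if the benchmark lists it, else the first listed backend."""
    try:
        _modality, backends = SUPPORTED_BACKENDS[benchmark]
    except KeyError:
        raise ValueError(f"Unknown benchmark: {benchmark!r}") from None
    # Compare case-insensitively so "deepspeed" matches "DeepSpeed".
    for name in backends:
        if name.lower() == preferred.lower():
            return name
    return backends[0]


if __name__ == "__main__":
    for name in ("GSM8K", "AIR-Bench", "HPSv2"):
        print(name, "->", pick_backend(name))
```

Falling back to the first listed backend keeps the helper usable for benchmarks such as AIR-Bench, where the preferred backend is not available; a stricter script could raise an error instead.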