Overview

The evaluation module of align-anything supports benchmarks across a range of modalities, including Text→Text, Text+Image/Video/Audio→Text, and Text→Image/Video/Audio. For most modalities, we provide vLLM and DeepSpeed as generation backends, so users can choose whichever suits their devices and environment. Because diffusion-based generation models for the audio and video modalities mostly rely on different frameworks, we also provide an option to load local files directly for evaluation in Text→Image/Audio/Video tasks (development is complete and testing is underway).

Supported Benchmarks

Benchmark Table

Benchmark	Modality	Supported Backend
ARC	Text→Text	vLLM
BBH	Text→Text	vLLM
Belebele	Text→Text	vLLM
CMMLU	Text→Text	vLLM
GSM8K	Text→Text	vLLM
HumanEval	Text→Text	vLLM
MMLU	Text→Text	vLLM
MMLU-Pro	Text→Text	vLLM
MT-Bench	Text→Text	vLLM
PAWS-X	Text→Text	vLLM
RACE	Text→Text	vLLM
TruthfulQA	Text→Text	vLLM
A-OKVQA	Text+Image→Text	vLLM/DeepSpeed
LLaVA-Bench(COCO)	Text+Image→Text	vLLM
LLaVA-Bench(wild)	Text+Image→Text	vLLM
MathVista	Text+Image→Text	vLLM/DeepSpeed
MM-SafetyBench	Text+Image→Text	vLLM/DeepSpeed
MMBench	Text+Image→Text	vLLM/DeepSpeed
MME	Text+Image→Text	vLLM/DeepSpeed
MMMU	Text+Image→Text	vLLM
MMStar	Text+Image→Text	vLLM/DeepSpeed
MMVet	Text+Image→Text	vLLM/DeepSpeed
POPE	Text+Image→Text	vLLM/DeepSpeed
ScienceQA	Text+Image→Text	vLLM
SPA-VL	Text+Image→Text	vLLM/DeepSpeed
TextVQA	Text+Image→Text	vLLM
VizWizVQA	Text+Image→Text	vLLM/DeepSpeed
AIR-Bench	Text+Audio→Text	DeepSpeed
MVBench	Text+Video→Text	vLLM
Video-MME	Text+Video→Text	vLLM
COCO-val2014-30k	Text→Image	Accelerate
HPSv2	Text→Image	Accelerate
ImageReward	Text→Image	Accelerate
AudioCaps	Text→Audio	Accelerate
ChronoMagic-Bench	Text→Video	Accelerate
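
The table above can also be read as a simple lookup from benchmark to backend. The following is a minimal sketch of that idea, not part of align-anything itself: the `SUPPORTED_BACKENDS` dictionary and the `pick_backend` helper are hypothetical names introduced here for illustration, and the dictionary copies only a few rows of the table.

```python
# Hypothetical helper, NOT part of the align-anything API.
# The entries below are copied from a subset of the benchmark table above.
SUPPORTED_BACKENDS = {
    "MMLU": ("Text→Text", ("vLLM",)),
    "GSM8K": ("Text→Text", ("vLLM",)),
    "MME": ("Text+Image→Text", ("vLLM", "DeepSpeed")),
    "POPE": ("Text+Image→Text", ("vLLM", "DeepSpeed")),
    "AIR-Bench": ("Text+Audio→Text", ("DeepSpeed",)),
    "MVBench": ("Text+Video→Text", ("vLLM",)),
    "HPSv2": ("Text→Image", ("Accelerate",)),
    "AudioCaps": ("Text→Audio", ("Accelerate",)),
}


def pick_backend(benchmark: str, preferred: str = "vLLM") -> str:
    """Return `preferred` if the benchmark lists it, else its first listed backend."""
    if benchmark not in SUPPORTED_BACKENDS:
        raise KeyError(f"Unknown benchmark: {benchmark!r}")
    _modality, backends = SUPPORTED_BACKENDS[benchmark]
    # Compare case-insensitively so "deepspeed" matches "DeepSpeed".
    for backend in backends:
        if backend.lower() == preferred.lower():
            return backend
    return backends[0]


if __name__ == "__main__":
    print(pick_backend("MME", preferred="DeepSpeed"))  # DeepSpeed
    print(pick_backend("AIR-Bench"))                   # DeepSpeed (vLLM not listed)
    print(pick_backend("HPSv2"))                       # Accelerate
```

Falling back to the first listed backend is a design choice of this sketch; a stricter variant could raise an error whenever the preferred backend is not listed for the benchmark.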