Overview
The evaluation module of align-anything supports benchmarks across a variety of modalities, including Text→Text, Text+Image/Video/Audio→Text, and Text→Image/Video/Audio. For most modalities, we provide vLLM and DeepSpeed as generation backends, so users can choose based on their devices and environments. Because diffusion-based generation models for the audio and video modalities mostly rely on different frameworks, we also provide an option to load local files directly for evaluation in Text→Image/Audio/Video tasks (development is complete and testing is underway).
Supported Benchmarks
| Benchmark | Modality | Supported Backends |
|---|---|---|
| ARC | Text→Text | vLLM |
| BBH | Text→Text | vLLM |
| Belebele | Text→Text | vLLM |
| CMMLU | Text→Text | vLLM |
| GSM8K | Text→Text | vLLM |
| HumanEval | Text→Text | vLLM |
| MMLU | Text→Text | vLLM |
| MMLU-Pro | Text→Text | vLLM |
| MT-Bench | Text→Text | vLLM |
| PAWS-X | Text→Text | vLLM |
| RACE | Text→Text | vLLM |
| TruthfulQA | Text→Text | vLLM |
| A-OKVQA | Text+Image→Text | vLLM/DeepSpeed |
| LLaVA-Bench(COCO) | Text+Image→Text | vLLM |
| LLaVA-Bench(wild) | Text+Image→Text | vLLM |
| MathVista | Text+Image→Text | vLLM/DeepSpeed |
| MM-SafetyBench | Text+Image→Text | vLLM/DeepSpeed |
| MMBench | Text+Image→Text | vLLM/DeepSpeed |
| MME | Text+Image→Text | vLLM/DeepSpeed |
| MMMU | Text+Image→Text | vLLM |
| MMStar | Text+Image→Text | vLLM/DeepSpeed |
| MMVet | Text+Image→Text | vLLM/DeepSpeed |
| POPE | Text+Image→Text | vLLM/DeepSpeed |
| ScienceQA | Text+Image→Text | vLLM |
| SPA-VL | Text+Image→Text | vLLM/DeepSpeed |
| TextVQA | Text+Image→Text | vLLM |
| VizWizVQA | Text+Image→Text | vLLM/DeepSpeed |
| AIR-Bench | Text+Audio→Text | DeepSpeed |
| MVBench | Text+Video→Text | vLLM |
| Video-MME | Text+Video→Text | vLLM |
| COCO-val2014-30k | Text→Image | Accelerate |
| HPSv2 | Text→Image | Accelerate |
| ImageReward | Text→Image | Accelerate |
| AudioCaps | Text→Audio | Accelerate |
| ChronoMagic-Bench | Text→Video | Accelerate |
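
When scripting evaluation runs, it can help to encode the table as a lookup so that a backend is chosen consistently. The sketch below is purely illustrative: `SUPPORTED_BACKENDS` and `pick_backend` are hypothetical names that are not part of align-anything, and only a few rows of the table are copied in.

```python
# Hypothetical helper, NOT part of align-anything's API.
# The entries below are copied from a few rows of the table above.
SUPPORTED_BACKENDS = {
    "MMLU": ("vLLM",),
    "MathVista": ("vLLM", "DeepSpeed"),
    "AIR-Bench": ("DeepSpeed",),
    "HPSv2": ("Accelerate",),
}


def pick_backend(benchmark, preferred="vLLM"):
    """Return `preferred` if the benchmark lists it, else the first listed backend."""
    try:
        backends = SUPPORTED_BACKENDS[benchmark]
    except KeyError:
        raise ValueError(f"unknown benchmark: {benchmark!r}") from None
    return preferred if preferred in backends else backends[0]


if __name__ == "__main__":
    for name in SUPPORTED_BACKENDS:
        print(f"{name}: {pick_backend(name)}")
```

Falling back to the first listed backend instead of raising is a design choice for this sketch; a stricter script might reject an unsupported preference outright.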