Models are ranked by their average performance on perception and reasoning tasks, from highest to lowest. "Counting", "Scene Classification", and the other column headers each denote a specific L2 sub-task; "Avg" is the weighted average accuracy across all L2 sub-tasks. By default, the leaderboard is sorted by the overall average ("Avg").

Results on the English VQA benchmark. The first four sub-task columns (Counting through Object Properties) belong to the Perception domain; the last four (Complex Reasoning through Anomaly Reasoning) belong to the Reasoning domain. A dash (-) marks a missing score.
| # | Method | Lang | Counting | Scene Classification | Object Spatial Relationship | Object Properties | Complex Reasoning | Planning | Spatiotemporal Reasoning | Anomaly Reasoning | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CogVLM2 (Llama3-8B) | en | 36.66 | 46.26 | 34.53 | 36.89 | 60.42 | 33.54 | - | 69.67 | 39.58 |
| 2 | Qwen2-VL (Qwen2-7B) | en | 39.72 | 48.30 | 30.79 | 35.07 | 64.19 | 31.77 | 45.93 | 68.97 | 38.81 |
| 3 | GPT-4o-mini (Proprietary) | en | 28.32 | 45.46 | 29.74 | 36.29 | 55.13 | 40.62 | 15.19 | 72.86 | 38.24 |
| 4 | LLaVA-OneVision (Qwen2-7B) | en | 35.84 | 47.76 | 26.96 | 35.95 | 60.61 | 24.42 | 37.78 | 72.06 | 38.09 |
| 5 | InternVL2 (InternLM2.5-7B) | en | 33.91 | 44.93 | 22.10 | 34.84 | 66.43 | 33.01 | 44.44 | 74.18 | 37.21 |
| 6 | InternLM-XComposer-2.5 (InternLM2-7B) | en | 35.84 | 47.43 | 28.74 | 30.60 | 64.76 | 35.75 | 32.22 | 69.85 | 35.76 |
| 7 | LLaVA-Next (Llama3-8B) | en | 38.00 | 40.79 | 31.10 | 31.22 | 62.04 | 26.19 | 32.22 | 69.67 | 35.48 |
| 8 | GPT-4o (Proprietary) | en | 29.51 | 48.55 | 29.57 | 24.78 | 52.36 | 42.65 | 21.85 | 71.00 | 31.73 |
| 9 | LLaVA-1.5 (Vicuna-7B) | en | 22.65 | 19.69 | 23.22 | 20.21 | 33.86 | 38.67 | 29.26 | 37.49 | 22.93 |
| 10 | GeoChat (Vicuna-7B) | en | 22.65 | 18.96 | 23.64 | 20.21 | 30.95 | 32.57 | - | 34.39 | 22.18 |