XLRS-Bench

Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang1, Hongzhen Wang2, Zonghao Guo2, Di Wang3,5, Yulin Wang2, Mingshuo Chen4, Qiang Ma2, Long Lan1, Wenjing Yang1*, Jing Zhang3,6, Zhiyuan Liu2, Maosong Sun2
1College of Computer Science and Technology, National University of Defense Technology, 2Tsinghua University, 3School of Computer Science, Wuhan University, 4Beijing University of Posts and Telecommunications, 5Zhongguancun Academy, 6School of Artificial Intelligence, Wuhan University
CVPR 2025 Highlight

Introduction

   Remote sensing (RS) images have become essential for monitoring and understanding human environments, driving advances in applications such as precision agriculture, urban planning, and disaster assessment. While recent studies have proposed benchmarks and metrics to assess the performance of multimodal large language models (MLLMs) in RS, these efforts remain limited in image size, annotation method, and evaluation dimensions.
   We present XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios, featuring the largest average image size (8,500×8,500 pixels) reported to date. Our dataset encompasses 45,942 annotations across 16 tasks, all expertly curated by a team of 45 experts. The main advantages of XLRS-Bench over existing MLLM benchmarks are as follows:
   1. Ultra-high Resolution. XLRS-Bench offers the largest image sizes available, 10∼20× larger than those of existing datasets, with 840 images at a resolution of 10,000×10,000 pixels.
   2. High-quality Annotation. All annotations are produced with human involvement and manually verified through multiple iterations, resulting in a high-quality benchmark for evaluating MLLMs in real ultra-high-resolution RS scenarios.
   3. Comprehensive Evaluation Dimensions. XLRS-Bench covers 10 perception indicators and 6 reasoning dimensions to assess MLLMs' capabilities, encompassing 16 sub-tasks with a total of 45,942 questions (a minimal scoring sketch follows this list). Notably, XLRS-Bench includes complex reasoning tasks to explore MLLMs' potential for planning and change detection in long spatial-temporal RS scenarios.
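To make the VQA-style evaluation concrete, the snippet below is a minimal Python sketch of scoring multiple-choice answers per sub-task. The annotation file name, its JSON schema (question_id, sub_task, answer keys), and the prediction format are assumptions for illustration only, not the official XLRS-Bench toolkit.

```python
# Minimal sketch of scoring multiple-choice VQA predictions per sub-task.
# The file name "xlrs_bench_vqa.json" and its schema are hypothetical; consult
# the released benchmark files for the real layout.
import json
from collections import defaultdict

def evaluate_vqa(annotation_file: str, predictions: dict) -> dict:
    """Return per-sub-task accuracy given {question_id: predicted option letter}."""
    with open(annotation_file, encoding="utf-8") as f:
        samples = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        task = sample["sub_task"]            # e.g. "Counting", "Planning"
        total[task] += 1
        pred = predictions.get(sample["question_id"], "").strip().upper()
        if pred == sample["answer"]:
            correct[task] += 1

    return {task: correct[task] / total[task] for task in total}
```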

Benchmark Comparison


Comparison of XLRS-Bench with other benchmarks. Annotation types are marked as machine-generated, manually written, or semi-automated (i.e., machine generation followed by human verification).

Leaderboard


Models are ranked by their average performance on perception and reasoning tasks, from highest to lowest. In the VQA tables below, "Counting", "Scene Classification", "Object Spatial Relationship", and "Object Properties" are Perception L2 sub-tasks, while "Complex Reasoning", "Planning", "Spatiotemporal Reasoning", and "Anomaly Reasoning" are Reasoning L2 sub-tasks. "Avg" indicates the weighted average accuracy across all L2 sub-tasks, and the tables are sorted by this overall score.
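As a concrete illustration of how such a weighted average can be computed, here is a minimal Python sketch. It assumes the weighting is by the number of questions per sub-task, and the numbers in the usage example are made up; see the paper for the authoritative evaluation protocol.

```python
# Sketch of the "Avg" column as a question-count-weighted mean of sub-task accuracies.
# Weighting by question count is an assumption, not the documented protocol.

def weighted_average(accuracies: dict, counts: dict) -> float:
    """accuracies: per-sub-task accuracy in [0, 1]; counts: questions per sub-task."""
    total = sum(counts.values())
    return sum(accuracies[t] * counts[t] for t in accuracies) / total

# Made-up example values, not benchmark results:
acc = {"Counting": 0.37, "Scene Classification": 0.48, "Planning": 0.32}
n = {"Counting": 1200, "Scene Classification": 800, "Planning": 500}
print(f"Avg = {weighted_average(acc, n):.4f}")
```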

Results on English VQA benchmark.

| # | Method | Backbone | Lang | Counting | Scene Classification | Object Spatial Relationship | Object Properties | Complex Reasoning | Planning | Spatiotemporal Reasoning | Anomaly Reasoning | Avg |
|---|--------|----------|------|----------|-----------------------|------------------------------|--------------------|--------------------|----------|---------------------------|--------------------|-----|
| 1 | CogVLM2 | Llama3-8B | en | 36.66 | 46.26 | 34.53 | 36.89 | 60.42 | 33.54 | - | 69.67 | 39.58 |
| 2 | Qwen2-VL | Qwen2-7B | en | 39.72 | 48.30 | 30.79 | 35.07 | 64.19 | 31.77 | 45.93 | 68.97 | 38.81 |
| 3 | GPT-4o-mini | Proprietary | en | 28.32 | 45.46 | 29.74 | 36.29 | 55.13 | 40.62 | 15.19 | 72.86 | 38.24 |
| 4 | LLaVA-OneVision | Qwen2-7B | en | 35.84 | 47.76 | 26.96 | 35.95 | 60.61 | 24.42 | 37.78 | 72.06 | 38.09 |
| 5 | InternVL2 | InternLM2.5-7B | en | 33.91 | 44.93 | 22.10 | 34.84 | 66.43 | 33.01 | 44.44 | 74.18 | 37.21 |
| 6 | InternLM-XComposer-2.5 | InternLM2-7B | en | 35.84 | 47.43 | 28.74 | 30.60 | 64.76 | 35.75 | 32.22 | 69.85 | 35.76 |
| 7 | LLaVA-Next | Llama3-8B | en | 38.00 | 40.79 | 31.10 | 31.22 | 62.04 | 26.19 | 32.22 | 69.67 | 35.48 |
| 8 | GPT-4o | Proprietary | en | 29.51 | 48.55 | 29.57 | 24.78 | 52.36 | 42.65 | 21.85 | 71.00 | 31.73 |
| 9 | LLaVA-1.5 | Vicuna-7B | en | 22.65 | 19.69 | 23.22 | 20.21 | 33.86 | 38.67 | 29.26 | 37.49 | 22.93 |
| 10 | GeoChat | Vicuna-7B | en | 22.65 | 18.96 | 23.64 | 20.21 | 30.95 | 32.57 | - | 34.39 | 22.18 |

Results on Chinese VQA benchmark.

| # | Method | Backbone | Lang | Counting | Scene Classification | Object Spatial Relationship | Object Properties | Complex Reasoning | Planning | Spatiotemporal Reasoning | Anomaly Reasoning | Avg |
|---|--------|----------|------|----------|-----------------------|------------------------------|--------------------|--------------------|----------|---------------------------|--------------------|-----|
| 1 | Qwen2-VL | Qwen2-7B | zh | 39.49 | 49.28 | 33.26 | 37.89 | 67.62 | 24.34 | 44.44 | 76.57 | 41.10 |
| 2 | InternVL2 | InternLM2.5-7B | zh | 34.42 | 39.12 | 34.24 | 37.54 | 68.57 | 40.09 | 44.07 | 76.57 | 40.58 |
| 3 | LLaVA-OneVision | Qwen2-7B | zh | 37.48 | 47.06 | 31.73 | 34.72 | 62.56 | 29.12 | 30.37 | 74.80 | 38.42 |
| 4 | GPT-4o-mini | Proprietary | zh | 29.51 | 44.13 | 29.59 | 35.41 | 55.36 | 41.86 | 21.85 | 74.71 | 37.82 |
| 5 | InternLM-XComposer-2.5 | InternLM2-7B | zh | 37.70 | 43.15 | 32.62 | 33.40 | 62.99 | 29.47 | 24.44 | 69.14 | 37.26 |
| 6 | CogVLM2 | Llama3-8B | zh | 36.21 | 45.79 | 34.57 | 31.03 | 59.56 | 26.19 | - | 70.47 | 35.84 |
| 7 | LLaVA-Next | Llama3-8B | zh | 33.08 | 39.52 | 31.98 | 29.34 | 54.07 | 21.59 | 25.56 | 69.85 | 33.48 |
| 8 | GPT-4o | Proprietary | zh | 22.28 | 45.47 | 31.25 | 24.93 | 45.45 | 27.08 | 15.19 | 69.41 | 30.40 |
| 9 | LLaVA-1.5 | Vicuna-7B | zh | 22.65 | 19.07 | 23.01 | 20.21 | 33.95 | 38.67 | 29.26 | 37.58 | 22.86 |
| 10 | GeoChat | Vicuna-7B | zh | 22.65 | 19.17 | 23.05 | 20.21 | 24.75 | 22.79 | - | 23.74 | 20.99 |

Results on English Captioning benchmark.

| Method | Language | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|--------|----------|--------|--------|--------|--------|--------|---------|
| GeoChat | en | 16.74 | 8.38 | 4.49 | 2.45 | 10.37 | 16.72 |
| GPT-4o | en | 34.69 | 17.67 | 8.56 | 4.04 | 23.54 | 20.93 |
| GPT-4o-mini | en | 38.29 | 19.75 | 9.76 | 4.29 | 23.94 | 21.30 |
| Qwen2-VL | en | 26.74 | 12.79 | 5.99 | 2.53 | 19.32 | 19.76 |
| LLaVA-OneVision | en | 41.12 | 20.42 | 9.94 | 4.56 | 19.99 | 21.03 |
| LLaVA-Next | en | 27.62 | 13.45 | 6.82 | 3.52 | 17.78 | 20.65 |
| LLaVA-1.5 | en | 35.82 | 17.62 | 8.92 | 4.33 | 16.49 | 20.80 |
| CogVLM2 | en | 30.27 | 14.46 | 6.80 | 3.09 | 19.37 | 19.17 |
| InternLM-XComposer-2.5 | en | 35.17 | 15.91 | 7.00 | 3.02 | 19.99 | 17.95 |
| InternVL2 | en | 25.71 | 12.44 | 5.84 | 2.58 | 19.55 | 19.43 |

Results on Chinese Captioning benchmark.

| Method | Language | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|--------|----------|--------|--------|--------|--------|--------|---------|
| GeoChat | zh | 6.77 | 1.49 | 0.68 | 0.26 | 7.84 | 15.79 |
| GPT-4o | zh | 31.08 | 3.86 | 1.43 | 0.43 | 26.41 | 36.41 |
| GPT-4o-mini | zh | 34.13 | 5.37 | 2.20 | 0.58 | 25.73 | 37.11 |
| Qwen2-VL | zh | 21.80 | 3.50 | 1.41 | 0.33 | 22.92 | 31.04 |
| LLaVA-OneVision | zh | 33.05 | 5.67 | 2.47 | 0.98 | 20.24 | 31.95 |
| LLaVA-Next | zh | 13.01 | 2.10 | 0.82 | 0.20 | 15.12 | 27.72 |
| LLaVA-1.5 | zh | 28.56 | 4.26 | 1.73 | 0.00 | 16.36 | 29.18 |
| CogVLM2 | zh | 19.78 | 2.23 | 0.79 | 0.18 | 22.53 | 28.33 |
| InternLM-XComposer-2.5 | zh | 37.30 | 6.12 | 2.39 | 0.58 | 20.86 | 32.97 |
| InternVL2 | zh | 16.49 | 3.16 | 1.39 | 0.48 | 22.10 | 25.76 |
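The captioning scores above are standard n-gram overlap metrics. As a reference point, the sketch below shows how corpus-level BLEU-1 through BLEU-4 could be computed with NLTK; the whitespace tokenization and smoothing choice are assumptions, and the official evaluation may rely on a different toolkit (e.g., pycocoevalcap) and tokenizer.

```python
# Minimal sketch of corpus-level BLEU-n scoring with NLTK, assuming one reference
# caption per image and simple whitespace tokenization.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_scores(references: list, hypotheses: list) -> dict:
    refs = [[r.split()] for r in references]   # list of reference sets, one per image
    hyps = [h.split() for h in hypotheses]
    smooth = SmoothingFunction().method1
    weights = {"BLEU-1": (1, 0, 0, 0),
               "BLEU-2": (0.5, 0.5, 0, 0),
               "BLEU-3": (1/3, 1/3, 1/3, 0),
               "BLEU-4": (0.25, 0.25, 0.25, 0.25)}
    return {name: 100 * corpus_bleu(refs, hyps, weights=w, smoothing_function=smooth)
            for name, w in weights.items()}
```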

Results on Visual Grounding benchmark.

| # | Benchmark | Metric | GPT-4o | GPT-4o-mini | Qwen2-VL | LLaVA-OneVision | LLaVA-Next | LLaVA-1.5 | CogVLM2 | InternLM-XComposer-2.5 | InternVL2 | GeoChat |
|---|-----------|--------|--------|-------------|----------|-----------------|------------|-----------|---------|-------------------------|-----------|---------|
| 1 | XLRS-Bench-EN | Acc@0.5 | 0.46 | 0.09 | 0.15 | 0.16 | 0.18 | 0.09 | 0.01 | 0.02 | 0.33 | 0.14 |
| 2 | XLRS-Bench-EN | Acc@0.7 | 0.05 | 0.03 | 0.03 | 0.00 | 0.04 | 0.00 | 0.00 | 0.01 | 0.12 | 0.01 |
| 3 | XLRS-Bench-ZH | Acc@0.5 | 0.45 | 0.21 | 0.14 | 0.13 | 0.07 | 0.12 | 0.03 | 0.06 | 0.19 | 0.14 |
| 4 | XLRS-Bench-ZH | Acc@0.7 | 0.03 | 0.03 | 0.01 | 0.01 | 0.02 | 0.02 | 0.00 | 0.00 | 0.06 | 0.01 |
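Acc@0.5 and Acc@0.7 measure the fraction of predicted boxes whose IoU with the ground-truth box reaches the threshold. The sketch below illustrates this computation, assuming (x1, y1, x2, y2) pixel-coordinate boxes; it is not the official evaluation script.

```python
# Sketch of Acc@threshold for visual grounding: share of predictions whose IoU
# with the ground-truth box is at least the threshold (0.5 or 0.7 above).

def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def acc_at(preds, gts, threshold: float) -> float:
    """Fraction of predicted boxes with IoU >= threshold against paired ground truth."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```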

Benchmark

Data Example

All data are freshly collected and human-annotated, with superior resolution, task complexity, and real-world utility.


Example of XLRS-Bench in English. XLRS-Bench focuses on large-size ultra-high-resolution remote sensing imagery, integrating over 10 multimodal perception and reasoning tasks within the same image.

Benchmark Statistics

Task Categories: Our benchmark spans 10 level-2 tasks and 16 level-3 sub-tasks, built on 1,400 high-resolution images with 45,942 annotations.

Experiment Results

Experimental Results on L3 Sub-tasks

Citation


      @article{wang2025xlrsbench,
        title={XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?},
        author={Wang, Fengxiang and Wang, Hongzhen and Chen, Mingshuo and Wang, Di and Wang, Yulin and Guo, Zonghao and Ma, Qiang and Lan, Long and Yang, Wenjing and Zhang, Jing and others},
        journal={arXiv preprint arXiv:2503.23771},
        year={2025}
      }