XLRS-Bench

Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang1, Hongzhen Wang2, Zonghao Guo2, Di Wang3,5, Yulin Wang2, Mingshuo Chen4, Qiang Ma2, Long Lan1, Wenjing Yang1*, Jing Zhang3,6, Zhiyuan Liu2, Maosong Sun2
1College of Computer Science and Technology, National University of Defense Technology, 2Tsinghua University, 3School of Computer Science, Wuhan University, 4Beijing University of Posts and Telecommunications, 5Zhongguancun Academy, 6School of Artificial Intelligence, Wuhan University
CVPR 2025 Highlight

Introduction

   Remote sensing (RS) images have become essential for monitoring and understanding human environments, driving advances in applications such as precision agriculture, urban planning, and disaster assessment. While recent studies have proposed benchmarks and metrics to assess the performance of multimodal large language models (MLLMs) in RS, these efforts remain limited in image size, annotation method, and evaluation dimensions.
   We present XLRS-Bench, a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios, featuring the largest average image size (8,500×8,500 pixels) reported to date. Our dataset encompasses 45,942 annotations across 16 tasks, all expertly curated by a team of 45 experts. The main advantages of XLRS-Bench over existing MLLM benchmarks are as follows:
   1. Ultra-high Resolution. XLRS-Bench offers the largest image sizes available, 10∼20× larger than those of existing datasets, with 840 images at a resolution of 10,000×10,000 pixels.
   2. High-quality Annotation. All annotations are produced with human involvement and manually verified through multiple iterations, resulting in a high-quality benchmark for evaluating MLLMs in real ultra-high-resolution RS scenarios.
   3. Comprehensive Evaluation Dimensions. XLRS-Bench covers 10 perception indicators and 6 reasoning dimensions to assess MLLMs' capabilities, encompassing 16 sub-tasks with a total of 45,942 questions (a minimal scoring sketch follows this list). Notably, XLRS-Bench includes complex reasoning tasks to explore MLLMs' potential for planning and change detection in long spatial-temporal RS scenarios.
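To make the VQA-style evaluation concrete, the snippet below is a minimal Python sketch of scoring multiple-choice answers per sub-task. The annotation file name, its JSON schema (question_id, sub_task, answer keys), and the prediction format are assumptions for illustration only, not the official XLRS-Bench toolkit.

```python
# Minimal sketch of scoring multiple-choice VQA predictions per sub-task.
# The file name "xlrs_bench_vqa.json" and its schema are hypothetical; consult
# the released benchmark files for the real layout.
import json
from collections import defaultdict

def evaluate_vqa(annotation_file: str, predictions: dict) -> dict:
    """Return per-sub-task accuracy given {question_id: predicted option letter}."""
    with open(annotation_file, encoding="utf-8") as f:
        samples = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        task = sample["sub_task"]            # e.g. "Counting", "Planning"
        total[task] += 1
        pred = predictions.get(sample["question_id"], "").strip().upper()
        if pred == sample["answer"]:
            correct[task] += 1

    return {task: correct[task] / total[task] for task in total}
```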

Benchmark Comparison


Comparison of XLRS-Bench with other benchmarks. Annotation types are marked as machine-generated, manually written, or semi-automated (i.e., machine generation followed by human verification).

Leaderboard


Models are ranked by their average performance on perception and reasoning tasks, from highest to lowest. In the VQA tables below, "Counting", "Scene Classification", "Object Spatial Relationship", and "Object Properties" are Perception L2 sub-tasks, while "Complex Reasoning", "Planning", "Spatiotemporal Reasoning", and "Anomaly Reasoning" are Reasoning L2 sub-tasks. "Avg" indicates the weighted average accuracy across all L2 sub-tasks, and the tables are sorted by this overall score.
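As a concrete illustration of how such a weighted average can be computed, here is a minimal Python sketch. It assumes the weighting is by the number of questions per sub-task, and the numbers in the usage example are made up; see the paper for the authoritative evaluation protocol.

```python
# Sketch of the "Avg" column as a question-count-weighted mean of sub-task accuracies.
# Weighting by question count is an assumption, not the documented protocol.

def weighted_average(accuracies: dict, counts: dict) -> float:
    """accuracies: per-sub-task accuracy in [0, 1]; counts: questions per sub-task."""
    total = sum(counts.values())
    return sum(accuracies[t] * counts[t] for t in accuracies) / total

# Made-up example values, not benchmark results:
acc = {"Counting": 0.37, "Scene Classification": 0.48, "Planning": 0.32}
n = {"Counting": 1200, "Scene Classification": 800, "Planning": 500}
print(f"Avg = {weighted_average(acc, n):.4f}")
```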

Results on English VQA benchmark.

| # | Method | Backbone | Lang | Counting | Scene Classification | Object Spatial Relationship | Object Properties | Complex Reasoning | Planning | Spatiotemporal Reasoning | Anomaly Reasoning | Avg |
|---|--------|----------|------|----------|-----------------------|------------------------------|--------------------|--------------------|----------|---------------------------|--------------------|-----|
| 1 | CogVLM2 | Llama3-8B | en | 36.66 | 46.26 | 34.53 | 36.89 | 60.42 | 33.54 | - | 69.67 | 39.58 |
| 2 | Qwen2-VL | Qwen2-7B | en | 39.72 | 48.30 | 30.79 | 35.07 | 64.19 | 31.77 | 45.93 | 68.97 | 38.81 |
| 3 | GPT-4o-mini | Proprietary | en | 28.32 | 45.46 | 29.74 | 36.29 | 55.13 | 40.62 | 15.19 | 72.86 | 38.24 |
| 4 | LLaVA-OneVision | Qwen2-7B | en | 35.84 | 47.76 | 26.96 | 35.95 | 60.61 | 24.42 | 37.78 | 72.06 | 38.09 |
| 5 | InternVL2 | InternLM2.5-7B | en | 33.91 | 44.93 | 22.10 | 34.84 | 66.43 | 33.01 | 44.44 | 74.18 | 37.21 |
| 6 | InternLM-XComposer-2.5 | InternLM2-7B | en | 35.84 | 47.43 | 28.74 | 30.60 | 64.76 | 35.75 | 32.22 | 69.85 | 35.76 |
| 7 | LLaVA-Next | Llama3-8B | en | 38.00 | 40.79 | 31.10 | 31.22 | 62.04 | 26.19 | 32.22 | 69.67 | 35.48 |
| 8 | GPT-4o | Proprietary | en | 29.51 | 48.55 | 29.57 | 24.78 | 52.36 | 42.65 | 21.85 | 71.00 | 31.73 |
| 9 | LLaVA-1.5 | Vicuna-7B | en | 22.65 | 19.69 | 23.22 | 20.21 | 33.86 | 38.67 | 29.26 | 37.49 | 22.93 |
| 10 | GeoChat | Vicuna-7B | en | 22.65 | 18.96 | 23.64 | 20.21 | 30.95 | 32.57 | - | 34.39 | 22.18 |

Results on Chinese VQA benchmark.

| # | Method | Backbone | Lang | Counting | Scene Classification | Object Spatial Relationship | Object Properties | Complex Reasoning | Planning | Spatiotemporal Reasoning | Anomaly Reasoning | Avg |
|---|--------|----------|------|----------|-----------------------|------------------------------|--------------------|--------------------|----------|---------------------------|--------------------|-----|
| 1 | Qwen2-VL | Qwen2-7B | zh | 39.49 | 49.28 | 33.26 | 37.89 | 67.62 | 24.34 | 44.44 | 76.57 | 41.10 |
| 2 | InternVL2 | InternLM2.5-7B | zh | 34.42 | 39.12 | 34.24 | 37.54 | 68.57 | 40.09 | 44.07 | 76.57 | 40.58 |
| 3 | LLaVA-OneVision | Qwen2-7B | zh | 37.48 | 47.06 | 31.73 | 34.72 | 62.56 | 29.12 | 30.37 | 74.80 | 38.42 |
| 4 | GPT-4o-mini | Proprietary | zh | 29.51 | 44.13 | 29.59 | 35.41 | 55.36 | 41.86 | 21.85 | 74.71 | 37.82 |
| 5 | InternLM-XComposer-2.5 | InternLM2-7B | zh | 37.70 | 43.15 | 32.62 | 33.40 | 62.99 | 29.47 | 24.44 | 69.14 | 37.26 |
| 6 | CogVLM2 | Llama3-8B | zh | 36.21 | 45.79 | 34.57 | 31.03 | 59.56 | 26.19 | - | 70.47 | 35.84 |
| 7 | LLaVA-Next | Llama3-8B | zh | 33.08 | 39.52 | 31.98 | 29.34 | 54.07 | 21.59 | 25.56 | 69.85 | 33.48 |
| 8 | GPT-4o | Proprietary | zh | 22.28 | 45.47 | 31.25 | 24.93 | 45.45 | 27.08 | 15.19 | 69.41 | 30.40 |
| 9 | LLaVA-1.5 | Vicuna-7B | zh | 22.65 | 19.07 | 23.01 | 20.21 | 33.95 | 38.67 | 29.26 | 37.58 | 22.86 |
| 10 | GeoChat | Vicuna-7B | zh | 22.65 | 19.17 | 23.05 | 20.21 | 24.75 | 22.79 | - | 23.74 | 20.99 |

Results on English Captioning benchmark.

| Method | Language | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|--------|----------|--------|--------|--------|--------|--------|---------|
| GeoChat | en | 16.74 | 8.38 | 4.49 | 2.45 | 10.37 | 16.72 |
| GPT-4o | en | 34.69 | 17.67 | 8.56 | 4.04 | 23.54 | 20.93 |
| GPT-4o-mini | en | 38.29 | 19.75 | 9.76 | 4.29 | 23.94 | 21.30 |
| Qwen2-VL | en | 26.74 | 12.79 | 5.99 | 2.53 | 19.32 | 19.76 |
| LLaVA-OneVision | en | 41.12 | 20.42 | 9.94 | 4.56 | 19.99 | 21.03 |
| LLaVA-Next | en | 27.62 | 13.45 | 6.82 | 3.52 | 17.78 | 20.65 |
| LLaVA-1.5 | en | 35.82 | 17.62 | 8.92 | 4.33 | 16.49 | 20.80 |
| CogVLM2 | en | 30.27 | 14.46 | 6.80 | 3.09 | 19.37 | 19.17 |
| InternLM-XComposer-2.5 | en | 35.17 | 15.91 | 7.00 | 3.02 | 19.99 | 17.95 |
| InternVL2 | en | 25.71 | 12.44 | 5.84 | 2.58 | 19.55 | 19.43 |

Results on Chinese Captioning benchmark.

| Method | Language | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|--------|----------|--------|--------|--------|--------|--------|---------|
| GeoChat | zh | 6.77 | 1.49 | 0.68 | 0.26 | 7.84 | 15.79 |
| GPT-4o | zh | 31.08 | 3.86 | 1.43 | 0.43 | 26.41 | 36.41 |
| GPT-4o-mini | zh | 34.13 | 5.37 | 2.20 | 0.58 | 25.73 | 37.11 |
| Qwen2-VL | zh | 21.80 | 3.50 | 1.41 | 0.33 | 22.92 | 31.04 |
| LLaVA-OneVision | zh | 33.05 | 5.67 | 2.47 | 0.98 | 20.24 | 31.95 |
| LLaVA-Next | zh | 13.01 | 2.10 | 0.82 | 0.20 | 15.12 | 27.72 |
| LLaVA-1.5 | zh | 28.56 | 4.26 | 1.73 | 0.00 | 16.36 | 29.18 |
| CogVLM2 | zh | 19.78 | 2.23 | 0.79 | 0.18 | 22.53 | 28.33 |
| InternLM-XComposer-2.5 | zh | 37.30 | 6.12 | 2.39 | 0.58 | 20.86 | 32.97 |
| InternVL2 | zh | 16.49 | 3.16 | 1.39 | 0.48 | 22.10 | 25.76 |
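The captioning scores above are standard n-gram overlap metrics. As a reference point, the sketch below shows how corpus-level BLEU-1 through BLEU-4 could be computed with NLTK; the whitespace tokenization and smoothing choice are assumptions, and the official evaluation may rely on a different toolkit (e.g., pycocoevalcap) and tokenizer.

```python
# Minimal sketch of corpus-level BLEU-n scoring with NLTK, assuming one reference
# caption per image and simple whitespace tokenization.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_scores(references: list, hypotheses: list) -> dict:
    refs = [[r.split()] for r in references]   # list of reference sets, one per image
    hyps = [h.split() for h in hypotheses]
    smooth = SmoothingFunction().method1
    weights = {"BLEU-1": (1, 0, 0, 0),
               "BLEU-2": (0.5, 0.5, 0, 0),
               "BLEU-3": (1/3, 1/3, 1/3, 0),
               "BLEU-4": (0.25, 0.25, 0.25, 0.25)}
    return {name: 100 * corpus_bleu(refs, hyps, weights=w, smoothing_function=smooth)
            for name, w in weights.items()}
```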

Results on Visual Grounding benchmark.

| # | Benchmark | Metric | GPT-4o | GPT-4o-mini | Qwen2-VL | LLaVA-OneVision | LLaVA-Next | LLaVA-1.5 | CogVLM2 | InternLM-XComposer-2.5 | InternVL2 | GeoChat |
|---|-----------|--------|--------|-------------|----------|-----------------|------------|-----------|---------|-------------------------|-----------|---------|
| 1 | XLRS-Bench-EN | Acc@0.5 | 0.46 | 0.09 | 0.15 | 0.16 | 0.18 | 0.09 | 0.01 | 0.02 | 0.33 | 0.14 |
| 2 | XLRS-Bench-EN | Acc@0.7 | 0.05 | 0.03 | 0.03 | 0.00 | 0.04 | 0.00 | 0.00 | 0.01 | 0.12 | 0.01 |
| 3 | XLRS-Bench-ZH | Acc@0.5 | 0.45 | 0.21 | 0.14 | 0.13 | 0.07 | 0.12 | 0.03 | 0.06 | 0.19 | 0.14 |
| 4 | XLRS-Bench-ZH | Acc@0.7 | 0.03 | 0.03 | 0.01 | 0.01 | 0.02 | 0.02 | 0.00 | 0.00 | 0.06 | 0.01 |
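Acc@0.5 and Acc@0.7 measure the fraction of predicted boxes whose IoU with the ground-truth box reaches the threshold. The sketch below illustrates this computation, assuming (x1, y1, x2, y2) pixel-coordinate boxes; it is not the official evaluation script.

```python
# Sketch of Acc@threshold for visual grounding: share of predictions whose IoU
# with the ground-truth box is at least the threshold (0.5 or 0.7 above).

def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def acc_at(preds, gts, threshold: float) -> float:
    """Fraction of predicted boxes with IoU >= threshold against paired ground truth."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```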

Benchmark

Data Example

All data are freshly collected and human-annotated, with superior resolution, task complexity, and real-world utility.


Example of XLRS-Bench in English. XLRS-Bench focuses on large-size ultra-high-resolution remote sensing imagery, integrating over 10 multimodal perception and reasoning tasks within the same image.

Benchmark Statistics

Task Categories: Our benchmark spans 10 level-2 tasks and 16 level-3 sub-tasks, built on 1,400 high-resolution images with 45,942 annotations.

Experiment Results

Experimental Results on L3 Sub-tasks

Citation


      @article{wang2025xlrsbench,
        title={XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?},
        author={Wang, Fengxiang and Wang, Hongzhen and Chen, Mingshuo and Wang, Di and Wang, Yulin and Guo, Zonghao and Ma, Qiang and Lan, Long and Yang, Wenjing and Zhang, Jing and others},
        journal={arXiv preprint arXiv:2503.23771},
        year={2025}
      }