SpatiaLab

Can vision–language models perform spatial reasoning in the wild?
ICLR 2026
1,400 visual QA pairs
6 categories • 30 subcategories
MCQ + Open-ended

SpatiaLab: Can Vision–Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez

Computational Intelligence and Operations Laboratory (CIOL) • Shahjalal University of Science and Technology (SUST) • Monash University • Qatar Computing Research Institute (QCRI)

Accepted to The Fourteenth International Conference on Learning Representations (ICLR 2026)

📄 OpenReview · 📄 arXiv · 🤗 Hugging Face · 🤗 HF (Paper) · Kaggle · GitHub

Spatial reasoning is fundamental to human intelligence and real-world embodied AI. SpatiaLab provides a comprehensive evaluation suite (1,400 QA pairs) across six core spatial categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry. It exposes large gaps between SOTA models and humans.

SpatiaLab overview figure

Benchmark Structure and Categorization

Category | Example sub-tasks (5 each)
Relative Positioning | Left/Right, Above/Below, Between, Adjacency, Corner/Angle
Depth & Occlusion | Partial occlusion, Complete occlusion, Layer order, Reflection/visibility, Hidden feature
Orientation | Rotation angle, Facing, Tilt, Tool handedness, Mirror
Size & Scale | Relative size, Scale ratio, Big/Small, Proportion, Size consistency
Spatial Navigation | Path existence, Obstacle avoidance, Turn sequence, Viewpoint visibility, Accessibility
3D Geometry | 3D containment, Intersection, Volume ordering, Pose matching, Stability
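
For programmatic filtering by category and subcategory, the taxonomy above can be written as a plain mapping. This is only a sketch transcribed from the table; the exact label strings used in the released dataset files are an assumption.

```python
# Illustrative taxonomy of SpatiaLab's 6 categories x 5 subcategories,
# transcribed from the table above. The exact label strings in the released
# dataset may differ (assumption for illustration).
SPATIALAB_TAXONOMY = {
    "Relative Positioning": ["Left/Right", "Above/Below", "Between", "Adjacency", "Corner/Angle"],
    "Depth & Occlusion": ["Partial occlusion", "Complete occlusion", "Layer order",
                          "Reflection/visibility", "Hidden feature"],
    "Orientation": ["Rotation angle", "Facing", "Tilt", "Tool handedness", "Mirror"],
    "Size & Scale": ["Relative size", "Scale ratio", "Big/Small", "Proportion", "Size consistency"],
    "Spatial Navigation": ["Path existence", "Obstacle avoidance", "Turn sequence",
                           "Viewpoint visibility", "Accessibility"],
    "3D Geometry": ["3D containment", "Intersection", "Volume ordering", "Pose matching", "Stability"],
}

assert len(SPATIALAB_TAXONOMY) == 6
assert sum(len(v) for v in SPATIALAB_TAXONOMY.values()) == 30  # 30 subcategories
```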

Data Collection and Annotation

Images were collected via web crawling, targeted retrieval, and manual snapshots; a 3-stage annotation and verification process produced the final 1,400 QA items.

Data collection and annotation pipeline figure (Figure 3 in the paper)

Task formats & sample instances

Every subcategory includes MCQ and open-ended forms so we can measure discriminative accuracy and generative reasoning capacity. Representative examples below.

Combined example: MCQ and Open-ended
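
As a rough sketch of what one benchmark instance looks like in its two formats, the records below use hypothetical field names and a made-up question; the actual schema and content of the released files may differ.

```python
# Hypothetical example of one SpatiaLab instance in its two task formats.
# Field names and the concrete question are illustrative, not the dataset's
# actual schema or content.
mcq_item = {
    "category": "Relative Positioning",
    "subcategory": "Between",
    "image_path": "images/example_0001.jpg",
    "question": "Which object is located between the lamp and the sofa?",
    "options": {"A": "coffee table", "B": "bookshelf", "C": "rug", "D": "television"},
    "answer": "A",
}

open_item = {
    "category": "Relative Positioning",
    "subcategory": "Between",
    "image_path": "images/example_0001.jpg",
    "question": "Which object is located between the lamp and the sofa?",
    "reference_answer": "The coffee table sits between the lamp and the sofa.",
}
```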

Annotation & QC

Annotators were trained; items passed a 3-stage review to produce gold-standard annotations. Fleiss’ kappa among annotators: 0.774; LLM judge agreement: Cohen’s kappa ≈ 0.738 (details in appendix).
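
For reference, the snippet below shows the standard Fleiss' and Cohen's kappa computations with statsmodels and scikit-learn on toy labels; the reported values (0.774 and ≈0.738) come from the full annotation data, not from this toy example.

```python
# Standard inter-annotator agreement computations (toy data, not the paper's
# annotation set). Requires: pip install statsmodels scikit-learn
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score

# Rows = items, columns = annotators, values = chosen option index.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 2],
    [0, 0, 1],
    [3, 3, 3],
])

# Fleiss' kappa over all annotators (the paper reports 0.774 on the full data).
table, _ = aggregate_raters(ratings)          # items x categories count matrix
print("Fleiss' kappa:", fleiss_kappa(table))

# Cohen's kappa between LLM-judge and human verdicts (the paper reports ~0.738);
# here both are toy binary correctness labels.
human_verdicts = [1, 0, 1, 1, 0, 1]
judge_verdicts = [1, 0, 1, 0, 0, 1]
print("Cohen's kappa:", cohen_kappa_score(human_verdicts, judge_verdicts))
```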

Results

The tables below are transcribed from the paper.

Multiple Choice Evaluation Accuracy (%) on SpatiaLab-MCQ

Model | 3D Geometry (#238) | Depth & Occlusion (#259) | Orientation (#202) | Relative Positioning (#212) | Size & Scale (#252) | Spatial Navigation (#237) | Overall (#1400)
Random Choice | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00 | 25.00
GPT-4o-mini | 47.06 | 39.00 | 47.03 | 47.17 | 49.60 | 49.79 | 46.50
GPT-5-mini | 48.74 | 54.83 | 60.40 | 62.74 | 44.84 | 56.54 | 54.29
Gemini-2.0-Flash | 47.06 | 55.21 | 53.96 | 58.02 | 54.37 | 46.84 | 52.50
Gemini-2.5-Flash | 44.96 | 48.26 | 48.02 | 56.13 | 42.46 | 51.05 | 48.29
Claude 3.5 Haiku | 42.44 | 42.08 | 46.53 | 46.23 | 35.71 | 45.99 | 42.93
Mistral Medium 3.1 | 46.64 | 49.81 | 47.52 | 61.79 | 41.67 | 41.77 | 47.93
InternVL3.5-1B | 33.61 | 32.43 | 23.27 | 37.26 | 31.75 | 30.80 | 31.64
InternVL3.5-2B | 34.03 | 31.66 | 31.68 | 40.57 | 32.54 | 32.49 | 33.71
Qwen-VL2.5-3B-Instruct | 41.18 | 35.52 | 46.04 | 40.09 | 47.22 | 39.24 | 41.43
InternVL3.5-4B | 42.86 | 42.86 | 42.08 | 54.72 | 36.51 | 42.19 | 43.29
Gemma-3-4B-it | 43.70 | 34.36 | 46.53 | 45.75 | 37.30 | 37.97 | 40.57
Qwen-VL2.5-7B-Instruct | 42.86 | 37.84 | 42.57 | 46.23 | 42.06 | 35.44 | 41.00
Llama-3.2-11B-Vision-Instruct | 26.47 | 30.50 | 20.30 | 42.92 | 30.56 | 32.07 | 30.50
Gemma-3-27B-it | 43.28 | 40.15 | 48.02 | 54.25 | 48.02 | 47.26 | 46.57
Qwen-VL2.5-32B-Instruct | 41.18 | 40.15 | 46.53 | 45.28 | 45.24 | 41.77 | 43.21
InternVL3.5-72B | 50.00 | 57.14 | 53.47 | 66.04 | 49.21 | 54.85 | 54.93
Qwen-VL2.5-72B-Instruct | 47.06 | 48.65 | 51.98 | 54.25 | 43.65 | 48.95 | 48.86
Llama-3.2-90B-Vision-Instruct | 46.22 | 52.12 | 50.50 | 58.96 | 46.83 | 48.52 | 50.36
o4-mini-medium | 51.26 | 58.30 | 54.95 | 64.15 | 40.87 | 51.48 | 53.21
Gemini-2-Flash-Thinking | 37.82 | 41.31 | 41.58 | 45.75 | 50.40 | 43.04 | 43.36
Gemini-2.5-Flash-Thinking | 45.80 | 53.67 | 52.97 | 56.60 | 55.16 | 53.59 | 52.93
SpaceOm | 42.44 | 38.61 | 48.02 | 37.74 | 42.86 | 39.24 | 41.36
SpaceThinker-Qwen2.5VL-3B | 40.34 | 37.84 | 47.03 | 38.21 | 43.25 | 37.97 | 40.64
SpaceQwen2.5-VL-3B-Instruct | 31.51 | 35.14 | 37.62 | 37.74 | 50.79 | 47.26 | 40.14
Human Baseline | 93.70 | 74.13 | 91.58 | 91.51 | 88.89 | 87.76 | 87.57
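
The Overall column appears to be the per-category accuracies weighted by each category's item count; a quick check against the human-baseline row reproduces the reported 87.57%.

```python
# Verify that the Overall column is the item-count-weighted mean of the
# per-category accuracies, using the human-baseline row from the MCQ table.
counts = {"3D Geometry": 238, "Depth & Occlusion": 259, "Orientation": 202,
          "Relative Positioning": 212, "Size & Scale": 252, "Spatial Navigation": 237}
human_mcq = {"3D Geometry": 93.70, "Depth & Occlusion": 74.13, "Orientation": 91.58,
             "Relative Positioning": 91.51, "Size & Scale": 88.89, "Spatial Navigation": 87.76}

total = sum(counts.values())                                   # 1400 items
overall = sum(human_mcq[c] * counts[c] for c in counts) / total
print(round(overall, 2))  # 87.57, matching the table's Overall column
```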

Open-ended Evaluation Accuracy (%) on SpatiaLab-OPEN

Model | 3D Geometry (#238) | Depth & Occlusion (#259) | Orientation (#202) | Relative Positioning (#212) | Size & Scale (#252) | Spatial Navigation (#237) | Overall (#1400)
GPT-4o-mini | 23.53 | 16.60 | 23.27 | 30.66 | 17.86 | 21.94 | 26.00
GPT-5-mini | 45.38 | 34.75 | 37.13 | 49.53 | 42.46 | 37.13 | 40.93
Gemini-2.0-Flash | 31.93 | 24.32 | 27.23 | 31.13 | 26.19 | 24.47 | 27.43
Gemini-2.5-Flash | 34.03 | 26.64 | 31.68 | 38.68 | 26.59 | 29.54 | 30.93
Claude 3.5 Haiku | 26.05 | 18.92 | 24.75 | 25.94 | 20.24 | 21.10 | 22.64
Mistral Medium 3.1 | 25.21 | 19.31 | 21.78 | 29.25 | 15.08 | 16.88 | 21.00
InternVL3.5-1B | 5.88 | 9.65 | 9.90 | 13.68 | 9.13 | 10.13 | 9.64
InternVL3.5-2B | 12.18 | 11.20 | 10.89 | 23.58 | 11.90 | 18.14 | 14.50
Qwen-VL2.5-3B-Instruct | 15.55 | 8.49 | 15.35 | 10.85 | 18.25 | 9.28 | 12.93
InternVL3.5-4B | 19.33 | 17.76 | 15.84 | 19.81 | 16.27 | 18.99 | 18.00
Gemma-3-4B-it | 20.17 | 13.13 | 14.85 | 23.58 | 15.08 | 19.83 | 17.64
Qwen-VL2.5-7B-Instruct | 15.13 | 15.83 | 20.30 | 27.83 | 15.87 | 19.83 | 18.86
Llama-3.2-11B-Vision-Instruct | 16.81 | 16.99 | 22.28 | 25.00 | 13.49 | 18.57 | 18.57
Gemma-3-27B-it | 22.69 | 16.22 | 24.75 | 34.43 | 22.62 | 21.94 | 23.43
InternVL3.5-72B | 22.69 | 20.46 | 20.30 | 31.60 | 19.84 | 26.16 | 23.36
Qwen-VL2.5-72B-Instruct | 26.89 | 20.85 | 25.25 | 30.66 | 24.60 | 20.68 | 24.64
Llama-3.2-90B-Vision-Instruct | 22.69 | 23.17 | 21.29 | 28.30 | 21.83 | 27.00 | 24.00
GLM-4.5V-106B-MoE | 31.09 | 20.46 | 25.25 | 26.42 | 24.21 | 24.47 | 25.21
o4-mini-medium | 40.76 | 32.82 | 32.18 | 42.92 | 44.05 | 34.18 | 37.86
Gemini-2-Flash-Thinking | 31.09 | 27.41 | 31.19 | 34.43 | 29.37 | 29.54 | 30.36
Gemini-2.5-Flash-Thinking | 37.14 | 45.45 | 36.36 | 37.14 | 21.74 | 22.22 | 32.77
SpaceOm | 12.61 | 6.95 | 15.84 | 11.79 | 18.65 | 12.24 | 12.93
SpaceThinker-Qwen2.5VL-3B | 13.45 | 9.27 | 17.82 | 10.38 | 19.44 | 10.13 | 13.36
SpaceQwen2.5-VL-3B-Instruct | 12.61 | 3.86 | 13.86 | 9.43 | 11.90 | 11.39 | 10.36
Human Baseline | 73.53 | 50.19 | 70.30 | 69.81 | 65.48 | 62.87 | 64.93

Key Takeaways

  • Overall Findings:
    • SpatiaLab-MCQ:
      • Model accuracies vary from ~30–55% (random choice = 25%), human baseline = 87.57%.
      • Top models include InternVL3.5-72B (54.93%), GPT-5-mini (54.29%), o4-mini (53.21%), and Gemini-2.5-Flash-Thinking (52.93%).
      • Model scale alone is not determinative (e.g., Llama-3.2-11B = 30.50%).
      • Relative Positioning and Orientation subtasks show the highest performance (above 60% for the strongest models); Spatial Navigation, Depth & Occlusion, and Size & Scale are harder.
      • Spatially specialized models (SpaceOm, SpaceThinker, SpaceQwen) score only in the low 40s (40–41%), showing that specialization does not guarantee higher MCQ accuracy.
      • Depth & Occlusion and Size & Scale performance varies across architectures, highlighting differences in inductive biases or training data.
    • SpatiaLab-Open:
      • Open-ended accuracy ranges from ~9.6% (InternVL3.5-1B) to 40.9% (GPT-5-mini), human baseline = 64.9%.
      • Proprietary and reasoning-tuned models lead (GPT-5-mini 40.93%, o4-mini 37.86%, Gemini-2.5-Flash-Thinking 32.77%).
      • Small open-source models cluster at the bottom (~10–15%), very large models improve modestly (20–25%).
      • Spatial specialists show limited open-ended performance (10–13%), indicating architecture or task specialization alone is insufficient.
      • Orientation and relative position subtasks have comparatively higher performance (GPT-5-mini Rel.Pos = 49.53%, Ori = 37.13%), while depth, size, and navigation remain difficult (<30%).
      • Multi-step grounding and perceptual reasoning are bottlenecks for open-ended generation.
  • Performance Drop in Open-Ended Evaluation Compared to MCQ (a small gap-computation sketch follows this list):
    • Average MCQ→Open-ended gap across 25 models = 23.0% (σ = 5.5%), subtask gaps range 22.89% (navigation) to 24.57% (3D geometry).
    • Specialist models exhibit largest gaps (~27%), especially in navigation (up to 36.68%) and orientation (34.44%).
    • Reasoning-oriented models have smaller gaps (~19%) and lower variance, reflecting stabilizing effect of instruction-tuning and CoT decoding.
    • Negative or near-zero gaps (e.g., Llama-3.2-11B: -1.98 in Depth & Occlusion; o4-mini: -3.18 in Relative Position) indicate MCQ distractors can misrepresent competence.
    • Format sensitivity likely arises from MCQ structure, specialization bias, sequential reasoning challenges, and generative calibration differences.
    • Recommendation: complement MCQs with open-ended evaluations, audit distractors, use stepwise generation or instruction-tuning, and report per-subtask diagnostics.
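
To illustrate how the MCQ→open-ended gap above is obtained from the two result tables, the sketch below takes a few example rows from the Overall columns; the paper's 23.0% average (σ = 5.5%) is computed over all 25 evaluated models, not this subset.

```python
# Compute per-model MCQ -> open-ended accuracy gaps (Overall column) for a few
# example rows from the two tables above; the reported average (23.0, sigma 5.5)
# is over all 25 models, not just this subset.
from statistics import mean, pstdev

overall_mcq = {"GPT-5-mini": 54.29, "InternVL3.5-72B": 54.93,
               "o4-mini-medium": 53.21, "SpaceOm": 41.36}
overall_open = {"GPT-5-mini": 40.93, "InternVL3.5-72B": 23.36,
                "o4-mini-medium": 37.86, "SpaceOm": 12.93}

gaps = {m: overall_mcq[m] - overall_open[m] for m in overall_mcq}
print(gaps)                       # e.g. InternVL3.5-72B drops by 31.57 points
print(mean(gaps.values()), pstdev(gaps.values()))
```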

Error Analysis

  • SPATIALAB-MCQ:
    • VLMs achieve selective peaks but lack holistic spatial competence.
    • Closed-source models reach high scores (e.g., 85.71% in Stacking Orientation) but collapse to near-chance in Relative Size Comparison.
    • Open-source scaling improves ceilings (80.56% in Corner/Angle Positioning) but catastrophic failures remain (2.0% in Object Rotation).
    • Reasoning-augmented and spatially specialized models add localized gains (occlusion inference, navigation) but remain below 55% in many physical abstraction tasks (Gravity Effects, Stability Prediction).
    • Error patterns reveal weaknesses in:
      • Embodied reasoning (Tool Handedness)
      • Recursive relational chaining (Pathway Existence)
      • Non-local cues (Reflective Surfaces)
    • Stronger results in Obstacle Avoidance suggest shortcut exploitation rather than robust planning.
    • Conclusion: Current VLMs rely on surface correlations and lack stable encodings for orientation, physics, and compositional logic. Progress requires richer training distributions and architectural mechanisms for geometric and reference-frame grounding.
  • SPATIALAB-OPEN:
    • Top closed-source models (e.g., GPT-5-mini: 58.14% in Directional Relations) collapse on tasks like Proximity Gradients.
    • Large open-source models show strong Depth & Occlusion performance (e.g., 60%) but fail on Proximity (9.3%).
    • Smaller models exhibit isolated strengths (e.g., 26% in Relative Size Comparison) but often break down completely (0% in Betweenness).
    • Reasoning-tuned models show promise (Gemini-2.5-Flash-Thinking = 75% in Tool Handedness) yet similar models fail in single digits.
    • Specialized spatial models perform worst overall (20%) and frequently fail core relational tasks.
    • Failure patterns highlight systemic weaknesses in:
      • Occlusion handling
      • Orientation
      • Multi-step relational chaining
    • Partial success on size & scale cues indicates reliance on superficial correlations.
    • Root causes: missing spatial analogy representations, architectural limits, brittle reasoning pipelines. Moving forward requires unified geometric encodings, physics-aware reasoning, and embodied data.
  • Qualitative Error Analysis:
    • Failures cluster into recurring classes, not random noise.
    • Common error types:
      • Spatial mislocalization: Confusing referents in crowded scenes.
      • Perspective and scale mistakes: Overreliance on object-size priors.
      • Occlusion/ordering failures: Especially with thin or partially hidden structures.
      • Attribute confusion: Mixing perceptual and functional properties.
      • Ungrounded open-ended reasoning: Fluent narratives not grounded in visual input.
      • Multi-cue integration failures: Struggle to combine depth, relative size, and ordering.
      • Poor confidence calibration: Especially in open-ended generative tasks.
    • Diagnostic findings: Models ignore minimal decisive visual features, rely on brittle heuristics, and fail to update with counterfactual edits.
    • Root causes: insufficient object-centric binding, lack of geometric supervision, training objectives favor plausibility over grounding.
    • Conclusion: Current VLMs achieve strong coarse perception but struggle with multi-cue integration and grounded reasoning, highlighting the need for geometry-aware supervision, multi-scale feature retention, and verification pipelines.

Methods

Images were collected via web crawling, targeted retrieval, and manual capture; annotations were produced by trained annotators under a 3-tier QC process. Evaluation uses direct prompting for MCQ items and an LLM judge with human validation for open-ended items. See the paper and appendices for the exact prompts.
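
A minimal sketch of the two evaluation paths is shown below; the prompt wording, the answer-extraction rule, and the judge verdict format are assumptions for illustration, not the exact templates from the paper's appendix.

```python
# Minimal sketch of the two evaluation paths (illustrative prompts and parsing
# rules; the exact templates are in the paper's appendix).
import re

def build_mcq_prompt(question: str, options: dict) -> str:
    """Direct-prompting template for MCQ items."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    return f"{question}\n{opts}\nAnswer with the letter of the correct option."

def extract_mcq_choice(model_output: str):
    """Pull the first standalone option letter (A-D) from the model output."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return match.group(1) if match else None

def build_judge_prompt(question: str, reference: str, prediction: str) -> str:
    """LLM-judge template for open-ended items; verdicts are validated by humans."""
    return (
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Reply with CORRECT if the model answer matches the reference, else INCORRECT."
    )

def parse_judge_verdict(judge_output: str) -> bool:
    """Treat only an explicit leading CORRECT as a pass."""
    return judge_output.strip().upper().startswith("CORRECT")
```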

Evaluation protocol figure

Performance Improvement Approaches

  • Inherent Reasoning Mechanisms: Models with built-in reasoning consistently outperform baselines across both MCQ and open-ended tasks, with the largest gains in relational and orientation categories (e.g., +13.1% in Relative Positioning for MCQs). While reasoning stabilizes open-ended performance, improvements remain uneven—particularly declining in Size & Scale—indicating that reasoning modules boost logical consistency but do not fully solve grounding or scale sensitivity.
  • Chain-of-Thought (CoT) Prompting: CoT prompting alone provides limited benefits and sometimes reduces accuracy, with orientation tasks showing the only consistent improvement. Step-by-step reasoning aids directional alignment but struggles on perceptually grounded tasks such as depth, scale, and navigation, occasionally amplifying flawed priors.
  • Chain-of-Thought (CoT) Prompting with Self-Reflection: Incorporating self-reflection with CoT yields modest gains in MCQ tasks, particularly for geometry and depth, but does not generalize to open-ended tasks (a minimal prompting sketch follows this list).
  • Supervised Fine-Tuning (SFT): SFT consistently improves MCQ accuracy across all spatial reasoning categories, but transfer to open-ended tasks is limited or negative, suggesting overfitting to task-specific distributions. The drop in generative performance may result from biased internal representations and catastrophic forgetting of linguistic priors during fine-tuning.
  • AI Agents for Spatial Reasoning: SpatioXolver, a multi-agent system adapted from Xolver, is designed for structured spatial reasoning on images in SpatiaLab. Sub-tasks on objects, attributes, relations, or transformations are assigned to specialized agents. Results show strong gains in orientation (+36% open-ended) but declines in depth, occlusion, and navigation.
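
To make the CoT-with-self-reflection setup concrete, here is a minimal two-stage prompt construction; the wording is an assumption for illustration, and the actual prompts used in the experiments are given in the paper.

```python
# Minimal two-stage CoT + self-reflection prompting for an MCQ item
# (illustrative wording; not the exact prompts used in the experiments).
def cot_prompt(question: str, options: dict) -> str:
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    return (
        f"{question}\n{opts}\n"
        "Think step by step about the spatial relations in the image, "
        "then state your final answer as a single option letter."
    )

def reflection_prompt(question: str, options: dict, first_attempt: str) -> str:
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    return (
        f"{question}\n{opts}\n"
        f"Your previous reasoning was:\n{first_attempt}\n"
        "Re-examine the image, check each step for spatial errors "
        "(left/right, depth order, scale), and give a corrected final option letter."
    )

# Usage: send cot_prompt(...) together with the image, then feed the model's own
# output back through reflection_prompt(...) and keep the second answer.
```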

Citation

Please cite the paper as follows:

@inproceedings{
wasi2026spatialab,
title={SpatiaLab: Can Vision{\textendash}Language Models Perform Spatial Reasoning in the Wild?},
author={Azmine Toushik Wasi and Wahid Faisal and Abdur Rahman and Mahfuz Ahmed Anik and Munem Shahriar and Mohsin Mahmud Topu and Sadia Tasnim Meem and Rahatun Nesa Priti and Sabrina Afroz Mitu and Md. Iqramul Hoque and Shahriyar Zaman Ridoy and Mohammed Eunus Ali and Majd Hawasly and Mohammad Raza and Md Rizwan Parvez},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=fWWUPOb0CT}
}
      