Grounded Reinforcement Learning for Visual Reasoning


Carnegie Mellon University



Abstract

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks—including SAT-2 and BLINK for spatial reasoning, V*Bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding—ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL’s performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model’s visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.


ViGoRL Model Outputs

[Figure: example ViGoRL model outputs on three tasks—Spatial Reasoning, Web Action Prediction, and Web Grounding.]
Approach

ViGoRL overview

ViGoRL (Visually Grounded Reinforcement Learning) turns a generic vision–language model into a grounded visual reasoner. We (1) warm-start the model with MCTS-generated reasoning traces that bind every thought to an (x,y) image coordinate, then (2) apply GRPO to reinforce trajectories that are both correct and properly grounded. A final (optional) stage upgrades the model to multi-turn RL, allowing it to request high-resolution crops at its predicted coordinates—mirroring how humans “zoom in” when they need finer detail.

Stage 1 · MCTS-Guided Warm Start

We probe a frozen, high-capacity teacher (Qwen2.5-VL-72B) with the current image × query pair. A Monte Carlo Tree Search expands nodes n = ⟨s, (x, y)⟩ that bundle a textual reasoning step s with an image coordinate (x, y). Back-propagated correctness scores steer the search toward diverse yet successful grounded chains. Root-to-leaf paths are then linearised into 30k high-quality training traces rich in exploration, verification, and back-tracking behaviour—signals that are essentially absent from base VLMs and crowdsourced CoT datasets.
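A minimal sketch of this search loop follows. Here propose_step and score_trace are hypothetical stand-ins for the teacher call and the answer-correctness check, and the UCB constant and simulation budget are illustrative rather than the paper's actual hyperparameters.

import math
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    step: str = ""                           # textual reasoning step s
    coord: tuple[int, int] = (0, 0)          # grounded coordinate (x, y)
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                       # back-propagated correctness

def trace_of(node: Node) -> list[tuple[str, tuple[int, int]]]:
    """Collect the grounded steps on the path from the root to `node`."""
    steps = []
    while node.parent is not None:
        steps.append((node.step, node.coord))
        node = node.parent
    return steps[::-1]

def ucb(node: Node, c: float = 1.4) -> float:
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(propose_step: Callable, score_trace: Callable, n_sims: int = 64):
    root = Node(visits=1)
    for _ in range(n_sims):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: the teacher proposes one more grounded step <s, (x, y)>.
        step, xy = propose_step(trace_of(node))
        node.children.append(Node(step, xy, parent=node))
        node = node.children[-1]
        # Evaluation + backprop: credit all ancestors with trace correctness.
        reward = score_trace(trace_of(node))
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Linearisation: return the most-visited root-to-leaf path as a trace.
    node, path = root, []
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        path.append((node.step, node.coord))
    return path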

Stage 2 · Grounded Reinforcement Learning

Starting from the warm-started policy πθ₀, we optimise with Group Relative Policy Optimisation (GRPO). The reward blends task success with a stringent format bonus: traces receive credit only if every reasoning step includes a valid coordinate and all tags (e.g. <think>, <answer>) are well-formed.
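One plausible composition of this reward is sketched below; the exact tag grammar, coordinate syntax, and bonus weight are assumptions made for illustration, as the text above specifies only that malformed or ungrounded traces receive no credit.

import re

COORD = re.compile(r"\(\s*\d+\s*,\s*\d+\s*\)")

def well_formed(trace: str) -> bool:
    """Stringent all-or-nothing format check on a generated trace."""
    thinks = re.findall(r"<think>(.*?)</think>", trace, flags=re.DOTALL)
    answers = re.findall(r"<answer>(.*?)</answer>", trace, flags=re.DOTALL)
    if not thinks or len(answers) != 1:
        return False
    # Every reasoning step must be anchored to an (x, y) coordinate.
    return all(COORD.search(step) for step in thinks)

def grounded_reward(trace: str, is_correct: bool, fmt_bonus: float = 0.1) -> float:
    if not well_formed(trace):
        return 0.0                       # malformed traces receive no credit
    return (1.0 if is_correct else 0.0) + fmt_bonus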

The resulting model

  • examines 2× more regions than vanilla GRPO,
  • sets grounded sub-goals 15× more often,
  • back-tracks when its current hypothesis fails.

These behaviors translate into strong performance:

  • +12.9 pts over vanilla GRPO on SAT-2,
  • +6.1 pts over direct SFT on ScreenSpot-Pro,
  • +3.0 pts over the prior 7B state of the art on VisualWebArena.

Multi-Turn RL with Zoom-In Feedback

Fine-grained cues (e.g. small GUI text or a background figure) are blurred in the global view. We therefore let the model emit a <tool_call name="crop"> after each coordinate prediction; the environment replies with a high-resolution crop centred on that point. GRPO on these dialog-style traces further lifts accuracy on small-element localisation tasks (+2.4% on ScreenSpot-Pro-LR) and pushes V*Bench to 86.4%.
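The episode structure can be sketched as a simple loop, shown below. The model interface, the coordinate payload inside the tool call, and the crop size / resize resolution are all assumptions made for illustration, not the released implementation.

import re
from PIL import Image

TOOL_CALL = re.compile(
    r'<tool_call name="crop">\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*</tool_call>')

def run_episode(model, image: Image.Image, query: str, max_turns: int = 4) -> str:
    context = [("image", image), ("text", query)]
    out = ""
    for _ in range(max_turns):
        out = model.generate(context)        # one assistant turn (hypothetical API)
        call = TOOL_CALL.search(out)
        if call is None:                     # no tool call => final answer emitted
            break
        x, y = int(call.group(1)), int(call.group(2))
        # Environment reply: high-resolution crop centred on (x, y).
        half = 112
        box = (max(0, x - half), max(0, y - half),
               min(image.width, x + half), min(image.height, y + half))
        crop = image.crop(box).resize((448, 448), Image.LANCZOS)
        context += [("text", out), ("image", crop)]
    return out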

Why Does Grounding Matter?

Behavioral analysis

Grounded RL amplifies the visual behaviours that correlate with out-of-distribution generalisation: wider region exploration, explicit visual verification, and robust back-tracking. The result is a compact open-weight model that matches—or beats—proprietary giants on spatial reasoning, web grounding, and fine-grained visual search, all while producing interpretable, clickable coordinates at every step.

Citation

@article{sarch2025vigorl,
    title={Grounded Reinforcement Learning for Visual Reasoning},
    author={Sarch, Gabriel and Saha, Snigdha and Khandelwal, Naitik and Jain, Ayush and Tarr, Michael J. and Kumar, Aviral and Fragkiadaki, Katerina},
    year={2025}
}