Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation

Kuangji Zuo ^* , Gen Li ^* , Bofan Lyu , Yanshuo Lu , Boyu Ma , Shijia Han , Xinyu Zhou , Xichen Yuan , Chuhao Zhou , Jiaqi Bai , Geng Li , Jianfei Yang ^†

MARS Lab, Nanyang Technological University, Singapore

* Equal contribution. † Corresponding author.

Human gaze acts as a natural, low-effort intent interface for human-in-the-loop VLA control, guiding object-level disambiguation, part-level interaction, and dynamic intent steering.


Language tells the robot what to do. Gaze tells it which object or part you mean, without extra commands.

# Why language isn't enough

"Pick up the cup" is underspecified when there are three similar cups, transparent objects, or a tool whose handle, head, and neck imply different actions. Language gives the robot the task, but it often leaves the intended object, part, or target state unresolved.

Human gaze is a natural signal for this missing intent. In everyday collaboration, people disambiguate references by looking, not by issuing extra spatial commands. Gaze2Act brings that HRI cue into VLA manipulation: the user keeps language for the high-level instruction, while gaze supplies the object-level, part-level, and dynamic intent needed for grounded robot action.

# From gaze to action

Gaze2Act keeps language for the high-level task, while gaze specifies the target referent at inference time. It first uses visual foundation models to ground first-person gaze in the robot view without task-specific training or camera calibration, then injects the grounded gaze signal at both the perception level and the action level.

Cross-view gaze grounding

Uses visual foundation models to translate first-person gaze into a robot-view target mask and fixation point, without task-specific training, camera calibration, or external markers.

Perception-level gaze prompting

Renders the grounded gaze signal onto the robot observation, using object contours for coarse target selection and heatmaps for fine-grained interaction cues.
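As a rough illustration of the two prompt styles described above, the sketch below draws a mask contour for coarse target selection and blends a Gaussian heatmap around the fixation point for fine-grained cues. It is a minimal NumPy sketch, not the released implementation: the function names, the green contour colour, the red heatmap tint, and the `sigma` and `alpha` defaults are all assumptions made for illustration.

```python
import numpy as np


def mask_contour(mask: np.ndarray) -> np.ndarray:
    """Boolean map of mask pixels that touch the background (4-neighbourhood)."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is interior when its up, down, left and right neighbours are all in the mask.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return m & ~interior


def gaze_heatmap(shape, fixation_xy, sigma=15.0) -> np.ndarray:
    """Gaussian bump in [0, 1] centred on the fixation point (x, y) in pixels."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    fx, fy = fixation_xy
    return np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2.0 * sigma**2))


def render_gaze_prompt(image, mask, fixation_xy, mode="contour",
                       color=(0, 255, 0), alpha=0.5, sigma=15.0) -> np.ndarray:
    """Overlay the grounded gaze signal on an RGB uint8 robot observation.

    mode="contour": paint the outline of the target mask (coarse selection).
    mode="heatmap": blend a red Gaussian around the fixation point (fine cue).
    Colours, alpha and sigma are illustrative assumptions.
    """
    out = image.astype(np.float32).copy()
    if mode == "contour":
        out[mask_contour(mask)] = color
    elif mode == "heatmap":
        heat = gaze_heatmap(image.shape[:2], fixation_xy, sigma)[..., None]
        red = np.array([255.0, 0.0, 0.0], dtype=np.float32)
        out = (1.0 - alpha * heat) * out + alpha * heat * red
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return np.clip(out, 0, 255).astype(np.uint8)
```

In this sketch the contour mode leaves the object's interior pixels untouched, so the policy still sees the object's appearance, while the heatmap mode concentrates the overlay near the fixated part.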

Action-level gaze conditioning

Feeds gaze-derived spatial features directly into the DiT action head through a zero-initialized decoupled cross-attention branch.
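To make the idea of a zero-initialized decoupled branch concrete, here is a minimal single-head NumPy sketch. It assumes the gaze branch has its own key and value projections, shares the queries with the existing cross-attention, and is added through an output projection that starts at zero, so the layer initially reproduces the base output exactly. Where the zero initialization sits, the class name, and all dimensions are assumptions for illustration; the actual DiT action head may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_k, w_v):
    """Single-head scaled dot-product cross-attention."""
    keys = context @ w_k
    values = context @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


class DecoupledGazeCrossAttention:
    """Base cross-attention plus a separate gaze branch (hypothetical layout)."""

    def __init__(self, dim, ctx_dim, gaze_dim, rng):
        scale = 0.02
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (ctx_dim, dim))
        self.w_v = rng.normal(0.0, scale, (ctx_dim, dim))
        # Gaze branch: its own key/value projections (the "decoupled" part).
        self.w_k_gaze = rng.normal(0.0, scale, (gaze_dim, dim))
        self.w_v_gaze = rng.normal(0.0, scale, (gaze_dim, dim))
        # Assumption: the branch output projection is zero-initialized, so the
        # gaze branch contributes nothing until training moves it away from zero.
        self.w_out_gaze = np.zeros((dim, dim))

    def __call__(self, action_tokens, context, gaze_feats):
        q = action_tokens @ self.w_q
        base = cross_attention(q, context, self.w_k, self.w_v)
        gaze = cross_attention(q, gaze_feats, self.w_k_gaze, self.w_v_gaze)
        return base + gaze @ self.w_out_gaze
```

The appeal of this pattern is that adding the branch does not perturb a pretrained action head at the start of fine-tuning; the gaze pathway only gains influence as its output projection is learned.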

# Evaluation suite

We evaluate Gaze2Act on a real Unitree G1 across seven settings: six static settings, comprising 15 manipulation tasks that cover object-level, compositional, and part-level intent, plus a separate dynamic steering setting.

Ambiguous instances

2 or 3 near-identical objects, one target

Unseen objects

instances never seen in training

Transparent objects

no texture for a detector to grab

Compositional

object and placement target together

Subpart grasping

handle vs. head vs. neck of one tool

Part-conditioned action

the part you look at changes the action

Dynamic steering

switch the target mid-execution

# Static manipulation evaluation

For each task, we report two metrics: intent accuracy (whether the policy reaches the intended target) and task success (whether the full manipulation is completed). For language-derived baselines, precise object descriptions or grounding prompts are provided so that the intended object is identified before policy execution.
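The two metrics can be computed from per-trial outcomes as sketched below. The `Trial` record and its field names are hypothetical, and the sketch assumes the two outcomes are scored independently for each trial; the example values in the usage note are illustrative, not reported results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """Outcome of one rollout (hypothetical record layout)."""
    reached_intended_target: bool  # did the policy go to the intended object or part?
    completed_task: bool           # was the full manipulation completed?


def intent_accuracy(trials: Sequence[Trial]) -> float:
    """Fraction of trials in which the policy reached the intended target."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.reached_intended_target for t in trials) / len(trials)


def task_success(trials: Sequence[Trial]) -> float:
    """Fraction of trials in which the full manipulation was completed."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.completed_task for t in trials) / len(trials)
```

For example, four illustrative trials in which the policy reaches the right target three times but completes the manipulation only twice give an intent accuracy of 0.75 and a task success of 0.5, which is why the two numbers are reported separately.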

Static intent accuracy

Part-level intent accuracy is not applicable to methods without a part-specific condition. Their detector-derived object masks stay the same whether the user intends the handle, head, neck, or body, so these methods can report task success but do not expose a part-level target signal.

Static task success

# Dynamic intent steering

The dynamic steering policy is trained on 57 demonstrations and evaluated on 30 target-switch trials. The robot has already started moving when the user changes intent, so the policy must revise an ongoing action rather than choose a target from rest. The setting is difficult for four reasons: the cups are visually similar; the human's egocentric gaze view and the robot's exocentric camera view differ substantially; nearby and distant cups appear at different image scales; and the moving hand can partially occlude the target during re-grounding.

In this setting, an updated language-derived mask is often not salient enough to redirect an action already in progress. Gaze2Act instead keeps the updated gaze target active through both perception-level prompting and action-level conditioning.

Success after intent switch

Success count over 30 trials where the intended cup changes during execution.

# Ablation: why Gaze2Act needs both pathways

We use Pick Bread Place Bowl and Hammer parts to separate two roles of gaze: selecting object/placement targets and preserving fine-grained contact points. Gaze prompting makes the intended region explicit in the visual input, while gaze conditioning keeps that spatial intent available during action generation. The full model performs best on both tasks, showing that the two pathways are complementary: prompting tells the policy where to look, and conditioning helps it act consistently on that intent.

Ablation settings: baseline, gaze prompting only, conditioning only (random init), conditioning only (zero init), and Gaze2Act full.

Values are success rates converted from 60 trials per ablation setting; no error bars are shown.
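Since the rates come from a fixed number of trials and no error bars are shown, a reader may want a rough sense of the sampling uncertainty. The sketch below converts a success count into a rate and a Wilson score interval for a binomial proportion. This is a standard statistical tool applied here as an illustration, not part of the reported analysis; the function names are hypothetical, and the count of 45 in the usage note is an example, not a result from the ablation.

```python
import math


def success_rate(successes: int, trials: int) -> float:
    """Convert a success count into a rate in [0, 1]."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    return successes / trials


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = success_rate(successes, trials)
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return centre - half, centre + half
```

With 60 trials, a hypothetical count of 45 successes gives a rate of 0.75, and `wilson_interval(45, 60)` returns an interval around that value; differences between settings that fall well inside such intervals should be read with caution.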

# Attention analysis

On a Pick Bread Place Bowl task with distractors, the baseline attention remains spread across the scene even with a specific language instruction. Gaze2Act uses the generic outlined-object instruction, but gaze prompting marks the selected regions and action-level conditioning keeps these spatial cues active during denoising, producing a more concentrated focus over the intended object and placement target.

Attention visualization over a pick-and-place rollout. Top: baseline attention remains diffuse. Bottom: Gaze2Act concentrates attention around the gaze-grounded targets.