Published at ICLR 2026

PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

PixelVLA is the first vision-language-action model designed to jointly support pixel-level reasoning and multimodal prompting with both text and visual inputs. It augments existing VLA backbones with pixel-aware perception, prompt-aware encoding, and continuous action prediction for more precise robotic manipulation.

Wenqi Liang1,2, Gan Sun2✉, Yao He2, Jiahua Dong3, Suyan Dai2, Ivan Laptev3, Salman Khan3,4, Yang Cong2
1University of Trento  ·  2South China University of Technology  ·  3MBZUAI  ·  4Australian National University

Why PixelVLA matters

Existing VLAs are still limited by coarse image-level perception and heavy reliance on textual instructions. PixelVLA closes that gap with pixel-aware grounding, visual-prompt conditioning, and continuous action decoding.

01. Pixel-aware understanding. The model injects multiscale pixel information into the token stream for finer spatial grounding.
02. Text + visual prompts. PixelVLA accepts points, lines, regions, and masks as prompts in addition to text.
03. Low-cost improvement. It reports gains of 10.1 to 28.7 points in average success rate over OpenVLA while using only 1.5% of OpenVLA's pretraining cost.
Figure (teaser): PixelVLA moves beyond image-level control by enabling pixel-level scene understanding and flexible visual prompting for robot manipulation.

+28.7 / +10.1: average gain over OpenVLA on Google Robot (Visual Matching / Variant Aggregation)
86.7: average success on the LIBERO benchmark
160K / 6.5M: Pixel-160K episodes / image-text-action triplets
1.5%: pretraining cost relative to OpenVLA

Overview

PixelVLA extends a VLA backbone with pixel-aware grounding, prompt-aware encoding, and a continuous action decoder, then trains it with a two-stage visuomotor instruction-tuning recipe.

The core idea is simple: instead of asking a robot to infer everything from a whole image and a sentence, PixelVLA lets the model use pixel masks and visual prompts such as points, lines, regions, and masks. This makes spatial grounding more precise and makes human-robot interaction more flexible.

To make this possible at scale, the paper introduces Pixel-160K, a pixel-annotated visuomotor instruction-tuning dataset generated with a two-stage automated annotation pipeline.

Pixel-level understanding  ·  Text + visual prompts  ·  Continuous action decoding  ·  Two-stage tuning

Spatial precision

Fine-grained pixel cues improve object-level grounding in cluttered scenes.

Better interaction

Users can indicate targets with masks, points, regions, and lines, not only text.

Stronger transfer

The gains hold across Google Robot, WidowX, and LIBERO benchmarks.

Method

PixelVLA adds three key modules on top of a vision-language backbone: a visual prompt-aware encoder, a multiscale pixel-aware encoder, and a continuous action decoder for 7D robot control.

Figure: Architecture overview. PixelVLA combines a visual prompt-aware encoder, a multiscale pixel-aware encoder, and a continuous action decoder on top of a vision-language backbone.

Visual Prompt-aware Encoder

Encodes points, lines, regions, and masks into prompt-aware embeddings while preserving geometry.


Multiscale Pixel-aware Encoder

Injects fine-grained pixel information from multilevel visual features into the token stream.


Continuous Action Decoder

Predicts continuous 7D actions from LLM hidden states, avoiding coarse discretization loss.
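
To make the "coarse discretization loss" concrete, here is a minimal sketch that rounds a 7D action to uniform bins and measures the round-trip error. The bin count of 256, the [-1, 1] action range, and the example action values are illustrative assumptions, not values taken from the paper.

```python
# Illustrative only: bin count, action range, and action values are assumptions.

def quantize(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous value to the index of its uniform bin."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # the upper edge falls into the last bin


def dequantize(index, low=-1.0, high=1.0, bins=256):
    """Return the bin center, the value a token-based policy would emit."""
    width = (high - low) / bins
    return low + (index + 0.5) * width


def roundtrip_error(action, **kwargs):
    """Per-dimension absolute error introduced by quantize -> dequantize."""
    return [abs(a - dequantize(quantize(a, **kwargs), **kwargs)) for a in action]


# Hypothetical 7D action: xyz translation, roll/pitch/yaw, gripper.
action = [0.0131, -0.2046, 0.5002, 0.0007, -0.9999, 0.3333, 1.0]
errors = roundtrip_error(action)
max_error = max(errors)
half_bin = (1.0 - (-1.0)) / 256 / 2  # worst-case error for in-range values
```

With binning, the error in each dimension can be as large as half a bin width, and it grows as the number of bins shrinks. A head that regresses the action directly is not subject to this floor.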

Pixel-160K Dataset & Annotation Pipeline

The core dataset overview is paired with appendix figures that show episode examples and the full automated annotation pipeline.

Pixel-160K is a pixel-annotated visuomotor instruction-tuning dataset built from existing robot demonstrations. It enriches robot episodes with pixel masks and visual prompts, making it possible to train VLAs for fine-grained grounding.

160K: robot manipulation episodes
6.5M: image-text-action triplets with masks and prompts

In the appendix, the paper further shows how each episode is reformulated to combine textual instruction, pixel annotations, and visual prompts inside one training example.
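
One way to picture such a combined training example is as a small record type. The sketch below is hypothetical: the class names, field names, and validation rules are assumptions made for illustration and do not reflect the released Pixel-160K schema. The camera observation is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Prompt types named on this page; the string labels are assumptions.
PROMPT_KINDS = ("point", "line", "region", "mask")


@dataclass
class VisualPrompt:
    kind: str  # one of PROMPT_KINDS
    coords: List[Tuple[int, int]] = field(default_factory=list)  # (x, y) pixels

    def __post_init__(self):
        if self.kind not in PROMPT_KINDS:
            raise ValueError(f"unknown prompt kind: {self.kind!r}")


@dataclass
class TrainingExample:
    instruction: str               # natural-language task description
    target_mask: List[List[int]]   # binary pixel mask of the target object
    prompt: Optional[VisualPrompt] # None means a text-only example
    action: List[float]            # 7D continuous action label

    def __post_init__(self):
        widths = {len(row) for row in self.target_mask}
        if len(widths) != 1:
            raise ValueError("target_mask must be non-empty and rectangular")
        if len(self.action) != 7:
            raise ValueError("expected a 7D action")


example = TrainingExample(
    instruction="pick up the red block",
    target_mask=[[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0]],
    prompt=VisualPrompt(kind="point", coords=[(1, 1)]),
    action=[0.01, -0.02, 0.0, 0.0, 0.0, 0.05, 1.0],
)
```

Keeping the prompt optional lets text-only and visually prompted examples share one record type.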

Figure: Pixel-160K overview with common skills, common objects, and example initial observations.
Figure: Appendix examples. Each episode combines task instruction, robot states, observations, and visual prompts for pixel-aware tuning.
Figure: Appendix pipeline. Gripper-close extraction, discrete video composition, gripper detection, region proposal generation, target-object reasoning, and annotation generation.
1. Gripper-aware region proposal

The pipeline localizes gripper-close states, composes a discrete video across episodes, segments the gripper, and crops object-relevant regions.

2. Target object reasoning

An LLM parses the natural-language instruction and extracts the first object the robot needs to grasp.

3. Open-vocabulary annotation

Grounding DINO and SAM generate boxes, masks, and visual prompts such as sampled points, random lines, and outer boxes.
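
Two of the steps above lend themselves to a small sketch: locating gripper-close frames in an episode, and deriving point, line, and box prompts from a binary mask. This is only an approximation under stated assumptions. The gripper signal convention (1.0 open, 0.0 closed), the 0.5 threshold, and the sampling scheme are all hypothetical, and the mask is given directly rather than produced by Grounding DINO and SAM.

```python
import random


def gripper_close_onsets(gripper_open, threshold=0.5):
    """Return frame indices where the gripper switches from open to closed.

    Assumes gripper_open[t] is near 1.0 when open and near 0.0 when closed.
    """
    onsets = []
    for t in range(1, len(gripper_open)):
        if gripper_open[t - 1] >= threshold and gripper_open[t] < threshold:
            onsets.append(t)
    return onsets


def prompts_from_mask(mask, n_points=3, seed=0):
    """Derive sampled points, a random line, and the outer box from a mask."""
    rng = random.Random(seed)
    pixels = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not pixels:
        raise ValueError("mask is empty")
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    box = (min(xs), min(ys), max(xs), max(ys))  # (x_min, y_min, x_max, y_max)
    points = rng.sample(pixels, min(n_points, len(pixels)))
    # A "random line" is taken here to be two random endpoints inside the mask.
    if len(pixels) >= 2:
        line = tuple(rng.sample(pixels, 2))
    else:
        line = (pixels[0], pixels[0])
    return {"points": points, "line": line, "box": box}


# Hypothetical episode signal and a tiny 4 x 5 mask.
onsets = gripper_close_onsets([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
prompts = prompts_from_mask(mask)
```

Fixing the seed keeps the sampled prompts reproducible for a given mask.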

Training Pipeline

PixelVLA uses a two-stage visuomotor instruction-tuning procedure: first learn continuous action prediction, then strengthen pixel-level understanding with lightweight adaptation on Pixel-160K.

Stage 1: Continuous Action Training

Initialize from pretrained VLA weights, remove the prompt-aware and pixel-aware modules, freeze most of the backbone, and train the continuous action decoder with L1 regression.

Training data: Fractal + Bridge v2
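
The L1 regression objective in Stage 1 can be written as a mean absolute error over predicted and demonstrated actions. The sketch below uses plain Python lists rather than a deep-learning framework. The function names and the batch values are illustrative assumptions, not the paper's code.

```python
def l1_loss(predicted, target):
    """Mean absolute error over a batch of action vectors."""
    total, count = 0.0, 0
    for p_row, t_row in zip(predicted, target):
        if len(p_row) != len(t_row):
            raise ValueError("action dimensions do not match")
        for p, t in zip(p_row, t_row):
            total += abs(p - t)
            count += 1
    return total / count


def l1_subgradient(predicted, target):
    """Subgradient of l1_loss with respect to each predicted value."""
    count = sum(len(row) for row in predicted)

    def sign(v):
        return (v > 0) - (v < 0)

    return [[sign(p - t) / count for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(predicted, target)]


# Hypothetical batch of two 7D actions.
predicted = [[0.0] * 7, [1.0] * 7]
target = [[0.5] * 7, [1.0] * 7]
loss = l1_loss(predicted, target)
grad = l1_subgradient(predicted, target)
```

The subgradient has constant magnitude wherever the prediction is wrong, so large residuals do not dominate the update the way they would under a squared-error loss.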

Stage 2: Pixel-level Understanding Enhancement

Fine-tune the LLM with LoRA while jointly training the visual prompt-aware encoder, the multiscale pixel-aware encoder, and the continuous action decoder on Pixel-160K.

Goal: better grounding, stronger prompt responsiveness, lower training cost
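
LoRA keeps each pretrained weight matrix frozen and learns a low-rank update beside it, which is what makes the adaptation lightweight. The sketch below shows the standard LoRA forward pass for a single linear layer in plain Python. The matrix sizes, rank, and scaling value are illustrative assumptions, not the paper's configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weights, shape d_out x d_in
    A: trainable down-projection, shape r x d_in
    B: trainable up-projection, shape d_out x r
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter."""
    return r * (d_in + d_out)


# Tiny hypothetical layer: d_in = d_out = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [1.0, 2.0]

# B starts at zero, so the adapted layer initially matches the frozen one.
y_init = lora_forward(x, W, A, [[0.0], [0.0]], alpha=2.0)

# After training, B is nonzero and shifts the output.
y_tuned = lora_forward(x, W, A, [[1.0], [2.0]], alpha=2.0)

# Adapter size versus the full matrix for an assumed 1024 x 1024 layer.
adapter_params = lora_param_count(1024, 1024, r=8)
full_params = 1024 * 1024
```

Only A and B receive gradients, so the number of trained parameters per layer scales with the rank rather than with the full matrix size.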

Results

PixelVLA improves both zero-shot manipulation and adaptation to new robot setups across SimplerEnv and LIBERO.

SimplerEnv · Google Robot

Average success rate
OpenVLA, Visual Matching (VM): 32.7
PixelVLA, Visual Matching (VM): 61.4
OpenVLA, Variant Aggregation (VA): 40.0
PixelVLA, Variant Aggregation (VA): 50.1
Improves over OpenVLA by +28.7 on VM and +10.1 on VA.

SimplerEnv · WidowX

Average grasp / success
OpenVLA: 1.0
PixelVLA: 16.7
π0: 27.1
PixelVLA-π0: 33.8
The π0-based variant reaches 55.1 average grasp and 33.8 overall success.

LIBERO Benchmark

Average success rate
OpenVLA: 76.5
TraceVLA: 78.1
Dita: 82.4
PixelVLA: 86.7
Best average performance across four LIBERO suites, with especially strong gains on long-horizon tasks.
Benchmark | Setting | Baseline | PixelVLA | Gain
SimplerEnv Google Robot | Visual Matching | OpenVLA 32.7 | 61.4 | +28.7
SimplerEnv Google Robot | Variant Aggregation | OpenVLA 40.0 | 50.1 | +10.1
LIBERO | Average success | OpenVLA 76.5 | 86.7 | +10.2
WidowX (π0 backbone) | Average success | π0 27.1 | 33.8 | +6.7
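
The gain column is the absolute difference between the PixelVLA and baseline scores. The short check below recomputes it from the numbers in the table and also derives the relative improvement, which is a different quantity. The dictionary layout and variable names are assumptions made for this sketch.

```python
# (baseline score, PixelVLA score), copied from the table above.
results = {
    ("SimplerEnv Google Robot", "Visual Matching"): (32.7, 61.4),
    ("SimplerEnv Google Robot", "Variant Aggregation"): (40.0, 50.1),
    ("LIBERO", "Average success"): (76.5, 86.7),
    ("WidowX (π0 backbone)", "Average success"): (27.1, 33.8),
}

# Absolute gain in points, rounded to one decimal like the table.
absolute_gain = {
    key: round(ours - base, 1) for key, (base, ours) in results.items()
}

# Relative improvement over the baseline, in percent.
relative_gain = {
    key: round(100.0 * (ours - base) / base, 1)
    for key, (base, ours) in results.items()
}

for key in results:
    benchmark, setting = key
    print(f"{benchmark} / {setting}: "
          f"+{absolute_gain[key]} points, +{relative_gain[key]}% relative")
```

Reporting both makes it clear that the headline gains are absolute differences in success rate rather than relative improvements.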

Robustness takeaway

The gains are not limited to one benchmark or one robot. Pixel-aware grounding and visual prompting make the policy more reliable when perception becomes ambiguous or the environment shifts away from the nominal setup.

Ablation takeaway

The paper also shows that both ingredients matter: continuous action training improves performance, and the pixel-level enhancement stage adds another substantial boost on top of it.

BibTeX

Use the following citation for this work.

@inproceedings{liang2026pixelvla,
  title={PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model},
  author={Liang, Wenqi and Sun, Gan and He, Yao and Dong, Jiahua and Dai, Suyan and Laptev, Ivan and Khan, Salman and Cong, Yang},
  booktitle={International Conference on Learning Representations},
  year={2026}
}