Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings.
To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompt-aware encoder.
To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1% to 28.7% over OpenVLA, while requiring only 1.5% of its pretraining cost.
Overview of the PixelVLA architecture. The model integrates a visual prompt-aware encoder for processing diverse visual prompts, a multiscale pixel-aware encoder that injects pixel-level information into token embeddings, and a continuous action decoder to predict 7D robot actions.
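The data flow described above can be sketched minimally as follows. This is an illustrative stand-in, not the paper's implementation: the embedding size, the pooling-based encoders, and the random linear projections are all hypothetical placeholders for the learned multiscale pixel-aware encoder, visual prompt-aware encoder, and continuous action decoder.

```python
import numpy as np

# Hypothetical dimensions; the paper does not specify these values.
EMB = 32          # token embedding size (assumed)
ACTION_DIM = 7    # 7D action, e.g. end-effector translation, rotation, gripper

rng = np.random.default_rng(0)

def pixel_aware_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the multiscale pixel-aware encoder: pool the image at two
    scales and project the pooled pixel statistics to token embeddings."""
    h, w = image.shape
    coarse = image.reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3))    # (8, 8)
    fine = image.reshape(16, h // 16, 16, w // 16).mean(axis=(1, 3))  # (16, 16)
    tokens = np.concatenate([coarse.reshape(-1, 1), fine.reshape(-1, 1)], axis=0)
    proj = rng.standard_normal((1, EMB))  # fixed random projection for the sketch
    return tokens @ proj                  # (num_tokens, EMB)

def prompt_aware_encode(mask: np.ndarray) -> np.ndarray:
    """Stand-in for the visual prompt-aware encoder: embed a binary visual
    prompt (e.g. a point or box mask) as one extra token."""
    proj = rng.standard_normal((1, EMB))
    return np.array([[mask.mean()]]) @ proj  # (1, EMB)

def decode_action(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the continuous action decoder: mean-pool the token
    sequence, then apply a linear head to produce a 7D action."""
    head = rng.standard_normal((EMB, ACTION_DIM))
    return tokens.mean(axis=0) @ head  # (ACTION_DIM,)

# Toy observation and visual prompt.
image = rng.random((64, 64))
mask = np.zeros((64, 64))
mask[20:30, 20:30] = 1.0

tokens = np.concatenate([pixel_aware_encode(image), prompt_aware_encode(mask)], axis=0)
action = decode_action(tokens)
print(action.shape)  # (7,)
```

The point of the sketch is the interface: pixel-level tokens and prompt tokens share one embedding space, so the action decoder consumes them uniformly, which is what lets the model condition on visual prompts without a textual instruction.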
@inproceedings{liang2026pixelvla,
title={PIXELVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model},
author={Liang, Wenqi and Sun, Gan and He, Yao and Dong, Jiahua and Dai, Suyan and Laptev, Ivan and Khan, Salman and Cong, Yang},
booktitle={International Conference on Learning Representations},
year={2026}
}