Towards Generalizable Robotic Manipulation in Dynamic Environments
Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks.
DOMINO Dataset
DOMINO is a large-scale dataset tailored for generalizable dynamic manipulation. It features 35 diverse dynamic tasks across 5 distinct robot embodiments and provides over 110K expert trajectories. The tasks are organized into a three-tiered difficulty hierarchy progressing from predictable low-order dynamics to stochastic and abrupt dynamics.
PUMA Architecture
We propose the Predictive Unified Manipulation Architecture (PUMA) to address the spatiotemporal challenges of dynamic environments. PUMA integrates scene-centric historical optical flow to capture motion cues and employs specialized world queries to implicitly forecast object-centric future states. This design endows the model with a dynamic understanding of the physical world, enabling anticipatory rather than purely reactive interaction with moving objects.
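To make the two ingredients concrete, the following is a minimal PyTorch sketch of how historical optical flow and learnable world queries could be fused, under stated assumptions: all module names, layer sizes, the 7-DoF action head, and the 3-D future-state head are illustrative placeholders, not details from the paper.

```python
import torch
import torch.nn as nn


class DynamicsAwareEncoder(nn.Module):
    """Sketch: fuse scene-centric flow history with learnable world queries.

    Hypothetical design; sizes and heads are illustrative, not PUMA's actual
    implementation.
    """

    def __init__(self, dim=128, num_world_queries=8):
        super().__init__()
        # Encode each 2-channel optical-flow map into a single token.
        self.flow_encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Learnable queries that attend to the flow history and implicitly
        # carry a short-horizon forecast of object-centric future state.
        self.world_queries = nn.Parameter(torch.randn(num_world_queries, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, 7)   # e.g. a 7-DoF action (assumed)
        self.future_head = nn.Linear(dim, 3)   # e.g. future object position (assumed)

    def forward(self, flow_history):
        # flow_history: (B, T, 2, H, W) stack of past optical-flow maps.
        B, T = flow_history.shape[:2]
        tokens = self.flow_encoder(flow_history.flatten(0, 1)).view(B, T, -1)
        queries = self.world_queries.unsqueeze(0).expand(B, -1, -1)
        fused = self.fusion(torch.cat([tokens, queries], dim=1))
        flow_part, query_part = fused[:, :T], fused[:, T:]
        action = self.action_head(flow_part.mean(dim=1))
        future_state = self.future_head(query_part.mean(dim=1))
        return action, future_state
```

The key idea the sketch tries to capture is that the world queries are supervised (or regularized) toward future object states, so the fused representation is forced to anticipate motion rather than merely describe the current frame.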
Performance
PUMA achieves state-of-the-art performance on the DOMINO benchmark, yielding a 6.3% absolute improvement in success rate over strong baselines. It delivers substantial gains on the most challenging dynamic tasks, where existing methods struggle, highlighting its robustness to complex object dynamics.
Generalization of Dynamic Data
Training on dynamic data fosters generalizable spatiotemporal representations. Exposure to dynamic interactions mitigates overfitting to static positional biases and enables effective zero-shot transfer to static environments. Co-training with static data maximizes dynamic manipulation performance by combining stable foundational priors with reactive dexterity.
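One simple way such static/dynamic co-training could be realized is by mixing the two data sources at the batch level. The sketch below is a hypothetical illustration: the 70/30 mixing ratio, the `mixed_batches` helper, and the per-sample Bernoulli sampling scheme are assumptions, not the paper's actual training recipe.

```python
import random


def mixed_batches(dynamic_data, static_data, dynamic_ratio=0.7,
                  batch_size=8, seed=0):
    """Yield batches mixing dynamic and static trajectories.

    Illustrative co-training sampler: each slot in a batch draws from the
    dynamic pool with probability `dynamic_ratio`, else the static pool.
    """
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            pool = dynamic_data if rng.random() < dynamic_ratio else static_data
            batch.append(rng.choice(pool))
        yield batch
```

A per-sample mixture like this keeps every gradient step exposed to both regimes, which matches the intuition above: static samples supply stable foundational priors while dynamic samples train reactive dexterity.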
More Details
Please refer to the full paper for detailed discussions and comprehensive experimental results.
BibTeX
@article{fang2026towards,
  title={Towards Generalizable Robotic Manipulation in Dynamic Environments},
  author={Fang, Heng and Li, Shangru and Wang, Shuhan and Xi, Xuanyang and Liang, Dingkang and Bai, Xiang},
  journal={arXiv preprint arXiv:2603.15620},
  year={2026}
}