Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang1, Shangru Li1, Shuhan Wang1, Xuanyang Xi2, Dingkang Liang1,†, Xiang Bai1
1Huazhong University of Science and Technology, 2Huawei Technologies Co. Ltd
†Project Lead
(Figure: DOMINO teaser)

DOMINO introduces a large-scale benchmark for dynamic manipulation, while PUMA couples historical motion cues with future state anticipation to achieve highly reactive embodied intelligence.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks.

Qualitative Demonstrations

Adjust Bottle (Aloha-AgileX, Level 1, Clean env)
Beat Block Hammer (Aloha-AgileX, Level 2, Clean env)
Click Alarmclock (Aloha-AgileX, Level 3, Clean env)
Click Bell (Aloha-AgileX, Level 1, Random env)
Dump Bin Bigbin (Aloha-AgileX, Level 2, Random env)
Grab Roller (Aloha-AgileX, Level 3, Random env)
Move Can Pot (Franka-Panda, Level 1, Clean env)
Move Playingcard Away (Franka-Panda, Level 2, Clean env)
Move Stapler Pad (Franka-Panda, Level 3, Clean env)
Place A2B Left (Piper, Level 1, Clean env)
Place A2B Right (Piper, Level 2, Clean env)
Place Mouse Pad (Piper, Level 3, Clean env)
Place Bread Basket (UR5-Wsg, Level 1, Clean env)
Place Bread Skillet (UR5-Wsg, Level 2, Clean env)
Place Can Basket (UR5-Wsg, Level 3, Clean env)
Handover Block (ARX-X5, Level 1, Clean env)
Handover Mic (ARX-X5, Level 2, Clean env)
Hanging Mug (ARX-X5, Level 3, Clean env)

DOMINO Dataset

DOMINO is a large-scale dataset tailored for generalizable dynamic manipulation. It features 35 diverse dynamic tasks across 5 distinct robot embodiments and provides over 110K expert trajectories. The tasks are organized into a three-tiered difficulty hierarchy progressing from predictable low-order dynamics to stochastic and abrupt dynamics.

(Figure: DOMINO dataset statistics)
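The three-tiered difficulty hierarchy can be pictured as a simple task index. The sketch below is illustrative only: the task names are taken from the demo grid above, but the tier assignments shown, the `Task` record layout, and the `by_level` helper are our own assumptions, not DOMINO's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str        # task identifier (names from the demo grid)
    embodiment: str  # one of the 5 robot platforms
    level: int       # 1 = predictable low-order dynamics ... 3 = stochastic/abrupt

# A tiny illustrative slice of the 35-task catalog (tier labels assumed).
TASKS = [
    Task("adjust_bottle", "Aloha-AgileX", 1),
    Task("beat_block_hammer", "Aloha-AgileX", 2),
    Task("click_alarmclock", "Aloha-AgileX", 3),
    Task("move_can_pot", "Franka-Panda", 1),
    Task("handover_block", "ARX-X5", 1),
]

def by_level(tasks, level):
    """Filter tasks belonging to one difficulty tier."""
    return [t for t in tasks if t.level == level]

print([t.name for t in by_level(TASKS, 1)])
# → ['adjust_bottle', 'move_can_pot', 'handover_block']
```

Grouping by tier in this way is how one would run the benchmark's level-wise evaluation splits.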

PUMA Architecture

We propose the Predictive Unified Manipulation Architecture (PUMA) to address the spatiotemporal challenges of dynamic environments. PUMA integrates scene-centric historical optical flow to capture motion cues and employs specialized world queries to implicitly forecast object-centric future states. This design endows the model with a dynamic understanding of the physical world for anticipatory interactions.

(Figure: PUMA method overview)
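The core mechanism, learned "world queries" reading out future-state information from historical flow and current visual tokens, can be sketched as a single cross-attention step. This is a minimal NumPy illustration of that idea, not PUMA's implementation: all shapes, token counts, and the random features are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries attend over key/value tokens."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (num_q, num_kv)
    return softmax(scores) @ values         # (num_q, d)

rng = np.random.default_rng(0)
d = 32
# Scene-centric history: per-patch optical-flow features from past frames
# (hypothetical dimensions; the paper does not specify them here).
flow_tokens = rng.normal(size=(16, d))
# Current single-frame visual tokens.
visual_tokens = rng.normal(size=(16, d))
# Learned world queries that implicitly forecast object-centric future states.
world_queries = rng.normal(size=(4, d))

context = np.concatenate([flow_tokens, visual_tokens], axis=0)   # (32, d)
future_state = cross_attention(world_queries, context, context)  # (4, d)
print(future_state.shape)  # → (4, 32)
```

The readout vectors would then condition the action head, coupling history-aware perception with short-horizon prediction as described above.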

Performance

PUMA achieves state-of-the-art performance on the DOMINO benchmark, yielding a 6.3% absolute improvement in success rate over strong baselines. Its gains are largest on the most challenging dynamic tasks, where existing methods struggle, highlighting its robustness to complex object dynamics.

(Figures: performance table and performance radar chart)

Generalization of Dynamic Data

Training on dynamic data fosters generalizable spatiotemporal representations. Exposure to dynamic interactions mitigates overfitting to static positional biases and enables effective zero-shot transfer to static environments. Co-training with static data maximizes dynamic manipulation performance by combining stable foundational priors with reactive dexterity.

(Figure: generalization of dynamic data)
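The co-training strategy amounts to drawing each training sample from either the dynamic or the static pool with a fixed mixing ratio. The sketch below illustrates that sampling scheme only; the mixing probability and trajectory names are hypothetical, not values from the paper.

```python
import random

def cotrain_samples(dynamic, static, p_dynamic=0.7, steps=10, seed=0):
    """Draw each sample from the dynamic pool with probability p_dynamic,
    otherwise from the static pool (p_dynamic is an illustrative value)."""
    rng = random.Random(seed)
    for _ in range(steps):
        pool = dynamic if rng.random() < p_dynamic else static
        yield rng.choice(pool)

# Hypothetical trajectory identifiers standing in for the two data sources.
dyn = ["dyn_traj_%d" % i for i in range(3)]
sta = ["sta_traj_%d" % i for i in range(3)]
batch = list(cotrain_samples(dyn, sta))
print(len(batch))  # → 10
```

Skewing the ratio toward dynamic data preserves reactive dexterity while the static share supplies the stable foundational priors noted above.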

More Details

Please refer to the full paper for detailed discussions and comprehensive experimental results.

BibTeX

@article{fang2026towards,
  title={Towards Generalizable Robotic Manipulation in Dynamic Environments},
  author={Fang, Heng and Li, Shangru and Wang, Shuhan and Xi, Xuanyang and Liang, Dingkang and Bai, Xiang},
  journal={arXiv preprint arXiv:2603.15620},
  year={2026}
}