Towards Generalizable Robotic Manipulation in Dynamic Environments
Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks.
DOMINO Dataset
DOMINO is a large-scale dataset tailored for generalizable dynamic manipulation. It features 35 diverse dynamic tasks across 5 distinct robot embodiments and provides over 110K expert trajectories. The tasks are organized into a three-tiered difficulty hierarchy progressing from predictable low-order dynamics to stochastic and abrupt dynamics.
PUMA Architecture
We propose the Predictive Unified Manipulation Architecture (PUMA) to address the spatiotemporal challenges of dynamic environments. PUMA integrates scene-centric historical optical flow to capture motion cues and employs specialized world queries to implicitly forecast object-centric future states. This design endows the model with a dynamic understanding of the physical world, enabling anticipatory rather than purely reactive interaction with moving objects.
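To make the two ingredients concrete, the following is a minimal PyTorch sketch of how historical optical flow and learnable world queries could be fused, under stated assumptions: all module names, layer sizes, the 7-DoF action head, and the 3-D future-state head are illustrative placeholders, not details from the paper.

```python
import torch
import torch.nn as nn


class DynamicsAwareEncoder(nn.Module):
    """Sketch: fuse scene-centric flow history with learnable world queries.

    Hypothetical design; sizes and heads are illustrative, not PUMA's actual
    implementation.
    """

    def __init__(self, dim=128, num_world_queries=8):
        super().__init__()
        # Encode each 2-channel optical-flow map into a single token.
        self.flow_encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Learnable queries that attend to the flow history and implicitly
        # carry a short-horizon forecast of object-centric future state.
        self.world_queries = nn.Parameter(torch.randn(num_world_queries, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, 7)   # e.g. a 7-DoF action (assumed)
        self.future_head = nn.Linear(dim, 3)   # e.g. future object position (assumed)

    def forward(self, flow_history):
        # flow_history: (B, T, 2, H, W) stack of past optical-flow maps.
        B, T = flow_history.shape[:2]
        tokens = self.flow_encoder(flow_history.flatten(0, 1)).view(B, T, -1)
        queries = self.world_queries.unsqueeze(0).expand(B, -1, -1)
        fused = self.fusion(torch.cat([tokens, queries], dim=1))
        flow_part, query_part = fused[:, :T], fused[:, T:]
        action = self.action_head(flow_part.mean(dim=1))
        future_state = self.future_head(query_part.mean(dim=1))
        return action, future_state
```

The key idea the sketch tries to capture is that the world queries are supervised (or regularized) toward future object states, so the fused representation is forced to anticipate motion rather than merely describe the current frame.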
Performance
PUMA achieves state-of-the-art performance on the DOMINO benchmark, yielding a 6.3% absolute improvement in success rate over strong baselines. It delivers substantial gains on the most challenging dynamic tasks, where existing methods struggle, highlighting its robustness to complex object dynamics.
Generalization of Dynamic Data
Training on dynamic data fosters generalizable spatiotemporal representations. Exposure to dynamic interactions mitigates overfitting to static positional biases and enables effective zero-shot transfer to static environments. Co-training with static data maximizes dynamic manipulation performance by combining stable foundational priors with reactive dexterity.
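One simple way such static/dynamic co-training could be realized is by mixing the two data sources at the batch level. The sketch below is a hypothetical illustration: the 70/30 mixing ratio, the `mixed_batches` helper, and the per-sample Bernoulli sampling scheme are assumptions, not the paper's actual training recipe.

```python
import random


def mixed_batches(dynamic_data, static_data, dynamic_ratio=0.7,
                  batch_size=8, seed=0):
    """Yield batches mixing dynamic and static trajectories.

    Illustrative co-training sampler: each slot in a batch draws from the
    dynamic pool with probability `dynamic_ratio`, else the static pool.
    """
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            pool = dynamic_data if rng.random() < dynamic_ratio else static_data
            batch.append(rng.choice(pool))
        yield batch
```

A per-sample mixture like this keeps every gradient step exposed to both regimes, which matches the intuition above: static samples supply stable foundational priors while dynamic samples train reactive dexterity.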
More Details
Please refer to the full paper for detailed discussions and comprehensive experimental results.
BibTeX
@article{fang2026towards,
  title={Towards Generalizable Robotic Manipulation in Dynamic Environments},
  author={Fang, Heng and Li, Shangru and Wang, Shuhan and Xi, Xuanyang and Liang, Dingkang and Bai, Xiang},
  journal={arXiv preprint arXiv:2603.15620},
  year={2026}
}