GRANT

Abstract

Task scheduling has become increasingly critical for embodied AI, where agents need to follow natural language instructions and execute actions efficiently in 3D physical worlds. Existing datasets for task planning in 3D environments often simplify the problem, lacking operations research knowledge for task scheduling and 3D grounding for real-world applications. In this work, we propose Operations Research Knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization for embodied agents. ORS3D reflects real-world demands by requiring agents to generate efficient, step-by-step schedules that are grounded in 3D space. To facilitate research on ORS3D, we construct a large-scale dataset called ORS3D-60K, comprising 60K tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on the ORS3D-60K dataset validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency.

Contributions

We introduce Operations Research Knowledge-based 3D Grounded Task Scheduling (ORS3D), a new and practical task that meets the common requirement of embodied agents to efficiently complete tasks in the physical world.
To support ORS3D, we construct a large-scale dataset, ORS3D-60K, comprising 60,825 composite tasks across 4,376 real-world scenes. To the best of our knowledge, we are the first to incorporate operations research (OR) knowledge for task scheduling in 3D scenarios.
We propose GRANT, an embodied MLLM with a simple yet effective scheduling token mechanism, integrating task scheduling with multimodal understanding to generate efficient, grounded task execution schedules. Our approach yields a significant 30.53% improvement in task completion time efficiency compared to the baseline method.

Examples from the ORS3D-60K Dataset

Here we present a few examples from the ORS3D-60K dataset using a 3D data explorer. Each composite task consists of several subtasks, each with an expected duration. An agent is expected to generate a step-by-step schedule to complete the task efficiently. In addition, it needs to ground the target object in each step.
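
To make the structure of a composite task concrete, here is a minimal sketch in Python. It is not the ORS3D-60K data format or the GRANT method. The field names, the `needs_agent` flag, and the example subtasks are all hypothetical. The sketch assumes that some subtasks run unattended once started (for example, a washing machine) and that starting them takes no time, so the agent can carry out attended subtasks while they run.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    description: str
    duration_min: float       # expected duration in minutes
    target_object: str        # hypothetical label standing in for the 3D grounding target
    needs_agent: bool = True  # assumption: False means it runs unattended once started


def sequential_time(subtasks):
    """Completion time if the agent does one subtask after another."""
    return sum(s.duration_min for s in subtasks)


def overlapped_time(subtasks):
    """Completion time if all unattended subtasks are started at time 0
    (assumed to take no time to start) and the agent then performs the
    attended subtasks back to back while the others run."""
    attended = sum(s.duration_min for s in subtasks if s.needs_agent)
    longest_unattended = max(
        (s.duration_min for s in subtasks if not s.needs_agent), default=0.0
    )
    return max(attended, longest_unattended)


# Hypothetical composite task, invented for illustration only.
task = [
    Subtask("Run the washing machine", 30.0, "washing machine", needs_agent=False),
    Subtask("Boil water in the kettle", 5.0, "kettle", needs_agent=False),
    Subtask("Wipe the table", 10.0, "table"),
    Subtask("Water the plant", 5.0, "plant"),
]

print(sequential_time(task))  # 50.0
print(overlapped_time(task))  # 30.0
```

Under these assumptions, overlapping the unattended subtasks with the attended ones shortens the schedule from 50 to 30 minutes. A full schedule would also list the ordered steps and a grounded target object for each step.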

To use the data explorer, first select from the available scenes in the selection bar. The composite task requirements and their corresponding efficient step-by-step schedule will be displayed in the right column. Click on a step to visualize its target object with a green mask in the scene.