Multimodal and reasoning LLMs supersize training data for dexterous robotic tasks
For robots, simulation is a great teacher for learning long-horizon (multi-step) tasks — especially compared to how long it takes to collect real-world training data.
Simulating digital actions to teach robots new tasks is also time-consuming for humans, though. Cutting that human effort in half, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) PhD student Lirui Wang and his colleagues’ new “GenSim2” framework uses multimodal large language models (LLMs that process and produce text, images, and other media) and reasoning LLMs to supersize training data for robots. The researchers combined the powers of multimodal LLM GPT-4V (which can draw better inferences about text and images) and reasoning LLM OpenAI o1 (which can “think” before answering) to take ten real-world videos of tasks and generate 100 new, simulated action videos.
GenSim2 can then convert task names into task descriptions and then to task code, which can be simulated into a sequence of actions for a robot to execute. The approach could eventually assist home robots with tasks like figuring out each step needed to reheat your breakfast, including opening a microwave and placing bread in a toaster. It could also help in manufacturing and logistics settings one day, where a machine may need to transport new materials in several steps.
This framework is a sequel to Wang’s earlier work, “GenSim,” which used LLMs to encode new pick-and-place tasks for robots. He wanted to expand his approach to more dexterous activities with more complex object categories, like opening a box or closing a safe.
“To plan these more complicated chores in robotics, we need to figure out how to solve them,” says Wang. “This planning problem was not present in GenSim, since the tasks were much simpler, so we only needed ‘blind’ LLMs. With GenSim2, we integrated the logic model GPT-4V, which teaches multimodal models to ‘see’ by analyzing image inputs with better reasoning skills. Now, we can code the simulation task, and then generate plans in seconds.”
The nuts and bolts of GenSim2
First, you prompt an LLM like GPT-4 to generate a novel task plan like “place a ball in a box,” including images, assets, and keypoints (specific points in an image). From there, GPT-4V reviews these details and concisely encodes which poses and actions are needed to execute the task. Humans can give GPT-4V feedback on this plan, and it then refines its outline. Finally, a motion planner simulates those actions and renders them as videos, generating new training data for the robot.
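The steps above can be sketched as a small pipeline. This is a minimal illustration rather than the GenSim2 code: every name in it (TaskPlan, propose_task, encode_plan, simulate, run_pipeline) is hypothetical, and the LLM calls and the motion planner are replaced by simple stand-in functions so the flow of propose, encode, review, refine, and simulate is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskPlan:
    """Hypothetical container for a generated task (not a GenSim2 type)."""
    name: str
    keypoints: List[str]
    steps: List[str] = field(default_factory=list)


def propose_task(name: str) -> TaskPlan:
    """Stand-in for the LLM that proposes a task with assets and keypoints."""
    return TaskPlan(name=name, keypoints=["ball_center", "box_opening"])


def encode_plan(plan: TaskPlan, feedback: List[str]) -> TaskPlan:
    """Stand-in for the multimodal model that turns keypoints into steps.

    Human feedback is simply appended as extra steps, a deliberately
    crude model of "refining the outline".
    """
    steps = [f"move gripper to {kp}" for kp in plan.keypoints]
    steps.extend(feedback)
    return TaskPlan(plan.name, plan.keypoints, steps)


def simulate(plan: TaskPlan) -> List[str]:
    """Stand-in for the motion planner: one labeled frame per step."""
    return [f"{plan.name}: frame {i} -> {step}"
            for i, step in enumerate(plan.steps)]


def run_pipeline(name: str,
                 reviewer: Callable[[TaskPlan], List[str]]) -> List[str]:
    """Propose, encode, collect human feedback, refine, then simulate."""
    plan = encode_plan(propose_task(name), feedback=[])
    notes = reviewer(plan)  # human verification step
    if notes:
        plan = encode_plan(plan, feedback=notes)
    return simulate(plan)


# The reviewer here is a stub that always asks for one extra step.
frames = run_pipeline("place a ball in a box",
                      lambda plan: ["release ball above box"])
for frame in frames:
    print(frame)
```

In a real system the reviewer would be a person checking the plan, and each stand-in would call a model or a physics simulator instead of returning canned strings.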
To convert these plans into actions, the researchers also designed a new architecture called the “proprioceptive point-cloud transformer” (PPT). PPT converts language, point cloud (data points within a 3D space), and proprioception inputs into a final action sequence. This allows a robot to learn to imitate video simulations and generalize to objects it hasn’t seen before.
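To make the idea of fusing three input types more concrete, here is a toy sketch in plain Python. It is not the PPT architecture, which the article does not detail: the encoders, the token width DIM, the attention pooling, and the random weights standing in for learned ones are all assumptions chosen for illustration. It only shows the general pattern of turning language, point-cloud, and proprioception inputs into tokens, fusing them, and projecting the result into a short action sequence.

```python
import math
import random
from typing import List

Vector = List[float]
DIM = 4  # hypothetical token width shared by all three modalities


def embed_language(instruction: str) -> Vector:
    """Toy text encoder: hash each word into a fixed-width count vector."""
    vec = [0.0] * DIM
    for word in instruction.lower().split():
        vec[sum(map(ord, word)) % DIM] += 1.0
    return vec


def embed_point_cloud(points: List[Vector]) -> Vector:
    """Summarize an (x, y, z) cloud by its centroid plus the point count."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    return centroid + [float(n)]


def embed_proprioception(joint_angles: Vector) -> Vector:
    """Pad or truncate joint angles to the shared token width."""
    return (joint_angles + [0.0] * DIM)[:DIM]


def attention_pool(tokens: List[Vector], query: Vector) -> Vector:
    """Softmax-weighted average of tokens, scored against a query vector."""
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(DIM)
              for tok in tokens]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * tok[i] for w, tok in zip(weights, tokens))
            for i in range(DIM)]


def predict_actions(instruction: str, points: List[Vector],
                    joint_angles: Vector, horizon: int = 3,
                    seed: int = 0) -> List[Vector]:
    """Fuse the three inputs and emit `horizon` joint-space actions."""
    rng = random.Random(seed)  # random weights stand in for learned ones
    tokens = [embed_language(instruction),
              embed_point_cloud(points),
              embed_proprioception(joint_angles)]
    query = [rng.uniform(-1, 1) for _ in range(DIM)]
    fused = attention_pool(tokens, query)
    actions = []
    for _ in range(horizon):
        head = [[rng.uniform(-1, 1) for _ in range(DIM)]
                for _ in range(len(joint_angles))]
        actions.append([sum(w * f for w, f in zip(row, fused))
                        for row in head])
    return actions


actions = predict_actions("open the laptop",
                          [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                          [0.1, 0.2, 0.3])
print(len(actions), "actions, each with", len(actions[0]), "joint targets")
```

A trained model would learn the encoders and the output head from demonstrations; the fixed seed here only makes the toy output repeatable.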
Lights, camera, action plan!
GenSim2’s souped-up approach generated data for 100 articulated tasks with 200 objects. Among these, the system simulated 50 long-horizon tasks, such as securing gold in a safe and preparing breakfast. Compared to “RoboGen,” a generative robotic agent used as the baseline, GenSim2 had a 20% better success rate with generating and planning primitive tasks, while also being more reliable with long-horizon ones. The researchers note that multimodal models that can reason about visual inputs gave GenSim2 the edge.
Another intriguing finding: it took humans only about four minutes on average to verify robotic plans, half as long as it took them to design a task manually. Human effort included labeling keypoints in the motion planner and giving feedback to help the multimodal language model improve its plans.
In real-world experiments, GenSim2 successfully helped plan tasks for a robot, like opening a laptop and closing a drawer. When its robotic policy was trained on both simulation and real data, the framework had a better success rate than when it trained on either one alone. This reduces the effort required to collect large amounts of data in the real world.
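One common way to train on both kinds of data is to draw each training batch from the two pools at a fixed ratio. The sketch below shows that idea only. The article does not say how GenSim2 mixes its data, so the function name sample_mixed_batch, the real_fraction parameter, and the 25% default are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

Demo = Tuple[str, str]  # (source, demonstration id)


def sample_mixed_batch(sim: Sequence[str], real: Sequence[str],
                       batch_size: int = 8, real_fraction: float = 0.25,
                       seed: int = 0) -> List[Demo]:
    """Draw one batch containing a fixed share of real-world demos."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_sim = batch_size - n_real
    # Sample with replacement so a small real-world pool can still be used.
    batch = [("real", rng.choice(real)) for _ in range(n_real)]
    batch += [("sim", rng.choice(sim)) for _ in range(n_sim)]
    rng.shuffle(batch)
    return batch


sim_demos = [f"sim_{i}" for i in range(100)]   # plentiful simulated demos
real_demos = [f"real_{i}" for i in range(10)]  # scarce real-world demos
batch = sample_mixed_batch(sim_demos, real_demos)
print(batch)
```

Sampling with replacement lets a handful of real demonstrations appear in every batch even when simulated demonstrations vastly outnumber them.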
While GenSim2 is a more intricate, advanced follow-up to its predecessor, the researchers note that they’d like it to plan and simulate robotic tasks with even less human intervention. Currently, it struggles to reliably create and code meaningful tasks on its own.
Wang also notes that while GenSim2 is a step forward in automated task generation, the researchers intend to make the system more advanced. To do this, they plan to increase task complexity and diversity through advanced multimodal agents and to generate 3D assets.
“Scaling up robot data has been a major challenge in creating generalizable robot foundation models,” says Yunzhu Li, Assistant Professor of Computer Science at Columbia University, who wasn’t involved in the paper. “GenSim2 addresses this by developing a scalable framework for data and action generation, using a combination of simulation, GPT-4, and sim-to-real transfer. I’m excited to see how this work could spark a ‘GPT moment’ for robotics by effectively expanding the data available for robots.”
Wang wrote the paper with Tsinghua University assistant professor Huazhe Xu and PhD student Pu Hua, Shanghai Jiao Tong University professor Weinan Zhang and PhD students Minghuan Liu and Yunfeng Lin, and University of California at San Diego student researcher Annabella Macaluso. Their work was supported, in part, by Amazon and the company’s Greater Boston Tech Initiative. The researchers will present their work at the Conference on Robot Learning next month in Munich, Germany.