SuSIE makes generalizable robotic manipulation as easy as finetuning an open-source image-editing model, such as InstructPix2Pix. Given an image and a language command, SuSIE executes the command by "editing" the image into a meaningful subgoal and then achieving that subgoal using a low-level goal-reaching policy. Much like a person first constructs a high-level plan to complete a task before deferring to muscle memory for the low-level control, SuSIE first leverages a simple image-generative model for visuo-semantic reasoning before deferring to a low-level policy to determine precise motor actuations.
We find that this recipe enables significantly better generalization and precision than conventional language-conditioned policies. SuSIE achieves state-of-the-art results on the simulated CALVIN benchmark, and also demonstrates robust performance on real-world manipulation tasks, beating RT-2-X as well as an oracle policy that gets access to ground-truth goal images.
SuSIE alternates between generating subgoals with an image-editing diffusion model and executing those subgoals with a language-agnostic low-level policy.
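The alternation described above can be sketched as a simple control loop. This is a minimal illustration, not the released implementation: the function names (`propose_subgoal`, `goal_policy`) and the fixed per-subgoal horizon are assumptions standing in for the image-editing diffusion model and the low-level goal-reaching policy.

```python
# Hypothetical sketch of SuSIE's high-level/low-level control loop.
# propose_subgoal: stands in for the image-editing diffusion model
#   (current image + language instruction -> subgoal image).
# goal_policy: stands in for the language-agnostic goal-conditioned policy
#   (current image + subgoal image -> motor action).

def run_susie(env, instruction, propose_subgoal, goal_policy,
              max_subgoals=10, steps_per_subgoal=20):
    """Alternate subgoal generation and goal-reaching control."""
    obs = env.reset()
    for _ in range(max_subgoals):
        # High-level step: "edit" the current observation into a subgoal
        # image conditioned on the language instruction.
        subgoal = propose_subgoal(obs, instruction)
        # Low-level steps: chase the subgoal for a fixed horizon before a
        # fresh subgoal is generated from the new observation.
        for _ in range(steps_per_subgoal):
            action = goal_policy(obs, subgoal)
            obs, done = env.step(action)
            if done:
                return obs
    return obs
```

Regenerating the subgoal every fixed number of steps (rather than waiting for the subgoal to be reached exactly) keeps the high-level plan fresh even when the low-level policy drifts.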
SuSIE demonstrates state-of-the-art performance in the zero-shot setting of the CALVIN benchmark, where the policy is trained on 3 environments (A, B, and C) and tested on a fourth (D). SuSIE also shows remarkably strong performance in the real world. Scene C is situated on top of a table with a tiled surface, which is unlike anything seen in the training data, and requires manipulating 4 objects, 3 of which are unseen in the training data. This means that the robot must identify the novel objects from the language instruction alone while ignoring its affinity for the one object that is heavily represented in its training data. SuSIE outperforms all of the baseline methods in all of the scenes, including RT-2-X, a 55-billion-parameter vision-language-action model trained on a 20x larger superset of SuSIE's robot training data.
We were surprised to find that SuSIE did not just help with zero-shot generalization, but also seemed to help significantly with low-level precision. To put this to the test, we evaluated an Oracle GCBC baseline, where we trained a goal-conditioned policy to reach any state in a trajectory and then provided that policy with ground-truth goal images at test time. Unlike SuSIE, this method does not need to interpret the language instruction or otherwise do any semantic reasoning. It uses the exact same policy architecture and hyperparameters as the low-level policy in SuSIE, meaning any remaining advantage from SuSIE must come from the way it breaks down the task hierarchically.
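Training a goal-conditioned policy "to reach any state in a trajectory" is typically done with hindsight goal relabeling: each training example pairs an observation and action with a state sampled from later in the same trajectory. The sketch below is a minimal illustration of that idea; the function name and the uniform future-state sampling are assumptions, not necessarily the exact scheme used for the Oracle GCBC baseline.

```python
# Hypothetical sketch of hindsight goal relabeling for goal-conditioned
# behavior cloning (GCBC). Any future state in a trajectory can serve as
# the goal image for the (observation, action) pair at an earlier step.
import random

def sample_gcbc_example(trajectory, rng=random):
    """Sample an (observation, goal, action) training triple.

    trajectory: list of (observation, action) pairs in time order.
    """
    # Pick a timestep that has at least one future state to use as a goal.
    t = rng.randrange(len(trajectory) - 1)
    # Pick the goal uniformly from the strictly-later states.
    g = rng.randrange(t + 1, len(trajectory))
    obs, action = trajectory[t]
    goal_obs, _ = trajectory[g]
    return obs, goal_obs, action
```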
Quantitatively, Oracle GCBC is one of the strongest baselines. However, it is still outperformed by SuSIE on average, and fails to precisely manipulate objects in many scenarios where SuSIE succeeds. For instance, SuSIE is the only method that can consistently grasp the yellow bell pepper, which is light, smooth, and almost as wide as the gripper. In "put the marker on the towel," observe how the low-level policy cannot grasp the marker until SuSIE proposes a subgoal that clearly demonstrates the correct grasp height. In "put the coffee creamer on the plate," watch as both methods make the same mistake (early dropping); however, SuSIE easily gets back on track thanks to subgoal guidance.
Since the image-editing model in SuSIE does not require robot actions, it can leverage broad video data for even greater zero-shot generalization. In addition to the robot demonstrations in BridgeData, we also trained SuSIE on the Something-Something dataset, which contains 220,847 labeled video clips of humans manipulating various objects. Here, we present an ablation showing that the video data does indeed help, especially in the zero-shot setting. We hope that future work will explore scaling up SuSIE to even larger and broader video datasets.
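The reason action-free video can co-train the subgoal generator is that the image-editing model only needs (current frame, language, future frame) triples, which both robot demonstrations and human videos provide. The sketch below illustrates this, assuming a toy clip format of (frame list, caption); the frame offset `k` and the 50/50 sampling ratio are illustrative assumptions, not the paper's settings.

```python
# Hypothetical sketch of co-training the subgoal generator on robot demos
# and action-free human video. Both sources yield the same kind of
# image-editing training triple.
import random

def make_editing_example(clip, k=10, rng=random):
    """Turn a labeled video clip into a (frame, caption, future frame) triple."""
    frames, caption = clip
    t = rng.randrange(max(1, len(frames) - k))
    # The frame k steps later acts as the "edited" target image.
    return frames[t], caption, frames[min(t + k, len(frames) - 1)]

def sample_batch(robot_clips, video_clips, batch_size=8, robot_frac=0.5,
                 rng=random):
    """Mix robot and human-video clips into one training batch."""
    batch = []
    for _ in range(batch_size):
        pool = robot_clips if rng.random() < robot_frac else video_clips
        batch.append(make_editing_example(rng.choice(pool), rng=rng))
    return batch
```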