Scaling data and models has played a pivotal role in the remarkable progress of computer vision and language. Inspired by these domains, recent efforts in robotics have similarly focused on scaling both data and model size to develop more generalizable and robust policies. However, unlike vision and language, robotics lacks access to internet-scale demonstrations across diverse robotic tasks and environments. As a result, existing datasets remain limited in scale by the need for manual data collection and curation. To address this problem, we propose BLAZER, a framework that learns manipulation policies from automatically generated training data. We build on the zero-shot capabilities of LLM planners and automatically generate demonstrations for diverse manipulation tasks in simulation. Successful examples are then used to finetune an LLM and improve its planning capabilities without human supervision. Notably, while BLAZER training requires access to the simulator's state, we demonstrate direct transfer of the acquired skills to sensor-based manipulation. Through extensive experiments, we show that BLAZER significantly improves zero-shot manipulation in both simulated and real environments. Moreover, BLAZER improves performance on tasks outside its training pool and enables the downscaling of LLMs. Our code and data will be made publicly available.
Overview of BLAZER. Given a set of manipulation tasks \( \tau \in \mathcal{T} \), we use an LLM to automatically generate executable commands \( \mathcal{C}_\tau \) for solving \( \tau \). The resulting solutions are automatically verified by executing \( \mathcal{C}_\tau \) in a simulator, and successful solutions are added to the task database \( \mathcal{D}_\tau \). The task databases for all training tasks in \( \mathcal{T} \) are merged into \( \mathcal{D}_{\text{BLAZER}} \), which is used for supervised fine-tuning of the BLAZER LLM.
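To make the data-generation stage concrete, the sketch below outlines the loop in Python. All interfaces here (query_llm, execute_and_verify, the attempts_per_task budget) are hypothetical placeholders standing in for the paper's actual planner prompts, simulator, and success checks; it is a minimal illustration of the loop under these assumptions, not the implementation.

```python
from typing import Callable

def generate_blazer_dataset(
    tasks: list[str],
    query_llm: Callable[[str], str],                 # hypothetical LLM_boot planner: task description -> commands C_tau
    execute_and_verify: Callable[[str, str], bool],  # hypothetical rollout: executes C_tau in simulation, returns success
    attempts_per_task: int = 10,                     # assumed sampling budget per task
) -> list[dict]:
    """Collect verified (task, commands) pairs, i.e. D_BLAZER, without human supervision."""
    dataset: list[dict] = []
    for task in tasks:                               # tau in T
        for _ in range(attempts_per_task):
            commands = query_llm(task)               # propose C_tau zero-shot
            if execute_and_verify(task, commands):   # keep only solutions verified in the simulator
                dataset.append({"prompt": task, "completion": commands})
    return dataset                                   # merged task databases -> SFT data
```

The returned prompt/completion pairs correspond to \( \mathcal{D}_{\text{BLAZER}} \) and can be passed to a standard supervised fine-tuning pipeline to obtain the BLAZER LLM.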
Comparison with zero-shot baselines. We report the task success rate (%) of different methods on nine manipulation tasks from RLBench. A small LLaMA-8B model fine-tuned with BLAZER achieves the best overall performance. Note how LLaMA-8B with \( \text{BLAZER} \) considerably outperforms LLaMA-70B, which was used as \( \text{LLM}_\text{boot} \). This implies that \( \text{BLAZER} \) can yield LLMs that outperform their teacher models on manipulation tasks. For each task, the best-performing method is shown in bold and the second-best is underlined.
Real-world results. We compare LLaMA-8B fine-tuned with \( \text{BLAZER} \) against LLaMA-70B on the real-world tasks depicted in the figure. The quantitative results in the table show that our method outperforms the baseline on both in-distribution tasks (similar to \( \mathcal{T} \)) and out-of-distribution tasks, demonstrating the generalization capability of \( \text{BLAZER} \).