RoboCasa Leaderboard
Benchmarking generalist robot policies on the multi-task learning benchmark
Leaderboard
| Rank | Policy | Overall | Atomic-Seen | Composite-Seen | Composite-Unseen | Training Config |
|---|---|---|---|---|---|---|
| 🥇 | GR00T N1.6 | 21.9% | 51.1% | 9.4% | 1.7% | Batch size: 128; training steps: 120,000 |
| 🥈 | GigaWorld-Policy 0.1 | 20.7% | 44.4% | 11.8% | 2.9% | Batch size: 64; training steps: 600,000 |
| 🥉 | GR00T N1.5 * | 20.0% | 43.0% | 9.6% | 4.4% | Batch size: 128; training steps: 120,000 |
| #4 | π0.5 | 16.9% | 39.6% | 7.1% | 1.2% | Batch size: 64; training steps: 75,000 |
| #5 | π0 | 14.8% | 34.6% | 6.1% | 1.1% | Batch size: 64; training steps: 75,000 |
| #6 | Diffusion Policy | 6.1% | 15.7% | 0.2% | 1.3% | Batch size: 192; training steps: 250,000 |
* Evaluated with an episode horizon 33% shorter than the standard horizon.
1 Benchmark scope
RoboCasa365 is a large-scale benchmark for generalist robot policies spanning 365 everyday tasks across 2,500 diverse kitchen environments. This leaderboard focuses on the multi-task learning setting and includes the four baseline policy families: Diffusion Policy, π0, π0.5, and GR00T N1.5.
The current leaderboard presents the first public comparison and will grow as users submit additional models. New entries are added only after submission review and verification, so results remain consistent and trustworthy. Results are reported on a 50-task multi-task benchmark spanning atomic manipulation and longer-horizon composite kitchen skills.
2 How we evaluate
For this first release, the Overall score is the published average task success rate on the 50-task multi-task benchmark. It aggregates three evaluation splits: Atomic-Seen (18 tasks), Composite-Seen (16 tasks), and Composite-Unseen (16 tasks). For more details, see our benchmarking documentation.
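As a rough illustration of how the three splits could combine into the Overall score, the sketch below assumes Overall is a per-task average, which is equivalent to weighting each split's success rate by its task count (18, 16, 16). This weighting is an assumption, not a documented formula, although it reproduces the Overall values in the table to one decimal place. The names `SPLIT_TASKS`, `LEADERBOARD`, and `overall_score` are hypothetical and not part of any RoboCasa API.

```python
# Task counts per evaluation split, as stated in the text above.
SPLIT_TASKS = {"atomic_seen": 18, "composite_seen": 16, "composite_unseen": 16}


def overall_score(split_success):
    """Task-count-weighted mean of per-split success rates (in percent).

    Assumption: Overall is the mean success rate over all 50 tasks, so each
    split contributes in proportion to its number of tasks.
    """
    total_tasks = sum(SPLIT_TASKS.values())
    weighted_sum = sum(SPLIT_TASKS[name] * split_success[name] for name in SPLIT_TASKS)
    return weighted_sum / total_tasks


# Per-split success rates (%) copied from the leaderboard table.
LEADERBOARD = {
    "GR00T N1.6": {"atomic_seen": 51.1, "composite_seen": 9.4, "composite_unseen": 1.7},
    "GigaWorld-Policy 0.1": {"atomic_seen": 44.4, "composite_seen": 11.8, "composite_unseen": 2.9},
    "GR00T N1.5": {"atomic_seen": 43.0, "composite_seen": 9.6, "composite_unseen": 4.4},
    "π0.5": {"atomic_seen": 39.6, "composite_seen": 7.1, "composite_unseen": 1.2},
    "π0": {"atomic_seen": 34.6, "composite_seen": 6.1, "composite_unseen": 1.1},
    "Diffusion Policy": {"atomic_seen": 15.7, "composite_seen": 0.2, "composite_unseen": 1.3},
}

if __name__ == "__main__":
    for policy, splits in LEADERBOARD.items():
        print(f"{policy}: {overall_score(splits):.1f}")
```

Because the published per-split rates are themselves rounded, a recomputed Overall may differ from the official value in the last digit for other entries.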
Policies are trained on the Human300 pretraining dataset (300 tasks across 2,500 pretraining kitchens) and evaluated on the 50 target tasks in pretraining kitchens. Atomic-Seen and Composite-Seen tasks appear in pretraining, while Composite-Unseen tasks are held out from pretraining and evaluated zero-shot for generalization to novel composite tasks.