RoboCasa Leaderboard
Comparing generalist robot policies on the RoboCasa365 multi-task learning benchmark
Leaderboard
| Rank | Policy | Overall | Atomic-Seen | Composite-Seen | Composite-Unseen |
|---|---|---|---|---|---|
| 🥇 | GR00T N1.5 | 20.0% | 43.0% | 9.6% | 4.4% |
| 🥈 | π0.5 | 16.9% | 39.6% | 7.1% | 1.2% |
| 🥉 | π0 | 14.8% | 34.6% | 6.1% | 1.1% |
| #4 | Diffusion Policy | 6.1% | 15.7% | 0.2% | 1.25% |
01 Benchmark scope
RoboCasa365 is a large-scale benchmark for generalist robot policies, spanning 365 everyday tasks across 2,500 diverse kitchen environments. This initial leaderboard focuses on the multi-task learning setting and compares four baseline policies: Diffusion Policy, π0, π0.5, and GR00T N1.5.
The current leaderboard presents the first public comparison and will grow as users submit additional models. New entries are added only after a submission has been reviewed and verified, so that results remain consistent and trustworthy.
02 How we evaluate
For this first release, the Overall score is the published average task success rate, in percent, on the 50-task multi-task benchmark. It aggregates three evaluation splits: Atomic-Seen (18 tasks), Composite-Seen (16 tasks), and Composite-Unseen (16 tasks).
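As an illustration of how the three splits can combine into one number, the sketch below assumes the Overall score is a task-weighted mean of the per-split success rates, with weights 18, 16, and 16. The exact aggregation formula is not stated here, so this is an assumption; it does match the Overall column of the leaderboard table to one decimal place. The names `SPLIT_TASKS` and `overall_score` are hypothetical and are not part of any RoboCasa API.

```python
# Task counts per evaluation split, as stated in the text above.
SPLIT_TASKS = {"Atomic-Seen": 18, "Composite-Seen": 16, "Composite-Unseen": 16}


def overall_score(split_rates):
    """Task-weighted mean of per-split success rates (in percent).

    Assumption: each split contributes in proportion to its task count.
    """
    total_tasks = sum(SPLIT_TASKS.values())  # 50 tasks in total
    weighted = sum(SPLIT_TASKS[name] * split_rates[name] for name in SPLIT_TASKS)
    return weighted / total_tasks


# Per-split success rates (percent) copied from the leaderboard table.
leaderboard = {
    "GR00T N1.5": {"Atomic-Seen": 43.0, "Composite-Seen": 9.6, "Composite-Unseen": 4.4},
    "π0.5": {"Atomic-Seen": 39.6, "Composite-Seen": 7.1, "Composite-Unseen": 1.2},
    "π0": {"Atomic-Seen": 34.6, "Composite-Seen": 6.1, "Composite-Unseen": 1.1},
    "Diffusion Policy": {"Atomic-Seen": 15.7, "Composite-Seen": 0.2, "Composite-Unseen": 1.25},
}

for policy, rates in leaderboard.items():
    print(f"{policy}: {overall_score(rates):.1f}")
```

Because Atomic-Seen contains slightly more tasks than either composite split, it carries slightly more weight under this assumption. A plain unweighted mean of the three split rates would give GR00T N1.5 a score of 19.0 rather than the listed 20.0.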
We also display each split separately so differences between short-horizon atomic tasks and longer-horizon composite tasks remain visible instead of being hidden by a single number. More metrics and tracks can be added here as the benchmark expands.