RoboCasa Leaderboard

Comparing generalist robot policies on the RoboCasa365 multi-task learning benchmark

50 Tasks
6 Models Evaluated
3 Evaluation Splits
Updated 05/16/2026

Leaderboard

| Rank | Policy | Overall | Atomic-Seen | Composite-Seen | Composite-Unseen | Training Config |
|------|--------|---------|-------------|----------------|------------------|-----------------|
| 🥇 | GR00T N1.6 | 21.9% | 51.1% | 9.4% | 1.7% | Batch Size: 128, Training Steps: 120000 |
| 🥈 | GigaWorld-Policy 0.1 | 20.7% | 44.4% | 11.8% | 2.9% | Batch Size: 64, Training Steps: 600000 |
| 🥉 | GR00T N1.5 * | 20.0% | 43.0% | 9.6% | 4.4% | Batch Size: 128, Training Steps: 120000 |
| #4 | π0.5 | 16.9% | 39.6% | 7.1% | 1.2% | Batch Size: 64, Training Steps: 75000 |
| #5 | π0 | 14.8% | 34.6% | 6.1% | 1.1% | Batch Size: 64, Training Steps: 75000 |
| #6 | Diffusion Policy | 6.1% | 15.7% | 0.2% | 1.3% | Batch Size: 192, Training Steps: 250000 |

* Evaluated with a horizon 33% shorter than standard


1 Benchmark scope

RoboCasa365 is a large-scale benchmark for generalist robot policies, spanning 365 everyday tasks across 2,500 diverse kitchen environments. This leaderboard focuses on the multi-task learning setting. It includes the four baseline policy families (Diffusion Policy, π0, π0.5, and GR00T N1.5) alongside additional entries (GR00T N1.6 and GigaWorld-Policy 0.1).

Results are reported on a 50-task multi-task benchmark spanning atomic manipulation and longer-horizon composite kitchen skills. The current leaderboard presents the first public comparison and will grow as users submit additional models. New entries are added only after submission review and verification, so that results remain consistent and trustworthy.

2 How we evaluate

For this first release, the Overall score is the published average task success rate on the 50-task multi-task benchmark. It aggregates three evaluation splits: Atomic-Seen (18 tasks), Composite-Seen (16 tasks), and Composite-Unseen (16 tasks). For more details, see our benchmarking documentation.
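As a rough illustration of how the Overall score relates to the three splits, the sketch below computes a task-weighted mean of the per-split success rates. Weighting each split by its task count (18, 16, 16) is an assumption on our part; it follows from reading "average task success rate" as a mean over all 50 tasks, and it reproduces the published Overall values to one decimal place. The names `SPLIT_TASKS` and `overall_score` are illustrative and do not come from the RoboCasa codebase.

```python
# Task counts per evaluation split, as stated in the text above.
SPLIT_TASKS = {"atomic_seen": 18, "composite_seen": 16, "composite_unseen": 16}


def overall_score(split_rates: dict[str, float]) -> float:
    """Task-weighted mean of per-split success rates (in percent).

    Assumption: each split contributes in proportion to its number of tasks,
    which is equivalent to averaging over all 50 tasks.
    """
    total_tasks = sum(SPLIT_TASKS.values())
    weighted = sum(SPLIT_TASKS[name] * split_rates[name] for name in SPLIT_TASKS)
    return weighted / total_tasks


# Per-split rates for GR00T N1.6, copied from the leaderboard table.
gr00t_n16 = {"atomic_seen": 51.1, "composite_seen": 9.4, "composite_unseen": 1.7}
print(round(overall_score(gr00t_n16), 1))  # 21.9
```

Because the published split rates are themselves rounded, this is a consistency check on the table and not a replacement for the per-task results.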

Policies are trained on the Human300 pretraining dataset (300 tasks across 2,500 pretraining kitchens) and evaluated on the 50 target tasks in pretraining kitchens. Atomic-Seen and Composite-Seen tasks appear in the pretraining data. Composite-Unseen tasks are held out from pretraining and evaluated zero-shot to measure generalization to novel composite tasks.