RoboCasa Leaderboard

Evaluating generalist robot policies on the RoboCasa365 multi-task learning benchmark

50 Tasks
4 Models Evaluated
3 Evaluation Splits
Updated 04/04/2026

Leaderboard

Rank | Policy | Overall | Atomic-Seen | Composite-Seen | Composite-Unseen
🥇 | GR00T N1.5 | 20.0% | 43.0% | 9.6% | 4.4%
🥈 | π0.5 | 16.9% | 39.6% | 7.1% | 1.2%
🥉 | π0 | 14.8% | 34.6% | 6.1% | 1.1%
#4 | Diffusion Policy | 6.1% | 15.7% | 0.2% | 1.25%

01 Benchmark scope

RoboCasa365 is a large-scale benchmark for generalist robot policies, spanning 365 everyday tasks across 2,500 diverse kitchen environments. This initial leaderboard covers the multi-task learning setting and compares four baseline policies: Diffusion Policy, π0, π0.5, and GR00T N1.5.

The current leaderboard is the first public comparison and will grow as users submit additional models. New entries are added only after submission review and verification, so results remain consistent and trustworthy.

02 How we evaluate

For this first release, the Overall score is the published average task success rate on the 50-task multi-task benchmark. It aggregates three evaluation splits: Atomic-Seen (18 tasks), Composite-Seen (16 tasks), and Composite-Unseen (16 tasks).
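The page does not spell out how the three splits are combined into the Overall score. The sketch below assumes a task-count-weighted mean (18, 16, and 16 tasks out of 50), which is equivalent to averaging over all 50 tasks if each split score is itself a per-task mean. Under that assumption the published Overall values are reproduced to one decimal place. The names `SPLIT_TASKS`, `SPLIT_SCORES`, and `overall_score` are illustrative and not part of any RoboCasa tooling.

```python
# Task counts per split, as stated in the evaluation description.
SPLIT_TASKS = {"Atomic-Seen": 18, "Composite-Seen": 16, "Composite-Unseen": 16}

# Per-split success rates (%) copied from the leaderboard table.
SPLIT_SCORES = {
    "GR00T N1.5": {"Atomic-Seen": 43.0, "Composite-Seen": 9.6, "Composite-Unseen": 4.4},
    "π0.5": {"Atomic-Seen": 39.6, "Composite-Seen": 7.1, "Composite-Unseen": 1.2},
    "π0": {"Atomic-Seen": 34.6, "Composite-Seen": 6.1, "Composite-Unseen": 1.1},
    "Diffusion Policy": {"Atomic-Seen": 15.7, "Composite-Seen": 0.2, "Composite-Unseen": 1.25},
}


def overall_score(split_scores, split_tasks=SPLIT_TASKS):
    """Task-count-weighted mean of per-split success rates.

    ASSUMPTION: this aggregation rule is inferred from the published
    numbers; it is not stated explicitly on the leaderboard page.
    """
    total_tasks = sum(split_tasks.values())  # 50 for the current benchmark
    weighted_sum = sum(split_scores[name] * count for name, count in split_tasks.items())
    return weighted_sum / total_tasks


if __name__ == "__main__":
    for policy, scores in SPLIT_SCORES.items():
        print(f"{policy}: {overall_score(scores):.1f}")
```

Because Atomic-Seen carries 18 of the 50 tasks, it has slightly more weight than either composite split. An unweighted mean of the three split scores would give slightly lower numbers than those shown in the table.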

We also report each split separately so that differences between short-horizon atomic tasks and longer-horizon composite tasks stay visible rather than being hidden by a single number. More metrics and tracks may be added as the benchmark expands.