RoboCasa365 Leaderboard

Benchmarking generalist robot policies on the multi-task learning benchmark

50 Tasks
7 Models Evaluated
3 Evaluation Splits
Updated 05/23/2026

Leaderboard

Rank | Policy | Overall | Atomic-Seen | Composite-Seen | Composite-Unseen | Training Config
🥇 | RLDX-1 | 33.2 | 63.0% | 27.5% | 5.4% | Batch Size: 192; Training Steps: 250000
🥈 | GR00T N1.5 | 23.9 | 50.7% | 14.8% | 2.7% | Batch Size: 128; Training Steps: 120000
🥉 | GR00T N1.6 | 21.9 | 51.1% | 9.4% | 1.7% | Batch Size: 128; Training Steps: 120000
#4 | GigaWorld-Policy 0.1 | 20.7 | 44.4% | 11.8% | 2.9% | Batch Size: 64; Training Steps: 600000
#5 | π0.5 | 16.9 | 39.6% | 7.1% | 1.2% | Batch Size: 64; Training Steps: 75000
#6 | π0 | 14.8 | 34.6% | 6.1% | 1.1% | Batch Size: 64; Training Steps: 75000
#7 | Diffusion Policy | 6.1 | 15.7% | 0.2% | 1.3% | Batch Size: 192; Training Steps: 250000

Note: For fairness and consistency, GR00T N1.5 was re-evaluated with a 1.5x longer horizon than in the paper’s reported results, and its scores reflect the RoboCasa 1.0.1 update.

Click a policy name to open its submission details (codebase, checkpoint, training configuration, and additional information).

Disclaimer: Training configurations are shown for transparency and should not be compared directly across models, because architectures and training setups differ.

1 Benchmark scope

RoboCasa365 is a large-scale benchmark for generalist robot policies spanning 365 everyday tasks across 2,500 diverse kitchen environments. This leaderboard focuses on the multi-task learning setting and includes the four baseline policy families: Diffusion Policy, π0, π0.5, and GR00T N1.5.

The current leaderboard highlights the first public comparison and will continue to grow as users submit additional models. New entries are added after submission review and verification so results remain consistent and trustworthy. Results are reported on a 50-task multi-task benchmark spanning atomic manipulation and longer-horizon composite kitchen skills.

2 How we evaluate

For this first release, the Overall score is the published average task success rate on the 50-task multi-task benchmark. It aggregates three evaluation splits: Atomic-Seen (18 tasks), Composite-Seen (16 tasks), and Composite-Unseen (16 tasks). For more details, see our benchmarking documentation.
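The aggregation described above can be sketched as follows. This is a minimal illustration, assuming the Overall score equals the task-count-weighted mean of the three split success rates (which is the same as a plain average over all 50 tasks). The names `SPLIT_TASKS` and `overall_score` are hypothetical and are not taken from the RoboCasa codebase. Under this assumption, the split scores listed on the leaderboard reproduce the listed Overall values to one decimal place.

```python
# Number of tasks in each evaluation split, as stated in the text above.
SPLIT_TASKS = {
    "atomic_seen": 18,
    "composite_seen": 16,
    "composite_unseen": 16,
}


def overall_score(split_rates):
    """Task-count-weighted mean of per-split success rates (in percent).

    Assumption: each split rate is itself the mean success rate over
    the tasks in that split, so weighting by task count recovers the
    average over all 50 tasks.
    """
    total_tasks = sum(SPLIT_TASKS.values())
    weighted_sum = sum(
        SPLIT_TASKS[name] * split_rates[name] for name in SPLIT_TASKS
    )
    return weighted_sum / total_tasks


# Split scores copied from the leaderboard rows for two policies.
rldx_1 = {"atomic_seen": 63.0, "composite_seen": 27.5, "composite_unseen": 5.4}
diffusion_policy = {"atomic_seen": 15.7, "composite_seen": 0.2, "composite_unseen": 1.3}

print(round(overall_score(rldx_1), 1))            # 33.2
print(round(overall_score(diffusion_policy), 1))  # 6.1
```

Because Atomic-Seen contributes 18 of the 50 tasks, it carries slightly more weight (36%) than either composite split (32% each).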

Policies are trained on the Human300 pretraining dataset (300 tasks across 2,500 pretraining kitchens) and evaluated on the 50 target tasks in pretraining kitchens. Atomic-Seen and Composite-Seen tasks appear in pretraining, while Composite-Unseen tasks are held out from pretraining and evaluated zero-shot for generalization to novel composite tasks.