RoboCasa365 Leaderboard

Benchmarking generalist robot policies on the multi-task learning benchmark

50 Tasks

11 Models Evaluated

3 Evaluation Splits

Updated 07/16/2026

Leaderboard

Rank	Policy	Overall	Atomic-Seen	Composite-Seen	Composite-Unseen	Training Config	Open Source
🥇	Xiaomi-Robotics-1	57.4%	80.2%	57.1%	32.1%	Batch Size: 512, Training Steps: 120,000
🥈	ABot-M0.6	46.6%	79.4%	48.3%	7.9%	Batch Size: 256, Training Steps: 120,000
🥉	ABot-M0.5	40.3%	75.6%	37.7%	3.3%	Batch Size: 64, Training Steps: 100,000
#4	RLDX-1	36.0%	67.6%	27.9%	8.5%	Batch Size: 192, Training Steps: 250,000	✓
#5	WorldDreamer	35.3%	66.3%	26.7%	9.0%	Batch Size: 20, Training Steps: 375,000	✓
#6	GR00T N1.5	23.9%	50.7%	14.8%	2.7%	Batch Size: 128, Training Steps: 120,000	✓
#7	GR00T N1.6	21.9%	51.1%	9.4%	1.7%	Batch Size: 128, Training Steps: 120,000	✓
#8	GigaWorld-Policy 0.1	20.7%	44.4%	11.8%	2.9%	Batch Size: 64, Training Steps: 600,000	✓
#9	π0.5	16.9%	39.6%	7.1%	1.2%	Batch Size: 64, Training Steps: 75,000	✓
#10	π0	14.8%	34.6%	6.1%	1.1%	Batch Size: 64, Training Steps: 75,000	✓
#11	Diffusion Policy	6.1%	15.7%	0.2%	1.3%	Batch Size: 192, Training Steps: 250,000	✓

Note: For fairness and consistency, GR00T N1.5 was re-evaluated with a horizon 1.5x longer than the one used for the paper’s reported results, and its scores reflect the RoboCasa 1.0.1 update.

Click a policy name to open its submission details (codebase, checkpoint, training configuration, and additional information)

Disclaimer: Training configurations are shown for transparency. Because architectures and training setups differ, they should not be compared directly across models.

1 Benchmark scope

RoboCasa365 is a large-scale benchmark for generalist robot policies spanning 365 everyday tasks across 2,500 diverse kitchen environments. This leaderboard focuses on the multi-task learning setting and includes the four baseline policy families: Diffusion Policy, π0, π0.5, and GR00T N1.5.

The current leaderboard highlights the first public comparison and will continue to grow as users submit additional models. New entries are added after submission review and verification so results remain consistent and trustworthy. Results are reported on a 50-task multi-task benchmark spanning atomic manipulation and longer-horizon composite kitchen skills.

2 How we evaluate

For this first release, the Overall score is the published average task success rate on the 50-task multi-task benchmark. It aggregates three evaluation splits: Atomic-Seen (18 tasks), Composite-Seen (16 tasks), and Composite-Unseen (16 tasks). For more details, see our benchmarking documentation.
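The page does not spell out the aggregation formula. As a minimal sketch, assuming each split contributes in proportion to its task count (18, 16, and 16 out of 50), a task-weighted mean reproduces the Overall values listed in the leaderboard table. The names `SPLIT_TASKS` and `overall_score` are illustrative and not part of any RoboCasa API.

```python
# Task counts per split, as stated in the evaluation description.
SPLIT_TASKS = {"atomic_seen": 18, "composite_seen": 16, "composite_unseen": 16}


def overall_score(split_success: dict[str, float]) -> float:
    """Task-count-weighted mean of per-split success rates, in percent.

    Assumption: each split is weighted by its number of tasks, which is
    equivalent to averaging success over all 50 tasks.
    """
    total_tasks = sum(SPLIT_TASKS.values())  # 50
    weighted = sum(SPLIT_TASKS[name] * split_success[name] for name in SPLIT_TASKS)
    return round(weighted / total_tasks, 1)


# Per-split values copied from the leaderboard rows for two baselines.
gr00t_n15 = {"atomic_seen": 50.7, "composite_seen": 14.8, "composite_unseen": 2.7}
diffusion_policy = {"atomic_seen": 15.7, "composite_seen": 0.2, "composite_unseen": 1.3}

print(overall_score(gr00t_n15))         # leaderboard lists 23.9
print(overall_score(diffusion_policy))  # leaderboard lists 6.1
```

Under this assumption, Atomic-Seen carries slightly more weight (36%) than either composite split (32% each), so a plain unweighted mean of the three split columns would give a somewhat different number.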

Policies are trained on the Human300 pretraining dataset (300 tasks across 2,500 pretraining kitchens) and evaluated on the 50 target tasks in pretraining kitchens. Atomic-Seen and Composite-Seen tasks appear in pretraining, while Composite-Unseen tasks are held out from pretraining and evaluated zero-shot for generalization to novel composite tasks.
