# Foundation Model Learning
In the foundation model learning benchmark we study foundation model training: training on our pretraining datasets, followed by fine-tuning on our target datasets. There are two phases of training:

1. **Pretraining:** training on 2000+ hours of data across 300 tasks. The data comprises human datasets for the 300 tasks (482 hours) and synthetic MimicGen data across 60 atomic tasks (1615 hours).
2. **Target task fine-tuning:** fine-tuning the pretrained model on human datasets across 50 tasks (193 hours). We fine-tune the model independently on three separate splits of target data:
   - Atomic-Seen (18 atomic tasks, also seen in pretraining)
   - Composite-Seen (16 composite tasks, also seen in pretraining)
   - Composite-Unseen (a separate set of 16 composite tasks, not seen in pretraining)
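As a quick sanity check, the per-source hour figures stated above add up to the headline total:

```python
# Sanity-check the pretraining data totals stated above.
human_hours = 482      # human datasets across 300 tasks
mimicgen_hours = 1615  # synthetic MimicGen data across 60 atomic tasks
total = human_hours + mimicgen_hours
print(total)  # 2097, consistent with the "2000+ hours" figure
```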
## Benchmark results and checkpoints

We perform a benchmark featuring the GR00T N1.5 algorithm. We compare pretraining only, target task learning only, and pretraining followed by target task fine-tuning. Here is a summary of our benchmarking results (average task success rate, in %). We share the model checkpoints for reference.
| Task Type | Pretraining Only | Target Task Learning Only (10%) | Target Task Learning Only (30%) | Target Task Learning Only (100%) | Pretraining + Target Task Post-Training (10%) | Pretraining + Target Task Post-Training (30%) | Pretraining + Target Task Post-Training (100%) |
|---|---|---|---|---|---|---|---|
| Atomic-Seen | 41.9% | 38.7% | 50.6% | 60.6% | 56.9% | 59.1% | 68.5% |
| Composite-Seen | 0.0% | 11.0% | 22.7% | 35.0% | 25.4% | 34.6% | 40.6% |
| Composite-Unseen | 0.2% | 11.2% | 27.5% | 33.3% | 22.7% | 30.8% | 42.1% |
| Average | 15.1% | 21.0% | 34.3% | 43.7% | 35.9% | 42.2% | 51.1% |
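The Average row matches a task-count-weighted mean over the three splits (18, 16, and 16 tasks respectively), rather than an unweighted mean of the three rows. A small sketch that reproduces it from the per-split numbers:

```python
# Reproduce the "Average" row as a task-count-weighted mean of the three splits.
# Task counts (18/16/16) are taken from the fine-tuning split description above.
counts = [18, 16, 16]  # Atomic-Seen, Composite-Seen, Composite-Unseen

# Columns: Pretraining Only; Target Only 10/30/100%; Pretrain + Post-Train 10/30/100%
rows = {
    "atomic_seen":      [41.9, 38.7, 50.6, 60.6, 56.9, 59.1, 68.5],
    "composite_seen":   [ 0.0, 11.0, 22.7, 35.0, 25.4, 34.6, 40.6],
    "composite_unseen": [ 0.2, 11.2, 27.5, 33.3, 22.7, 30.8, 42.1],
}

avg = [
    round(sum(c * vals[j] for c, vals in zip(counts, rows.values())) / sum(counts), 1)
    for j in range(7)
]
print(avg)  # [15.1, 21.0, 34.3, 43.7, 35.9, 42.2, 51.1] — the reported Average row
```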
### Model Checkpoints

| Model Checkpoint | Link |
|---|---|
| Pretraining Only | Link |
| Target Only (100%) - Atomic-Seen | Link |
| Target Only (100%) - Composite-Seen | Link |
| Target Only (100%) - Composite-Unseen | Link |
| Pretraining + Target Post-Training (100%) - Atomic-Seen | Link |
| Pretraining + Target Post-Training (100%) - Composite-Seen | Link |
| Pretraining + Target Post-Training (100%) - Composite-Unseen | Link |
## Benchmark instructions

### GR00T

#### Guidelines
- We use a batch size of 128.
- For pretraining, we train for 80k steps.
- For target task fine-tuning, we train for 60k steps.
- We always evaluate the models in target scenes.
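For a rough sense of scale, these guidelines imply the following number of training samples processed (back-of-the-envelope only; actual counts depend on the data loader and any gradient accumulation):

```python
# Samples processed = batch size x optimizer steps (illustrative arithmetic only).
batch_size = 128
pretrain_steps = 80_000
finetune_steps = 60_000

print(batch_size * pretrain_steps)  # 10,240,000 samples seen during pretraining
print(batch_size * finetune_steps)  # 7,680,000 samples seen per fine-tuning run
```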
#### Train model

```bash
# Run pretraining
python scripts/gr00t_finetune.py \
  --output-dir expdata/foundation_model_learning/pretraining \
  --dataset_soup pretrain_human300_mg60 \
  --max_steps 80000

# Target task fine-tuning: atomic-seen, composite-seen, and composite-unseen tasks.
# The following three training experiments can be run in parallel.
python scripts/gr00t_finetune.py \
  --output-dir expdata/foundation_model_learning/target_task_finetuning/atomic_seen \
  --base_model_path expdata/foundation_model_learning/pretraining/checkpoint-80000 \
  --dataset_soup target_atomic_seen \
  --max_steps 60000

python scripts/gr00t_finetune.py \
  --output-dir expdata/foundation_model_learning/target_task_finetuning/composite_seen \
  --base_model_path expdata/foundation_model_learning/pretraining/checkpoint-80000 \
  --dataset_soup target_composite_seen \
  --max_steps 60000

python scripts/gr00t_finetune.py \
  --output-dir expdata/foundation_model_learning/target_task_finetuning/composite_unseen \
  --base_model_path expdata/foundation_model_learning/pretraining/checkpoint-80000 \
  --dataset_soup target_composite_unseen \
  --max_steps 60000
```
#### Evaluate model

```bash
# Evaluate the pretrained model on all three task sets
python scripts/run_eval.py \
  --model_path expdata/foundation_model_learning/pretraining/checkpoint-80000 \
  --task_set atomic_seen composite_seen composite_unseen \
  --split target

# Evaluate target fine-tuning: atomic-seen tasks
python scripts/run_eval.py \
  --model_path expdata/foundation_model_learning/target_task_finetuning/atomic_seen/checkpoint-60000 \
  --task_set atomic_seen \
  --split target

# Evaluate target fine-tuning: composite-seen tasks
python scripts/run_eval.py \
  --model_path expdata/foundation_model_learning/target_task_finetuning/composite_seen/checkpoint-60000 \
  --task_set composite_seen \
  --split target

# Evaluate target fine-tuning: composite-unseen tasks
python scripts/run_eval.py \
  --model_path expdata/foundation_model_learning/target_task_finetuning/composite_unseen/checkpoint-60000 \
  --task_set composite_unseen \
  --split target
```
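Since the three fine-tuned models are evaluated the same way, the per-split commands can be wrapped in a loop. Shown here in dry-run form (each command is echoed rather than executed, and the checkpoint paths assume the training output directories from the Train model section):

```shell
# Print the three per-split evaluation commands, one per line.
# Drop the leading 'echo' to actually run them.
for split in atomic_seen composite_seen composite_unseen; do
  echo python scripts/run_eval.py \
    --model_path "expdata/foundation_model_learning/target_task_finetuning/${split}/checkpoint-60000" \
    --task_set "${split}" \
    --split target
done
```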
#### Report evaluation results

```bash
python gr00t/eval/get_eval_stats.py \
  --dir <your-ckpt>
```