nanochat Training Pipeline
nanochat's training flow has three stages: tokenizer training, base model pretraining, and chat model supervised fine-tuning (SFT). All scripts live in scripts/ and runs/. The speedrun script orchestrates the full pipeline, but you can also run each stage manually for experimentation.
Speedrun (GPT-2 in ~3 hours)
The reference way to train a GPT-2 grade model is runs/speedrun.sh. Run on an 8×H100 node:
bash runs/speedrun.sh
Example manual run (jan29 baseline):
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=24 \
--run=d24-jan29 \
--model-tag=d24_jan29 \
--device-batch-size=16 \
--sample-every=-1 \
--save-every=-1 \
--core-metric-max-per-task=-1 \
--core-metric-every=3000 \
--target-param-data-ratio=12
Base model pretraining
Use scripts/base_train.py for pretraining. Key arguments: --depth (the model size dial: more layers, larger model), --device-batch-size, and --target-param-data-ratio (compute-optimal data budget). Training data comes from FineWeb-Edu. The script supports distributed training via torchrun; on a single GPU it falls back to gradient accumulation to reach the same effective batch size.
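A single-GPU sketch, reusing the launch pattern above with --nproc_per_node=1 (the flag values here, such as --depth=12 and --run=d12-debug, are illustrative assumptions, not a tested recipe):
# single GPU: gradient accumulation makes up the effective batch size
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- \
--depth=12 \
--run=d12-debug \
--device-batch-size=8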
Chat model SFT
After pretraining, run scripts/chat_sft.py for supervised fine-tuning on chat data (SmolTalk, ARC, GSM8K, etc.). See guides for customizing identity and abilities.
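A minimal launch sketch, assuming chat_sft follows the same torchrun pattern as base_train (its specific flags are not shown here; consult the script for its actual arguments):
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft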
Tokenizer training
Train a BPE tokenizer with scripts/tok_train.py. Evaluate with scripts/tok_eval.py.
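A minimal sketch, assuming both scripts can be invoked with python -m and run with their default settings (see each script for its options):
# train the BPE tokenizer, then evaluate it
python -m scripts.tok_train
python -m scripts.tok_eval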
Running on CPU / MPS
runs/runcpu.sh shows a minimal example for CPU or Apple Silicon (MPS). The model is dramatically shrunk so training fits in a reasonable time. You won't get strong results, but it's useful for debugging or running on machines without GPUs.
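To try it, mirroring the speedrun invocation:
bash runs/runcpu.sh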
Scaling laws & research
For research workflows, see runs/scaling_laws.sh and runs/miniseries.sh. For quick iteration, you can train a d12 model (roughly GPT-1 sized) in about 5 minutes. See the research docs.
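A quick-iteration sketch, reusing the jan29 launch pattern with a smaller model (the flag values are illustrative assumptions, not a tuned recipe):
# d12: small enough to train in minutes on an 8×H100 node
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=12 \
--run=d12-quick \
--device-batch-size=16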