nanochat Training Pipeline

nanochat's training flow has three stages: tokenizer training, base model pretraining, and chat model SFT (supervised fine-tuning). All scripts live in scripts/ and runs/. The speedrun script orchestrates the full pipeline, but you can also run each stage manually for experimentation.

Speedrun (GPT-2 in ~3 hours)

The reference way to train a GPT-2-grade model is runs/speedrun.sh. Run it on an 8×H100 node:

bash runs/speedrun.sh
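
If you want the run to survive an SSH disconnect and keep a log, one convenient (and entirely optional) way is to launch it inside GNU screen. The -L, -Logfile, and -S flags below are standard screen options (-Logfile needs a reasonably recent screen); the session and log file names are arbitrary:

screen -L -Logfile speedrun.log -S speedrun bash runs/speedrun.sh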

Example manual run (jan29 baseline):

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=24 \
  --run=d24-jan29 \
  --model-tag=d24_jan29 \
  --device-batch-size=16 \
  --sample-every=-1 \
  --save-every=-1 \
  --core-metric-max-per-task=-1 \
  --core-metric-every=3000 \
  --target-param-data-ratio=12

Base model pretraining

Use scripts/base_train.py for pretraining. Key arguments: --depth (the model size dial: more layers, larger model), --device-batch-size (per-device batch size), and --target-param-data-ratio (compute-optimal data budget). Data comes from FineWeb-Edu. The script supports distributed training via torchrun and falls back to gradient accumulation on a single GPU (see the sketch below).
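
A minimal single-GPU sketch of that fallback, reusing only flags that appear in the example above (the depth and batch-size values here are illustrative, not a recommended configuration):

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- \
  --depth=12 \
  --device-batch-size=8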

Chat model SFT

After pretraining, run scripts/chat_sft.py for supervised fine-tuning on chat data (SmolTalk, ARC, GSM8K, etc.). See the guides for customizing identity and abilities.
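
A sketch of the launch, assuming chat_sft follows the same torchrun invocation pattern as base_train and locates the pretrained checkpoint through its defaults (check the script's arguments before relying on this):

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft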

Tokenizer training

Train a BPE tokenizer with scripts/tok_train.py. Evaluate with scripts/tok_eval.py.
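
A minimal sketch, assuming both scripts expose usable defaults when invoked as modules (pass no flags to fall back to those defaults):

python -m scripts.tok_train
python -m scripts.tok_eval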

Running on CPU / MPS

runs/runcpu.sh shows a minimal example for CPU or Apple Silicon (MPS). The model is dramatically shrunk so that training completes in a reasonable time. You won't get strong results, but it's useful for debugging or for running on machines without GPUs.
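
The kind of shrunken invocation runcpu.sh uses looks roughly like the following; all values here are illustrative, so consult runs/runcpu.sh for the real flags:

python -m scripts.base_train -- \
  --depth=4 \
  --device-batch-size=1 \
  --sample-every=-1 \
  --save-every=-1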

Scaling laws & research

For research workflows, use runs/scaling_laws.sh and runs/miniseries.sh. For quick iteration, a d12 (GPT-1 sized) model trains in about 5 minutes (a sketch follows below). See research.
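
For example, a quick d12 run can reuse the manual base_train invocation from the speedrun section with a smaller depth (the run name here is an arbitrary placeholder):

OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=12 \
  --run=d12-quick \
  --device-batch-size=16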