nanochat Research
If you're a researcher, nanochat offers scripts and benchmarks for improving micro model training. The main goal: beat GPT-2 in less wall-clock time. The codebase is minimal enough to iterate quickly: change something, run a d12 or d16, and see if it helped.
Key scripts
runs/scaling_laws.sh — Scaling law experiments
runs/miniseries.sh — Miniseries of models at increasing scales
See the Jan 7 miniseries v1 discussion for documentation.
Quick iteration
For quick ~5-minute pretraining runs, train a d12 (GPT-1 sized) model:
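# note: the intervals below are chosen to effectively skip CORE evaluation,
# sampling, and checkpointing during the run, keeping iteration fast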
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=12 \
--run="d12" \
--model-tag="d12" \
--core-metric-every=999999 \
--sample-every=-1 \
--save-every=-1
Change something, re-run, and see if it helped. Iterate on d12, d16, etc.
Approach
Use depth as the single dial of complexity: sweeping depth yields a series of increasingly powerful models. Set the data budget to compute-optimal, train a miniseries, and compare it against the GPT-2 and GPT-3 model series. Beating GPT-2 in less wall-clock time is the current target.
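To make the compute-optimal setup concrete, here is a back-of-the-envelope sketch in Python. It assumes the width-tied-to-depth convention (width = 64 × depth), the standard ~12 · depth · width² estimate for non-embedding transformer parameters, and the Chinchilla heuristic of ~20 tokens per parameter; these numbers are illustrative assumptions, not taken from this page.

# Back-of-the-envelope compute-optimal budget per depth.
# Assumptions (illustrative): width = 64 * depth, non-embedding
# params ~= 12 * depth * width**2, and ~20 tokens per parameter.

def budget(depth: int, tokens_per_param: float = 20.0) -> tuple[int, float]:
    width = 64 * depth                  # width tied to depth: one dial
    params = 12 * depth * width ** 2    # rough non-embedding parameter count
    tokens = tokens_per_param * params  # compute-optimal data budget
    return params, tokens

for d in (12, 16, 20, 26):
    p, t = budget(d)
    print(f"d{d}: ~{p / 1e6:.0f}M params, ~{t / 1e9:.1f}B tokens")

Under these assumptions a d12 lands near GPT-1 scale, consistent with the quick-iteration recipe above.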
CORE metric
The CORE score (from the DCLM paper) is the primary benchmark. GPT-2 (1.6B) target: 0.256525. nanochat evaluates this in nanochat/core_eval.py. Beating this score in the least wall-clock time is the main research target—currently ~3 hours on 8×H100.
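For intuition, here is a minimal sketch of how CORE aggregates per-task results, assuming the DCLM recipe of re-centering each task's accuracy against its random-guess baseline and averaging the centered scores. The task names and numbers are hypothetical; see nanochat/core_eval.py for the actual implementation.

def centered_accuracy(acc: float, baseline: float) -> float:
    # 0.0 = chance-level performance, 1.0 = perfect
    return (acc - baseline) / (1.0 - baseline)

def core_score(results: dict[str, tuple[float, float]]) -> float:
    # results maps task -> (accuracy, random-guess baseline)
    centered = [centered_accuracy(acc, base) for acc, base in results.values()]
    return sum(centered) / len(centered)

# hypothetical numbers, just to show the shape of the computation
print(core_score({"hellaswag": (0.40, 0.25), "arc_easy": (0.55, 0.25)}))  # 0.30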
Community & discussions
See the GitHub Discussions for guides, the Jan 7 miniseries documentation, and community experiments. New ideas for speeding up time-to-GPT-2 are especially welcome.