nanochat FAQ
Common questions about nanochat—cost, hardware, training, and usage.
How much does it cost to train a GPT-2 grade model?
~$73–75 on an 8×H100 node at ~$24/hour for ~3 hours. The leaderboard tracks the fastest runs.
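For reference, a minimal sketch of launching the full run, assuming the speedrun.sh script at the repo root and using screen so the ~3-hour job survives a dropped SSH session:

```bash
# On the 8xH100 node: run the speedrun inside screen and log all output.
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
```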
Can I run on a single GPU?
Yes. Omit torchrun and launch the training scripts with plain python; the code compensates with gradient accumulation. Training takes ~8× longer but produces similar results.
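A sketch of the two launch modes, assuming the scripts.base_train entry point from the repo layout (check your checkout for the exact module path):

```bash
# Multi-GPU: one process per GPU via torchrun.
torchrun --standalone --nproc_per_node=8 -m scripts.base_train

# Single GPU: plain python; the code falls back to gradient accumulation.
python -m scripts.base_train
```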
What if my GPU has less than 80GB VRAM?
Reduce --device_batch_size (e.g., 16, 8, 4, 2, or 1) until the model fits in memory; the code compensates with more gradient accumulation, so results are equivalent. See getting started.
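For example, using the same assumed entry point as above:

```bash
# Halve --device_batch_size until activations fit in VRAM; gradient
# accumulation increases to keep the effective batch size unchanged.
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --device_batch_size=8
```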
Does it run on CPU or Apple Silicon?
Yes. runs/runcpu.sh shows a minimal example. The model is shrunk—you won't get strong results, but it runs. See training.
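For example (the script is named above; expect toy-grade results):

```bash
# Trains a heavily shrunk model on CPU or Apple Silicon.
bash runs/runcpu.sh
```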
What is the CORE metric?
CORE comes from the DCLM paper and is nanochat's primary benchmark during pretraining. The GPT-2 (1.6B) target is 0.256525. nanochat evaluates it in nanochat/core_eval.py. See research.
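As a sketch of how the score is computed, assuming the DCLM paper's centered-accuracy definition (per-task accuracy $\mathrm{acc}_i$, random-guess baseline $b_i$, $N$ tasks):

$$\mathrm{CORE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\mathrm{acc}_i - b_i}{1 - b_i}$$

Centering means 0 corresponds to random guessing and 1 to perfect accuracy, which makes scores comparable across tasks with different numbers of answer choices.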
Where can I try nanochat without training?
A hosted instance is available at nanochat-ai.com.
Where is the source code?
In the nanochat repository on GitHub: github.com/karpathy/nanochat.
How do I ask questions?
Use the Discussions tab on GitHub, or join the #nanochat channel on Discord.
What's the model size?
The speedrun trains a model with roughly 4e19 FLOPs of compute, small enough to run on modest hardware. The 561M-parameter variant can run on devices like a Raspberry Pi. Model size scales with --depth; deeper models are larger and more capable.
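To train a deeper model than the speedrun default, pass --depth to the training script (same assumed entry point as above; the value 26 is illustrative):

```bash
# --depth sets the number of transformer layers; the repo derives the
# remaining model dimensions from it.
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
```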
Can I use a different tokenizer?
Yes. Train your own with scripts/tok_train.py. The default setup uses a BPE tokenizer trained on the pretraining data. See tokenization for details.
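A minimal sketch, assuming scripts/tok_train.py's defaults and a companion scripts/tok_eval.py for measuring compression (verify both against the repo):

```bash
# Train a BPE tokenizer on the pretraining shards (script defaults assumed;
# see scripts/tok_train.py for the available arguments).
python -m scripts.tok_train

# Optional: report the trained tokenizer's compression on sample text.
python -m scripts.tok_eval
```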