nanochat FAQ

Common questions about nanochat—cost, hardware, training, and usage.

How much does it cost to train a GPT-2 grade model?

~$73–75 on an 8×H100 node at ~$24/hour for ~3 hours. The leaderboard tracks the fastest runs.

Can I run on a single GPU?

Yes. Launch the training scripts directly with python instead of torchrun; the code falls back to gradient accumulation to simulate the multi-GPU batch. Training takes ~8× longer but produces similar results.
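
The reason results stay similar is that averaging gradients over several small microbatches gives the same gradient as one large batch. A minimal sketch (toy linear model, not nanochat code) showing the equivalence:

```python
# Gradient accumulation: averaging microbatch gradients reproduces the
# gradient of one large batch, so a single GPU can emulate the 8-GPU
# batch size at ~8x the wall-clock time.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_size):
    """Average per-microbatch gradients, as gradient accumulation does."""
    grads = []
    for i in range(0, len(xs), micro_size):
        grads.append(grad_mse(w, xs[i:i + micro_size], ys[i:i + micro_size]))
    return sum(grads) / len(grads)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]
w = 0.5

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_size=2)
print(abs(full - accum) < 1e-9)  # the two gradients match
```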

What if my GPU has less than 80GB VRAM?

Reduce --device_batch_size (e.g., 16, 8, 4, 2, or 1) until the model fits in memory; the scripts compensate with gradient accumulation, so results are unchanged. See getting started.
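
The bookkeeping behind this is simple: the tokens processed per optimizer step stay fixed, so halving the device batch size doubles the number of accumulation steps. A sketch with hypothetical numbers (the totals and sequence length here are illustrative, not nanochat's exact values):

```python
# Hypothetical configuration for illustration only.
total_batch_tokens = 524_288   # assumed tokens per optimizer step
seq_len = 2048                 # assumed sequence length
world_size = 1                 # single GPU

for device_batch_size in (32, 16, 8, 4, 2, 1):
    tokens_per_fwd = device_batch_size * seq_len * world_size
    accum_steps = total_batch_tokens // tokens_per_fwd
    # smaller per-device batches -> proportionally more accumulation steps
    print(device_batch_size, accum_steps)
```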

Does it run on CPU or Apple Silicon?

Yes. runs/runcpu.sh shows a minimal example. The model is shrunk drastically, so you won't get strong results, but it runs. See training.

What is the CORE metric?

CORE comes from the DCLM paper and is the primary benchmark for base-model quality. The GPT-2 (1.6B) target is 0.256525. nanochat evaluates it in nanochat/core_eval.py. See research.
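
The idea behind CORE is centered accuracy: each task's raw accuracy is rescaled so random guessing maps to 0 and perfect accuracy maps to 1, then the rescaled scores are averaged. A sketch of that computation (illustrative numbers, not nanochat's exact evaluation code):

```python
# Centered accuracy: 0 = random-guess baseline, 1 = perfect.
def centered_accuracy(acc, random_baseline):
    return (acc - random_baseline) / (1.0 - random_baseline)

def core_score(results):
    """results: list of (accuracy, random_baseline) pairs, one per task."""
    centered = [centered_accuracy(a, b) for a, b in results]
    return sum(centered) / len(centered)

# Illustrative task results only:
tasks = [(0.62, 0.25), (0.55, 0.50), (0.40, 0.25)]
print(round(core_score(tasks), 4))  # -> 0.2644
```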

Where can I try nanochat without training?

A hosted instance is available at nanochat-ai.com, so you can chat with a trained model without running anything yourself.

Where is the source code?

On GitHub at karpathy/nanochat.

How do I ask questions?

Use the Discussions tab on GitHub, or join the #nanochat channel on Discord.

What's the model size?

The speedrun trains a model with roughly 4e19 FLOPs of compute, small enough to run on modest hardware; a 561M-parameter variant can run on devices like a Raspberry Pi. Model size scales with --depth: deeper models are larger and more capable.
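
To see how --depth drives parameter count, here is a rough back-of-the-envelope estimator. The assumptions are mine, not nanochat's exact architecture: width tied to depth as d_model = 64 × depth, ~12 × d_model² parameters per standard transformer block, and a token embedding of vocab_size × d_model.

```python
# Rough parameter estimate under the assumptions stated above;
# the real architecture differs in detail, so treat this as a sketch.
def approx_params(depth, head_dim=64, vocab_size=65_536):
    d_model = depth * head_dim          # width grows with depth
    block = 12 * d_model ** 2           # attention + MLP per layer
    embed = vocab_size * d_model        # token embedding
    return depth * block + embed

for depth in (12, 20, 26):
    print(depth, f"{approx_params(depth) / 1e6:.0f}M")
```

Because width is tied to depth, parameters grow roughly cubically with --depth, which is why small depth changes move cost and capability so much.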

Can I use a different tokenizer?

Yes. Train your own with scripts/tok_train.py. The default setup is a BPE tokenizer trained on the pretraining data. See tokenization for details.
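
For intuition about what BPE training does, here is a toy version (not scripts/tok_train.py): it repeatedly merges the most frequent adjacent symbol pair. The real tokenizer works at the byte level over far more data.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE trainer: greedily merge the most frequent adjacent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # apply the new merge rule
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", num_merges=3)
print(merges)  # first merges capture the shared "lo" / "low" prefix
```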