nanochat File Structure
nanochat is a minimal codebase—under 10K lines. Here's the full layout from the GitHub repo. Understanding the structure helps when you want to modify training, add datasets, or extend the pipeline.
Root
LICENSE — MIT
README.md — Project overview
pyproject.toml — Dependencies
uv.lock — Lock file
.python-version — Python version
nanochat/ (core module)
__init__.py — Package init
checkpoint_manager.py — Save/load checkpoints
common.py — Utilities
core_eval.py — CORE score (DCLM)
dataloader.py — Distributed tokenizing dataloader
dataset.py — FineWeb pretraining data
engine.py — KV-cache inference
execution.py — Python code execution tool
gpt.py — GPT Transformer
loss_eval.py — Bits per byte
optim.py — AdamW + Muon
report.py — Report utilities
tokenizer.py — BPE tokenizer
ui.html — Chat frontend
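As a concrete example of what one of these files measures: loss_eval.py reports bits per byte, which normalizes cross-entropy loss by the raw byte length of the text so that models with different tokenizers can be compared. Here is a minimal sketch of that standard conversion (illustrative, not the repo's actual code):

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Dividing by ln(2) turns nats into bits; scaling by tokens/bytes removes
    the tokenizer from the metric, so different vocabularies are comparable.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * num_tokens / num_bytes

# Example: a loss of 1.0 nat/token on text where each token covers ~4.8 bytes
print(bits_per_byte(1.0, num_tokens=1000, num_bytes=4800))  # ~0.30 bpb
```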
scripts/
base_eval.py — CORE, bits/byte, samples
base_train.py — Base model training
chat_cli.py — CLI chat
chat_eval.py — Chat eval tasks
chat_rl.py — Reinforcement learning
chat_sft.py — SFT training
chat_web.py — Web chat server
tok_eval.py — Tokenizer compression
tok_train.py — Tokenizer training
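These scripts compose into an end-to-end pipeline: tokenizer, then pretraining, then SFT, then optional RL, then evals and serving. The sketch below shows that ordering, assuming each script is runnable as a module from the repo root; flags are omitted, and the run scripts in the next section typically wrap these stages (with torchrun for multi-GPU jobs), so this single-process loop is illustrative rather than the repo's actual launcher:

```python
import subprocess
import sys

# Pipeline order implied by the scripts above (flags omitted; illustrative only).
STAGES = [
    "scripts.tok_train",   # train the BPE tokenizer
    "scripts.tok_eval",    # check its compression ratio
    "scripts.base_train",  # pretrain the GPT base model
    "scripts.base_eval",   # CORE score, bits/byte, samples
    "scripts.chat_sft",    # supervised finetuning for chat
    "scripts.chat_rl",     # optional reinforcement learning
    "scripts.chat_eval",   # chat benchmark evals
]

for stage in STAGES:
    subprocess.run([sys.executable, "-m", stage], check=True)
```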
runs/
speedrun.sh — GPT-2 in ~3 hours
miniseries.sh — Miniseries training
scaling_laws.sh — Scaling experiments
runcpu.sh — CPU/MPS minimal example
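These are thin orchestration scripts over the Python modules above. Going by the layout here, you would launch one from the repo root, e.g. bash runs/speedrun.sh, ideally inside screen or tmux since a full run takes hours.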
tasks/
arc.py — Multiple choice science
common.py — TaskMixture, TaskSequence
customjson.py — Arbitrary JSONL convos
gsm8k.py — Grade school math
humaneval.py — Python coding
mmlu.py — Broad multiple choice
smoltalk.py — SmolTalk dataset
spellingbee.py — Spelling/counting
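Each task file exposes a stream of examples, and common.py combines them for training. Here is a sketch of the idea behind a task mixture: sample from several task streams with fixed weights so finetuning sees a blended diet of data. The actual TaskMixture in tasks/common.py may differ in interface and sampling details; this is an assumed signature, not the repo's API:

```python
import random

# Hypothetical weighted task mixture; the real TaskMixture in tasks/common.py
# may work differently. Each stream is an iterator of training examples.
def mixture(streams, weights, seed=0):
    """Yield examples from several iterators, chosen with fixed probabilities."""
    rng = random.Random(seed)
    while True:
        stream = rng.choices(streams, weights=weights, k=1)[0]
        yield next(stream)

# Usage (made-up weights): 70% SmolTalk chat, 20% GSM8K math, 10% spelling drills
# blended = mixture([smoltalk_iter, gsm8k_iter, spelling_iter], [0.7, 0.2, 0.1])
```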
dev/
gen_synthetic_data.py — Example synthetic data for identity infusion
repackage_data_reference.py — Reference for pretraining data shard generation
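Since customjson.py ingests arbitrary JSONL conversations, identity infusion amounts to writing synthetic dialogues in that format and mixing them into finetuning. A minimal sketch, assuming an OpenAI-style role/content message schema; the exact schema is whatever tasks/customjson.py expects:

```python
import json

# Hypothetical synthetic identity conversation; the message schema here is an
# assumption, not necessarily what tasks/customjson.py parses.
rows = [
    {"messages": [
        {"role": "user", "content": "What's your name?"},
        {"role": "assistant", "content": "I'm nanochat, a small open-source chat model."},
    ]},
]

with open("identity_conversations.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```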
tests/
test_engine.py — Tests for the inference engine. Run them with the test runner of your choice, e.g. python -m pytest tests/test_engine.py.