nanochat File Structure
nanochat is a minimal codebase—under 10K lines. Here's the full layout from the GitHub repo. Understanding the structure helps when you want to modify training, add datasets, or extend the pipeline.
Root
LICENSE — MIT
README.md — Project overview
pyproject.toml — Dependencies
uv.lock — Lock file
.python-version — Python version
nanochat/ (core module)
__init__.py — Package init
checkpoint_manager.py — Save/load checkpoints
common.py — Utilities
core_eval.py — CORE score (DCLM)
dataloader.py — Distributed tokenizing dataloader
dataset.py — FineWeb pretraining data
engine.py — KV-cache inference
execution.py — Python code execution tool
gpt.py — GPT Transformer
loss_eval.py — Bits per byte
optim.py — AdamW + Muon
report.py — Report utilities
tokenizer.py — BPE tokenizer
ui.html — Chat frontend
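As a concrete example of what one of these files measures: loss_eval.py reports bits per byte, which normalizes cross-entropy loss by the raw byte length of the text so that models with different tokenizers can be compared. Here is a minimal sketch of that standard conversion (illustrative, not the repo's actual code):

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Dividing by ln(2) turns nats into bits; scaling by tokens/bytes removes
    the tokenizer from the metric, so different vocabularies are comparable.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * num_tokens / num_bytes

# Example: a loss of 1.0 nat/token on text where each token covers ~4.8 bytes
print(bits_per_byte(1.0, num_tokens=1000, num_bytes=4800))  # ~0.30 bpb
```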
scripts/
base_eval.py — CORE, bits/byte, samples
base_train.py — Base model training
chat_cli.py — CLI chat
chat_eval.py — Chat eval tasks
chat_rl.py — Reinforcement learning
chat_sft.py — SFT training
chat_web.py — Web chat server
tok_eval.py — Tokenizer compression
tok_train.py — Tokenizer training
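These scripts compose into an end-to-end pipeline: tokenizer, then pretraining, then SFT, then optional RL, then evals and serving. The sketch below shows that ordering, assuming each script is runnable as a module from the repo root; flags are omitted, and the run scripts in the next section typically wrap these stages (with torchrun for multi-GPU jobs), so this single-process loop is illustrative rather than the repo's actual launcher:

```python
import subprocess
import sys

# Pipeline order implied by the scripts above (flags omitted; illustrative only).
STAGES = [
    "scripts.tok_train",   # train the BPE tokenizer
    "scripts.tok_eval",    # check its compression ratio
    "scripts.base_train",  # pretrain the GPT base model
    "scripts.base_eval",   # CORE score, bits/byte, samples
    "scripts.chat_sft",    # supervised finetuning for chat
    "scripts.chat_rl",     # optional reinforcement learning
    "scripts.chat_eval",   # chat benchmark evals
]

for stage in STAGES:
    subprocess.run([sys.executable, "-m", stage], check=True)
```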
runs/
speedrun.sh — GPT-2 in ~3 hours
miniseries.sh — Miniseries training
scaling_laws.sh — Scaling experiments
runcpu.sh — CPU/MPS minimal example
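These are thin orchestration scripts over the Python modules above. Going by the layout here, you would launch one from the repo root, e.g. bash runs/speedrun.sh, ideally inside screen or tmux since a full run takes hours.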
tasks/
arc.py — Multiple choice science
common.py — TaskMixture, TaskSequence
customjson.py — Arbitrary JSONL convos
gsm8k.py — Grade school math
humaneval.py — Python coding
mmlu.py — Broad multiple choice
smoltalk.py — SmolTalk dataset
spellingbee.py — Spelling/counting
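Each task file exposes a stream of examples, and common.py combines them for training. Here is a sketch of the idea behind a task mixture: sample from several task streams with fixed weights so finetuning sees a blended diet of data. The actual TaskMixture in tasks/common.py may differ in interface and sampling details; this is an assumed signature, not the repo's API:

```python
import random

# Hypothetical weighted task mixture; the real TaskMixture in tasks/common.py
# may work differently. Each stream is an iterator of training examples.
def mixture(streams, weights, seed=0):
    """Yield examples from several iterators, chosen with fixed probabilities."""
    rng = random.Random(seed)
    while True:
        stream = rng.choices(streams, weights=weights, k=1)[0]
        yield next(stream)

# Usage (made-up weights): 70% SmolTalk chat, 20% GSM8K math, 10% spelling drills
# blended = mixture([smoltalk_iter, gsm8k_iter, spelling_iter], [0.7, 0.2, 0.1])
```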
dev/
gen_synthetic_data.py — Example synthetic data for identity infusion
repackage_data_reference.py — Reference for pretraining data shard generation
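Since customjson.py ingests arbitrary JSONL conversations, identity infusion amounts to writing synthetic dialogues in that format and mixing them into finetuning. A minimal sketch, assuming an OpenAI-style role/content message schema; the exact schema is whatever tasks/customjson.py expects:

```python
import json

# Hypothetical synthetic identity conversation; the message schema here is an
# assumption, not necessarily what tasks/customjson.py parses.
rows = [
    {"messages": [
        {"role": "user", "content": "What's your name?"},
        {"role": "assistant", "content": "I'm nanochat, a small open-source chat model."},
    ]},
]

with open("identity_conversations.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```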
tests/
test_engine.py — Tests for the inference engine. Run them with the test runner of your choice, e.g. python -m pytest tests/test_engine.py.