nanochat Datasets

nanochat uses several datasets for pretraining, SFT, and evaluation. Data is handled in nanochat/dataset.py and the tasks/ directory. All datasets are either publicly available or derived from public sources.

Pretraining

FineWeb-Edu — karpathy/fineweb-edu-100b-shuffle, derived from FineWeb-Edu. ~24GB used for pretraining. Download and shard utilities are in the repo.

Chat / SFT

See tasks/smoltalk.py, tasks/arc.py, tasks/gsm8k.py.

Evaluation tasks

Tasks live in tasks/. tasks/customjson.py lets you add custom JSONL conversation data. See file structure.

Custom data

tasks/customjson.py lets you create tasks from arbitrary JSONL conversation files. dev/gen_synthetic_data.py shows synthetic data for identity. See guides.