Transformer built from first principles
I built and trained a small transformer from scratch to understand the language-model pipeline below the framework gloss: tokenizer, attention stack, training loop, loss curves, and the small decisions that make a model actually learn.
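The core of that attention stack is scaled dot-product self-attention with a causal mask. As a minimal sketch of the mechanism (illustrative names and sizes, not the project's actual code):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq, d_model); projection weights are (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # similarity scores, scaled by sqrt(head dim) to keep softmax well-behaved
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # causal mask: each position attends only to itself and earlier positions
    mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 5, 16)          # one sequence of 5 tokens, d_model=16
proj = lambda: torch.randn(16, 8) / math.sqrt(16)
out = self_attention(x, proj(), proj(), proj())
print(out.shape)  # torch.Size([1, 5, 8])
```

A full block wraps this in multiple heads plus a feed-forward layer, residual connections, and layer norm, but this single head is the piece that has to be right before anything else learns.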
Role: Builder
System Surface: Tokenizer, attention, training
Stack: Python and PyTorch
Training Setup: Mixed precision on 2M samples
Once the tokenizer and attention stack stabilized, training loss fell from 9.4 to 2.4 and the curve cleaned up noticeably. The point was not to compete with frontier systems; it was to build the machinery closely enough to reason about it.
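The mixed-precision setup can be sketched with PyTorch's autocast and gradient scaler. This is a toy stand-in (a linear model on random data) for the shape of the training loop, not the project's actual code:

```python
import torch
from torch import nn

# toy model and optimizer standing in for the real transformer and corpus
model = nn.Linear(32, 10)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# GradScaler is a no-op when mixed precision is disabled (CPU fallback here)
scaler = torch.cuda.amp.GradScaler(enabled=device == "cuda")

for step in range(3):
    x = torch.randn(8, 32, device=device)
    y = torch.randint(0, 10, (8,), device=device)
    opt.zero_grad(set_to_none=True)
    # autocast runs the forward pass in half precision where it is safe
    with torch.autocast(device_type=device, enabled=device == "cuda"):
        loss = nn.functional.cross_entropy(model(x), y)
    # the scaler rescales the loss so fp16 gradients do not underflow
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
```

The scaler/autocast pair is what makes half-precision training numerically stable: activations run in fp16, while the loss is scaled up before backward so small gradients survive the reduced dynamic range.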
Build Period: 2025 Project Cycle
Project Status: Public project
Primary Focus: Model mechanics
Output Format: Transformer pipeline