DeepSeek-style model from scratch
I implemented a compact DeepSeek-style language model in PyTorch with Multi-Head Latent Attention, decoupled RoPE, Mixture-of-Experts routing, Multi-Token Prediction heads, simulated FP8 utilities, and config-driven training and checkpointing.
Role
Model Builder
System Surface
Attention, routing, training
Stack
Python and PyTorch
Model Surface
MLA, MoE, MTP heads
The build is a compact lab for modern language-model mechanics: latent attention, positional-encoding choices, expert routing, multi-token prediction, and training checkpoints that make the model easier to inspect and iterate on.
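To make the expert-routing idea concrete, here is a minimal sketch of top-k gating for a single token. It is written in plain Python rather than PyTorch so it runs without dependencies, and it is not the project's code: the names `top_k_route` and `moe_forward`, the toy scalar experts, and the choice of a softmax over the selected logits are all assumptions made for illustration. The project may use a different gating function, shared experts, or load-balancing terms.

```python
import math


def top_k_route(logits, k=2):
    """Return (expert_indices, gate_weights) for one token.

    logits: one router score per expert (hypothetical values).
    The k highest-scoring experts are kept, and their scores are
    softmax-normalized among themselves so the gate weights sum to 1.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first (ties keep index order).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax over the selected logits only.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=2):
    """Combine the outputs of the selected experts, weighted by their gates."""
    indices, gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in zip(indices, gates))


# Toy experts: scalar functions standing in for feed-forward blocks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]

indices, gates = top_k_route(router_logits, k=2)
output = moe_forward(3.0, router_logits, experts, k=2)
```

Only the selected experts are evaluated, which is what keeps per-token compute roughly constant as the total number of experts grows. In a tensor implementation the same selection would typically be done for a whole batch at once, for example with `torch.topk` over the router output.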
Primary Focus
Language-model mechanics
Output Format
DeepSeek-style pipeline