
Built a GPT-2 inference engine from scratch in CUDA.
Includes the core transformer pipeline:
- tiled GEMM kernels
- fused attention + softmax kernels
- multi-head causal self-attention
- transformer blocks + MLPs
- KV cache + autoregressive token generation, etc.
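
For anyone who hasn't seen the first item before, here is roughly what a tiled GEMM kernel looks like. This is the generic textbook shared-memory version, not the code from the repo; `TILE`, `tiled_gemm`, and `launch_tiled_gemm` are names made up for this sketch, and the 16x16 tile size is just an assumption.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 16;  // tile edge length (illustrative, not from the repo)

// Computes C[M x N] = A[M x K] * B[K x N]; all matrices are row-major float.
__global__ void tiled_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C this thread owns
    float acc = 0.0f;

    // Walk along the K dimension one tile at a time.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;

        // Stage one tile of A and one tile of B in shared memory,
        // padding with zeros where the tile hangs past the matrix edge.
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next overwrite
    }

    if (row < M && col < N) C[row * N + col] = acc;
}

// Host-side launcher; dA, dB, dC must be device pointers.
void launch_tiled_gemm(const float* dA, const float* dB, float* dC,
                       int M, int N, int K) {
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiled_gemm<<<grid, block>>>(dA, dB, dC, M, N, K);
}
```

The point of the tiling is that each value loaded from global memory gets reused `TILE` times out of shared memory instead of being fetched again for every multiply.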
Also built the runtime around it:
- weight loading, tensor routing, CUDA memory management, generation loop, profiling, benchmarking, etc.
Current peak throughput is about 190 tokens/sec on GPT-2.
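
In case it helps anyone reproduce a tokens/sec number like this, below is a minimal sketch of timing a decode loop with CUDA events. It is not the benchmarking code from the repo: `tokens_per_second` and `decode_step` are hypothetical names, and the sketch assumes each call to `decode_step` enqueues the GPU work for one token on the default stream.

```cuda
#include <cuda_runtime.h>

// Runs decode_step(i) for i in [0, n_tokens) and returns tokens per second.
// decode_step is a hypothetical callable that enqueues one token's worth of
// GPU work on the default stream.
template <typename Step>
float tokens_per_second(Step decode_step, int n_tokens) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);              // timestamp before the first token
    for (int i = 0; i < n_tokens; ++i) decode_step(i);
    cudaEventRecord(stop);               // timestamp after the last token
    cudaEventSynchronize(stop);          // wait until the GPU reaches `stop`

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // Guard against a zero reading if the loop did no measurable work.
    return ms > 0.0f ? n_tokens / (ms / 1000.0f) : 0.0f;
}
```

Running a few untimed warm-up steps first is usually worthwhile, since the first kernel launches include one-time setup cost that would drag the number down.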
Everything was profiled and tested on my RTX 3050 Laptop GPU, which has only 4 GB of VRAM.
It's definitely not the fastest implementation possible, and there's still a lot that could be improved, but this project was mainly about properly learning CUDA, transformer inference, profiling, and GPU systems from scratch.
repo for more details:
u/EigenMog — 4 days ago