u/EigenMog

Includes the core transformer pipeline:

- tiled GEMM kernels

- fused attention + softmax kernels

- multi-head causal self-attention

- transformer blocks + MLPs

- KV cache + autoregressive token generation, etc.
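
For readers new to the last two items, here is a rough sketch of what one decode step with a KV cache does for a single attention head. It is plain C++ on the CPU, meant as a reference to compare GPU output against, and is not the repo's CUDA code. `KVCache` and `attend_step` are hypothetical names chosen for illustration, and a single head with row-major storage is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical single-head KV cache: keys and values are stored row-major,
// one row of length head_dim per position generated so far.
struct KVCache {
    std::size_t head_dim;
    std::vector<float> keys;    // [n_positions * head_dim]
    std::vector<float> values;  // [n_positions * head_dim]

    explicit KVCache(std::size_t d) : head_dim(d) { assert(d > 0); }

    std::size_t size() const { return keys.size() / head_dim; }

    void append(const std::vector<float>& k, const std::vector<float>& v) {
        assert(k.size() == head_dim && v.size() == head_dim);
        keys.insert(keys.end(), k.begin(), k.end());
        values.insert(values.end(), v.begin(), v.end());
    }
};

// One decode step: append the new token's key/value, then attend from its
// query over every cached position. Causality holds by construction because
// the cache only holds positions up to and including the current one.
std::vector<float> attend_step(KVCache& cache,
                               const std::vector<float>& q,
                               const std::vector<float>& k,
                               const std::vector<float>& v) {
    assert(q.size() == cache.head_dim);
    cache.append(k, v);

    const std::size_t d = cache.head_dim;
    const std::size_t n = cache.size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    // Scaled dot-product scores against all cached keys.
    std::vector<float> scores(n);
    float max_score = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < n; ++i) {
        float dot = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
            dot += q[j] * cache.keys[i * d + j];
        }
        scores[i] = dot * scale;
        max_score = std::max(max_score, scores[i]);
    }

    // Numerically stable softmax: subtract the max before exponentiating.
    float denom = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        scores[i] = std::exp(scores[i] - max_score);
        denom += scores[i];
    }

    // Output is the softmax-weighted sum of cached values.
    std::vector<float> out(d, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        const float w = scores[i] / denom;
        for (std::size_t j = 0; j < d; ++j) {
            out[j] += w * cache.values[i * d + j];
        }
    }
    return out;
}
```

The point of the cache is that each new token costs one pass over the stored keys and values instead of recomputing attention for the whole prefix.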

Also built the runtime around it:
- weight loading, tensor routing, CUDA memory management, generation loop, profiling, benchmarking, etc.

Current peak throughput is ~190 tokens/sec on GPT-2.
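
As a rough illustration of how a tokens/sec figure like this can be measured, here is a minimal host-side timing sketch in C++. It is an assumption about the general approach, not the repo's benchmark code, and `tokens_per_second`, `ms_per_token`, and `time_generation` are hypothetical helper names. At ~190 tokens/sec, each token takes roughly 5.3 ms.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Throughput from a token count and a wall-clock duration in seconds.
double tokens_per_second(std::size_t n_tokens, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return static_cast<double>(n_tokens) / elapsed_seconds;
}

// Average latency per token, in milliseconds, for a given throughput.
double ms_per_token(double tokens_per_sec) {
    assert(tokens_per_sec > 0.0);
    return 1000.0 / tokens_per_sec;
}

// Times n_tokens calls of a user-supplied decode step and returns tokens/sec.
// Note: CUDA kernel launches are asynchronous, so the step (hypothetical here)
// should synchronize with the device before returning, or the host clock will
// only measure launch overhead.
template <typename StepFn>
double time_generation(StepFn&& generate_one_token, std::size_t n_tokens) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n_tokens; ++i) {
        generate_one_token();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return tokens_per_second(n_tokens, elapsed.count());
}
```

A few warm-up iterations before timing usually give steadier numbers, since the first launches include one-time setup costs.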

Everything was profiled and tested on my RTX 3050 Laptop GPU with only 4GB VRAM.

Definitely not the fastest implementation possible and there’s still a lot that could be improved, but this project was mainly about learning CUDA, transformer inference, profiling, and GPU systems properly from scratch.

Repo for more details:

https://github.com/Mog9/gpt2-inference

Built a GPT-2 inference engine from scratch in CUDA.