I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.
Hey everyone,
I’ve been using stuff like vLLM and TensorRT-LLM for a while, but I realized I didn't actually understand how they were squeezing so much performance out of GPUs. To figure it out, I decided to just build one myself.
I put together an open-source project called TokenForge. It’s not another API wrapper or chatbot UI—it’s a barebones inference engine built from the ground up in PyTorch and CUDA just to test out how all these optimization tricks actually work.
Here’s what I ended up implementing:
- Continuous Batching: Instead of waiting for a whole batch to finish, the scheduler slots in new requests as soon as a running sequence completes, so the GPU spends far less time sitting idle.
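- Paged KV Cache: I essentially copied the homework of PagedAttention to cut down on VRAM fragmentation. The KV cache is pre-allocated as a pool of fixed-size blocks that sequences grab as they grow, so you don't run out of memory just because the free space is scattered in unusable chunks.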
- Custom Kernels: Standard PyTorch was bottlenecking things, so I wrote some raw C++/CUDA and Triton kernels (including a custom Flash-Attention setup).
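- Speculative Decoding: Paired the main model with a tiny draft model that guesses several tokens ahead. The main model then checks those guesses in a single forward pass, which speeds up generation whenever the guesses are accepted.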
- A Live Dashboard: Hooked everything up to a FastAPI backend with a UI that tracks VRAM, power draw, and tokens/sec in real time.
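To give a feel for how the first two items fit together, here's a toy sketch in plain Python. It is not TokenForge's actual code: `BlockAllocator`, `Sequence`, `run`, and `BLOCK_SIZE` are names made up for illustration, the model call is left out entirely, and there's no preemption, so it simply raises if the pool runs dry mid-decode.

```python
from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value for the sketch)


class BlockAllocator:
    """Hands out fixed-size KV-cache block ids from a pre-allocated pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free = deque(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.popleft()

    def release(self, blocks: list) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    seq_id: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    blocks: list = field(default_factory=list)  # per-sequence block table

    @property
    def num_tokens(self) -> int:
        return self.prompt_len + self.generated

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def run(requests, alloc: BlockAllocator, max_batch: int) -> dict:
    """Simulate decode steps; returns {seq_id: step at which it finished}."""
    waiting, running, finish_step, step = deque(requests), [], {}, 0
    while waiting or running:
        # Continuous batching: admit requests whenever a batch slot is open
        # and the pool has enough blocks to hold the prompt.
        while waiting and len(running) < max_batch:
            need = blocks_needed(waiting[0].prompt_len)
            if need > len(alloc.free):
                break
            seq = waiting.popleft()
            seq.blocks = [alloc.allocate() for _ in range(need)]
            running.append(seq)
        if not running:
            raise MemoryError("next request does not fit in the pool")

        step += 1
        for seq in running:
            # Paged cache: grab a new block only when the next token
            # would spill past the blocks this sequence already owns.
            if blocks_needed(seq.num_tokens + 1) > len(seq.blocks):
                seq.blocks.append(alloc.allocate())
            seq.generated += 1  # stand-in for one real decode step

        # Retire finished sequences right away and recycle their blocks.
        for seq in [s for s in running if s.finished]:
            alloc.release(seq.blocks)
            finish_step[seq.seq_id] = step
        running = [s for s in running if not s.finished]
    return finish_step
```

With three requests of prompt length 4 that want 2, 5, and 2 new tokens, `max_batch=2`, and 8 blocks, the third request is admitted the moment the first one finishes and completes at step 4. Waiting for the whole first batch would have pushed it to step 7.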
Honestly, this was mostly a massive learning project for me to understand memory bandwidth and GPU scheduling. If you're curious about how LLMs actually run at a low level, or just want to roast my C++ code, I'd love some feedback!
https://github.com/prathamsingh404/TokenForge-GPU-Accelerated-LLM-Inference-Research-Platform