▲ 18 r/Vllm

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Hey everyone,

I’ve been using stuff like vLLM and TensorRT-LLM for a while, but I realized I didn't actually understand how they were squeezing so much performance out of GPUs. To figure it out, I decided to just build one myself.

I put together an open-source project called TokenForge. It’s not another API wrapper or chatbot UI—it’s a barebones inference engine built from the ground up in PyTorch and CUDA just to test out how all these optimization tricks actually work.

Here’s what I ended up implementing:

  • Continuous Batching: Instead of waiting for a whole batch to finish, the scheduler admits new requests as soon as a running one completes, so freed slots don't sit empty while the longest sequence keeps generating.
  • Paged KV Cache: I essentially copied PagedAttention's homework to cut down on VRAM fragmentation. The cache is pre-allocated as fixed-size blocks, so a long sequence doesn't need one big contiguous chunk and you're less likely to hit a surprise out-of-memory error.
  • Custom Kernels: Standard PyTorch was bottlenecking things, so I wrote some raw C++/CUDA and Triton kernels (including a custom Flash-Attention setup).
  • Speculative Decoding: Paired the main model with a tiny draft model that guesses a few tokens ahead; the main model then verifies those guesses, which speeds up generation whenever they're accepted.
  • A Live Dashboard: Hooked everything up to a FastAPI backend with a UI that tracks VRAM, power draw, and tokens/sec in real time.
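
To make the paged KV cache bullet a bit more concrete, here's a toy sketch of just the bookkeeping side. This is not the actual TokenForge code: the `BlockAllocator` class, its method names, and the sizes are made up for illustration, and there are no real tensors involved, only block IDs.

```python
class BlockAllocator:
    """Toy paged KV-cache bookkeeping (hypothetical, illustration only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size                # tokens per block
        self.free_blocks = list(range(num_blocks))  # pool reserved up front
        self.block_tables = {}                      # seq_id -> physical block IDs
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:
            # Last block is full (or the sequence has no block yet):
            # grab a new one from the shared pool.
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Each sequence only ever holds whole blocks from a shared pool, so the waste per sequence is at most one partially filled block, and freeing a sequence makes its blocks immediately reusable by any other one.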

Honestly, this was mostly a massive learning project for me to understand memory bandwidth and GPU scheduling. If you're curious about how LLMs actually run at a low level, or just want to roast my C++ code, I'd love some feedback!
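
On the scheduling side, the continuous batching idea from the list above can be simulated in a few lines. Again, this is a sketch under assumptions, not what TokenForge actually does: `run_continuous_batching` is a hypothetical name, each request is just a count of tokens to generate, and one loop iteration stands in for one decode step on the GPU.

```python
from collections import deque


def run_continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling (hypothetical helper).

    requests: list of (request_id, num_tokens_to_generate) pairs.
    Returns (request_id, step_finished) pairs in completion order.
    """
    waiting = deque(requests)
    running = {}      # request_id -> tokens still to generate
    finished = []
    step = 0
    while waiting or running:
        # Before every step, fill any free batch slots from the queue.
        while waiting and len(running) < max_batch_size:
            req_id, n_tokens = waiting.popleft()
            running[req_id] = n_tokens
        step += 1
        # One decode step: every running request produces one token.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]   # slot is free again for the next step
                finished.append((req_id, step))
    return finished
```

For example, `run_continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2)` returns `[("a", 1), ("b", 3), ("c", 3)]`: "c" takes over the slot that "a" frees after step 1. With static batches of two, "c" would have to wait for "b" and would only finish at step 5.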

https://github.com/prathamsingh404/TokenForge-GPU-Accelerated-LLM-Inference-Research-Platform

u/Top-Ear-1161 — 9 days ago
▲ 1 r/QuantumComputing+1 crossposts

I built a Hybrid Classical-Quantum Transformer: Replacing Attention Heads with Variational Quantum Circuits

I’ve been working on a hybrid approach to NLP that replaces the attention heads of a classical Transformer with variational quantum circuits (VQCs). I just released v2.0 of my Quantum-Enhanced Sentiment Analysis Engine, and I wanted to share the architecture and some of the key upgrades with this community.
https://github.com/prathamsingh404/Fine_Tuning_LLM-with-help-of-Quantum-VQC

u/Top-Ear-1161 — 9 days ago