u/Fabulous-Possible311

Bypassing prompt-stuffing with Conversational Graph Memory (CGM-RAG): Direct KV Cache Injection and in-flight compression on local GPUs
▲ 5 r/AIMLDiscussion+1 crossposts

Bypassing prompt-stuffing with Conversational Graph Memory (CGM-RAG): Direct KV Cache Injection and in-flight compression on local GPUs

Hey everyone,

I wanted to share a project I've been working on to solve prompt-bloat in long-term conversation history handling: Conversational Graph Memory (CGM-RAG).

Standard approaches (like context stuffing) append raw text transcripts to LLM prompts, leading to quadratic $O(L^2)$ attention costs and massive prefill latency. Standard RAG helps but still fills the prompt window with text.

CGM-RAG addresses this by bypassing prompt-stuffing entirely. Instead of feeding text back into the LLM context, it projects retrieved dialogue graph concepts directly into the Key-Value (KV) cache of the model.

How it Works

  1. Retrieval Layer: Dialogue turns are embedded using all-MiniLM-L6-v2 and indexed in a 4-bit quantized vector index (TurboVec). Concept relationships (Subject-Predicate-Object) are parsed and stored in a SQLite Graph Store.
  2. Attention Projection: We use a trainable Memory Encoder Network (MEN). The MEN takes the dense representations of retrieved turns and projects them directly into the layer-wise Key and Value dimensions corresponding to the target LLM's heads.
  3. KV Injection: The projected states are injected directly into the model’s past_key_values dynamic cache prior to prompt evaluation.
  4. Prefill Bypass: Because the KV cache is pre-populated, the LLM skips the heavy prefill phase (encoding history) and moves straight into autoregressive generation utilizing rectangular attention.
  5. In-Flight KV Cache Compression: When VRAM is tight, an asynchronous background compressor groups and quantizes low-salience key-value states along the sequence dimension, using a logit KL-divergence gate to ensure generation quality is not degraded.

Comparative Benchmarks

I ran benchmarks on a laptop GPU (NVIDIA RTX A2000) using gpt2 as the base model and a simulated conversation history. Here is how it compares:

Metric Approach A: Context Stuffing (Baseline) Approach B: Standard RAG (Summary Stuffing) Approach C: TurboVec KV Injection Approach D: CGM-RAG + Compression CGM C vs A Improvement
Input Context Tokens 220 96 21 21 -90.5% Tokens
Virtual Memory Tokens 0 0 8 (KV injected) 45 (Compressed) Bypasses Input Window
Generation Latency 0.4995s 0.3522s 0.4467s 0.5996s -10.6% Latency
Hardware Guards None None VRAM & Thermals VRAM, Thermals & C++ RAM Hardware Secure
  • -90.5% Input Tokens: The prompt sent to the LLM contains only the immediate user turn, keeping the context window pristine.
  • Prefill Speedup: Eliminating the prefill phase yields a 10.6% speedup in overall generation time.
  • KV Compression (Approach D): Yields high sequence savings (e.g. compressing sequence from 68 to 45 positions) to prevent OOM errors on constrained devices, with compression metrics verified via KL divergence.

Workstation Protections & Visualizer

Workstation cards need guardrails. I wrote a C++ library wrapper (safety_guard.dll) to enforce:

  • GPU Mutex Locks: Serializes operations to prevent concurrent allocation race conditions.
  • Thermal Cooldowns: Rest cycles during prototype adapter training to manage heat.
  • VRAM Guard: Triggers cache flushes or safe crashes under 300MB free.

The project runs an interactive CLI chat shell and boots a local HTTP visualization dashboard showing the vis.js Concept Map, a Chart.js sequential PCA trajectory of conversation embeddings, log streaming, and system resource gauges.

Check out the code, scripts, and benchmark configurations: https://github.com/LovekeshAnand/Nyxen-Memory

Would love to hear your thoughts on direct KV cache injection and caching techniques!

It's all vibe coded!!!

u/Fabulous-Possible311 — 3 days ago