Hey everyone,

I wanted to share a project I've been working on to reduce prompt bloat when handling long conversation histories: Conversational Graph Memory (CGM-RAG).

Standard approaches (like context stuffing) append raw text transcripts to the LLM prompt, so attention cost grows quadratically, $O(L^2)$, in the prompt length $L$, and prefill latency grows with it. Standard RAG helps but still fills the prompt window with text.

CGM-RAG addresses this by bypassing prompt-stuffing entirely. Instead of feeding text back into the LLM context, it projects retrieved dialogue graph concepts directly into the Key-Value (KV) cache of the model.

How it Works

Retrieval Layer: Dialogue turns are embedded using all-MiniLM-L6-v2 and indexed in a 4-bit quantized vector index (TurboVec). Concept relationships (Subject-Predicate-Object) are parsed and stored in a SQLite Graph Store.
Attention Projection: We use a trainable Memory Encoder Network (MEN). The MEN takes the dense representations of retrieved turns and projects them directly into the layer-wise Key and Value dimensions corresponding to the target LLM's heads.
KV Injection: The projected states are injected directly into the model’s past_key_values dynamic cache prior to prompt evaluation.
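Prefill Bypass: Because the KV cache is pre-populated, the LLM skips the heavy prefill phase (encoding the history) and only has to process the short current prompt before autoregressive generation. Attention is rectangular: a few new query tokens attend over the longer cached key/value sequence.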
In-Flight KV Cache Compression: When VRAM is tight, an asynchronous background compressor groups and quantizes low-salience key-value states along the sequence dimension, using a logit KL-divergence gate to ensure generation quality is not degraded.
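
To make the KV Injection and Prefill Bypass steps concrete, here is a minimal toy sketch in plain Python. It is not the project's code: the memory key/value vectors are hand-written stand-ins for what the MEN would produce, and the single-head `attend` helper is a hypothetical name used only for illustration. It shows one query attending over a cache that holds injected memory slots followed by the live prompt's slots.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
    return output, weights

# Hypothetical memory slots standing in for the MEN's projected K/V states.
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[10.0, 0.0], [0.0, 10.0]]

# K/V computed from the short live prompt (a single token here).
prompt_keys = [[0.5, 0.5]]
prompt_values = [[1.0, 1.0]]

# "Injection": memory slots sit in the cache ahead of the prompt's own slots.
cache_keys = memory_keys + prompt_keys
cache_values = memory_values + prompt_values

# One query row against three cached columns: the "rectangular" attention shape.
query = [4.0, 0.0]
output, weights = attend(query, cache_keys, cache_values)
print(len(weights), weights.index(max(weights)))
```

The query never sees the history as text; it only sees the extra key/value slots. In a real model the same idea applies per layer and per head, with tensors instead of lists.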

Comparative Benchmarks

I ran benchmarks on a laptop GPU (NVIDIA RTX A2000) using gpt2 as the base model and a simulated conversation history. Here is how it compares:

Metric	Approach A: Context Stuffing (Baseline)	Approach B: Standard RAG (Summary Stuffing)	Approach C: TurboVec KV Injection	Approach D: CGM-RAG + Compression	Improvement (C vs. A)
Input Context Tokens	220	96	21	21	-90.5% tokens
Virtual Memory Tokens	0	0	8 (KV-injected)	45 (compressed)	Bypasses input window
Generation Latency (s)	0.4995	0.3522	0.4467	0.5996	-10.6% latency
Hardware Guards	None	None	VRAM & thermals	VRAM, thermals & C++ RAM	Hardware secure
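
For anyone who wants to check the improvement column, here is a small sketch that recomputes the relative changes from the table values. The `percent_change` helper is a hypothetical name, not something from the repo; the inputs are copied from the table above.

```python
def percent_change(baseline, new):
    """Relative change of `new` versus `baseline`, in percent (negative = reduction)."""
    return (new - baseline) / baseline * 100.0

# Values copied from the benchmark table.
tokens_c_vs_a = percent_change(220, 21)
latency_b_vs_a = percent_change(0.4995, 0.3522)
latency_c_vs_a = percent_change(0.4995, 0.4467)
latency_d_vs_a = percent_change(0.4995, 0.5996)

for label, value in [
    ("Input tokens, C vs. A", tokens_c_vs_a),
    ("Latency, B vs. A", latency_b_vs_a),
    ("Latency, C vs. A", latency_c_vs_a),
    ("Latency, D vs. A", latency_d_vs_a),
]:
    print(f"{label}: {value:+.1f}%")
```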

90.5% Fewer Input Tokens: The prompt sent to the LLM contains only the immediate user turn (21 tokens vs. 220 for context stuffing), leaving the context window free.
Prefill Speedup: Skipping prefill over the history cuts overall generation latency by 10.6% relative to context stuffing (0.4995s to 0.4467s, Approach C vs. A).
KV Compression (Approach D): Shortens the cached sequence (e.g. from 68 to 45 positions, about 34% fewer) to help prevent OOM errors on constrained devices, with each compression checked via KL divergence.
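
A logit KL-divergence gate of the kind mentioned for the compressor can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the repo's implementation: the function names and the 0.01 threshold are made up for the example. The idea is to compare the next-token distribution before and after compression and reject the compression if the two drift too far apart.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def accept_compression(ref_logits, compressed_logits, threshold=0.01):
    """Return (accepted, kl). `threshold` is a hypothetical tuning value."""
    p = softmax(ref_logits)          # distribution with the uncompressed cache
    q = softmax(compressed_logits)   # distribution with the compressed cache
    kl = kl_divergence(p, q)
    return kl <= threshold, kl

reference = [2.0, 1.0, 0.1]
ok_small, kl_small = accept_compression(reference, [2.0, 1.02, 0.1])
ok_large, kl_large = accept_compression(reference, [0.1, 1.0, 2.0])
print(ok_small, ok_large)
```

A small perturbation of the logits passes the gate, while a reordering of the top tokens fails it, which is the behavior you would want before committing a lossy cache rewrite.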

Workstation Protections & Visualizer

Workstation cards need guardrails. I wrote a C++ library wrapper (safety_guard.dll) to enforce:

GPU Mutex Locks: Serializes operations to prevent concurrent allocation race conditions.
Thermal Cooldowns: Inserts rest cycles during prototype adapter training to manage heat.
VRAM Guard: Triggers cache flushes or a safe crash when free VRAM drops below 300 MB.
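
The real guard lives in the C++ wrapper, but the control flow of the mutex and VRAM guard can be sketched in Python. Everything here is an assumption for illustration: `guarded_step` and its callback parameters are hypothetical names, 300 MB is read as 300 MiB, and the order "flush first, abort if still low" is one plausible reading of "cache flushes or safe crashes".

```python
import threading

# 300 MB floor from the post, interpreted here as MiB (an assumption).
VRAM_FLOOR_BYTES = 300 * 1024 * 1024

# One lock shared by all GPU work, so steps run one at a time.
gpu_lock = threading.Lock()

def guarded_step(read_free_bytes, flush_cache, run_step):
    """Run `run_step` under the GPU lock, flushing or aborting if VRAM is low.

    All three arguments are caller-supplied callables (hypothetical hooks):
    read_free_bytes() -> int, flush_cache() -> None, run_step() -> result.
    """
    with gpu_lock:
        if read_free_bytes() < VRAM_FLOOR_BYTES:
            flush_cache()
            if read_free_bytes() < VRAM_FLOOR_BYTES:
                # Fail loudly and early instead of hitting an OOM mid-kernel.
                raise MemoryError("free VRAM still below floor after cache flush")
        return run_step()

# Usage with fake hooks: free memory starts low and recovers after a flush.
state = {"free": 100 * 1024 * 1024, "flushes": 0}

def fake_read():
    return state["free"]

def fake_flush():
    state["flushes"] += 1
    state["free"] = 500 * 1024 * 1024

result = guarded_step(fake_read, fake_flush, lambda: "step done")
print(result, state["flushes"])
```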

The project runs an interactive CLI chat shell and boots a local HTTP visualization dashboard showing the vis.js Concept Map, a Chart.js sequential PCA trajectory of conversation embeddings, log streaming, and system resource gauges.

Check out the code, scripts, and benchmark configurations: https://github.com/LovekeshAnand/Nyxen-Memory

Would love to hear your thoughts on direct KV cache injection and caching techniques!

It's all vibe coded!!!

u/Fabulous-Possible311

Bypassing prompt-stuffing with Conversational Graph Memory (CGM-RAG): Direct KV Cache Injection and in-flight compression on local GPUs