r/OpenSourceAI

Steno: Open-source, AI-powered intelligence layer for all your confidential conversations.

Hey folks, wanted to share the latest update of Steno. Steno is an open-source, privacy-focused AI notepad that rivals Granola, with the added benefits of open-source code and keeping your data private. No cloud, no usage limits, and completely free.

With v0.3.0, you can now:

Query all your notes across time
Get diarised transcripts
Keep a conversational history of all your chats against notes

In our roadmap, we'll be releasing speaker diarisation and live transcription next :)

We have a great community of contributors and are always looking for great people to improve the project and push the boundaries of privacy, local LLMs, and open-source AI.

Codebase: https://github.com/ruzin/stenoai
Download: https://stenoai.co

u/Far_Noise_5886 — 6 hours ago

▲ 9 r/OpenSourceAI+2 crossposts

[Benchmark] Qwen3.6-27B-FP8 on One RTX 6000 Ada: Fast TTFT, 668 tok/s Peak Throughput

Detailed setup below:

---

Model

Field	Value
Model	Qwen/Qwen-3.6 27B
Hugging Face path	Qwen/Qwen3.6-27B-FP8
Quantization / dtype	FP8
Request sizing configured	8192 max tokens

---

Serving Setup

Field	Value
Engine	vLLM 0.19
Endpoint	/v1/chat/completions
Streaming	ON
Tensor parallel size	1
Data parallel size	1
GPU memory utilization	0.90
max_model_len	8192
max_num_seqs	16
Tool call parser	qwen3_coder
Reasoning parser	qwen3

Engine flags:

--tensor-parallel-size 1
--data-parallel-size 1
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--gpu-memory-utilization 0.90
--max-model-len 8192
--max-num-seqs 16
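
For anyone wanting to reproduce the setup, the flags above could be assembled into a launch command roughly like this. It is a sketch that assumes the `vllm serve` entry point and the Hugging Face path from the Model table; the post does not show the full command line, so nothing beyond the listed flags is included.

```python
import shlex

# Hugging Face path taken from the Model table in the post.
MODEL = "Qwen/Qwen3.6-27B-FP8"

# Flags copied from the "Engine flags" list; no extra flags are assumed.
ENGINE_FLAGS = {
    "--tensor-parallel-size": "1",
    "--data-parallel-size": "1",
    "--tool-call-parser": "qwen3_coder",
    "--reasoning-parser": "qwen3",
    "--gpu-memory-utilization": "0.90",
    "--max-model-len": "8192",
    "--max-num-seqs": "16",
}


def build_command(model: str, flags: dict) -> list:
    """Return the argv list for launching the server."""
    command = ["vllm", "serve", model]
    for name, value in flags.items():
        command += [name, value]
    return command


command = build_command(MODEL, ENGINE_FLAGS)
print(shlex.join(command))
```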

---

Hardware

Component	Configuration
GPU	1× RTX 6000 Ada
VRAM	48GB
CPU	48 vCPU
System RAM	118GB

---

Workload

Field	Value
Dataset	ShareGPT sample
Unique prompts	128
Concurrency levels	8, 12, 16
Total requests	384
Conversation shape	Multi-turn chat
Languages	en, zh, ru, th, ko, fr, pl, ja
max_model_len	8192
max output tokens per completion	1024
Temperature	0.2

---

Results Summary

• TTFT p50 avg: 0.48s

• TTFT p95 avg: 0.94s

• TPOT p50 avg: 29.2 ms/token

• Total throughput peak: 668.5 tok/s

• KV cache max: 32.67%

---

TTFT

Metric	Avg	Max	Unit	Interpretation
p50 TTFT	0.4802	3.75	seconds	Median requests started streaming quickly.
p95 TTFT	0.9444	4.875	seconds	Most requests started under ~1 second on average.
p99 TTFT	1.074	4.975	seconds	Tail TTFT stayed controlled on average, with occasional spikes.

---

Token Throughput

Token Type	Avg	Max	Unit	Interpretation
Prompt tokens	170.4	386.9	tokens/sec	Input processing throughput.
Output tokens	161.5	314.1	tokens/sec	Decode throughput.
Total tokens	331.9	668.5	tokens/sec	Combined prefill + decode throughput.
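
One way to read these numbers is to convert the median TPOT into a per-stream decode rate and scale it by concurrency. This is back-of-the-envelope arithmetic only: it assumes every stream decodes at the p50 TPOT and that all concurrency slots stay busy, which a real run will not sustain.

```python
TPOT_P50_MS = 29.2  # from the Results Summary


def per_stream_tokens_per_sec(tpot_ms: float) -> float:
    """Decode rate of one stream if each token takes tpot_ms."""
    return 1000.0 / tpot_ms


def ideal_aggregate(tpot_ms: float, concurrency: int) -> float:
    """Upper bound on output tok/s if all streams decode in lockstep."""
    return per_stream_tokens_per_sec(tpot_ms) * concurrency


for concurrency in (8, 12, 16):
    print(concurrency, round(ideal_aggregate(TPOT_P50_MS, concurrency), 1))
```

Under those assumptions a single stream decodes at about 34 tok/s, and the idealized aggregate is roughly 274, 411, and 548 tok/s at concurrency 8, 12, and 16. The reported 314.1 tok/s output peak falls between the first two figures, so the gap to the 16-stream ideal is one place to look for headroom.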

---

Curious how others would read these numbers. Is this good single-GPU Qwen3.6-27B performance, or is there obvious headroom I'm missing here?

u/Temporary-Owl1725 — 3 hours ago

▲ 17 r/OpenSourceAI+7 crossposts

Chimera: an open-source, self-hostable agent that runs on local models (any OpenAI-compatible endpoint) and can fuse several at once

I've been building an open-source agent (Apache-2.0) and wanted to share it here because it's designed to be fully local and self-hostable: it talks to any OpenAI-compatible endpoint, so Ollama / llama.cpp / vLLM / LM Studio all work as the backend. No cloud lock-in, your keys and data stay yours.

The core idea is LLM-Fusion: for the hard steps it can run a panel of models on the same prompt, have a judge model cross-check them (consensus / contradictions / blind spots), and a synthesizer write the final answer. Locally this is fun because you can mix a few small local models and let them cross-check each other. A cost/latency-aware router keeps easy turns on a single model so you're not paying panel latency for everything.
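
The panel, judge, synthesizer, and router flow described above can be sketched in a few lines. This is not Chimera's code: the `Model` type, the `is_hard` predicate, and the prompt wording are all hypothetical stand-ins, and each model is reduced to a plain callable.

```python
from typing import Callable, Dict

# Hypothetical interface: a model is anything that maps a prompt to text.
Model = Callable[[str], str]


def fuse(prompt: str, panel: Dict[str, Model], judge: Model, synthesizer: Model) -> str:
    """Run every panel model, have a judge cross-check, then synthesize."""
    answers = {name: model(prompt) for name, model in panel.items()}
    listing = "\n".join(f"[{name}] {text}" for name, text in answers.items())
    review = judge(
        f"Question: {prompt}\nCandidates:\n{listing}\n"
        "List consensus, contradictions, and blind spots."
    )
    return synthesizer(
        f"Question: {prompt}\nCandidates:\n{listing}\nReview:\n{review}\n"
        "Write the final answer."
    )


def answer(prompt: str, is_hard: Callable[[str], bool], single: Model,
           panel: Dict[str, Model], judge: Model, synthesizer: Model) -> str:
    """Router: easy turns stay on one model, hard turns pay for the panel."""
    if not is_hard(prompt):
        return single(prompt)
    return fuse(prompt, panel, judge, synthesizer)
```

In practice each callable would wrap a request to an OpenAI-compatible endpoint, and the routing predicate is where the cost/latency trade-off lives.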

Beyond that it's a full agent: plan -> act -> verify-or-revert (it runs your tests and treats the result as ground truth), layered memory (SQLite + FTS recall, cross-session profile, consolidation), a governance kernel, cron/proactive jobs, MCP client + OpenAPI-to-tool import, and an isolated subagent/crew layer (parallel git worktrees with per-worker verify gates). Runs on a laptop or a $5 VPS via Docker.

Honest status: it's alpha - 463 tests, mypy --strict clean, but no production mileage yet. Local reasoning quality obviously depends on the models you point it at, so I'd genuinely love to hear which local models people find good enough to actually drive an agent loop (reliable tool use + self-correction) - that's the make-or-break for going fully local.

Repo: https://github.com/brcampidelli/chimera-agent

u/Federal-Teaching2800 — 6 hours ago

▲ 8 r/OpenSourceAI+4 crossposts

TokenMizer - a local proxy for session checkpoint/resume and graph memory across Claude, GPT, and Ollama

I've been building TokenMizer, a local proxy that sits between your editor/CLI and whatever model you're using (Claude, GPT, Ollama) and handles two things I kept re-solving by hand: session checkpoint/resume, and a graph-based memory instead of a flat transcript.

The problem: once a long agent session hits the context limit, the usual fix is summarization, and summaries lose the reasoning behind a decision, not just the decision itself. I'd see a summary saying "switched to Argon2" with no trace of why bcrypt was rejected, so the agent would re-litigate the same tradeoff two sessions later. Flat transcripts have the opposite problem: everything is kept, but nothing is prioritized, so retrieval is just recency-biased keyword luck.

What TokenMizer does differently: instead of one growing text blob, decisions, constraints, and open questions are stored as nodes with edges (this decision depends on that constraint, this question was resolved by that decision). Checkpointing snapshots that graph plus a resumable session state, so you can kill a session and pick it back up without replaying the whole history through the model again.
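
A minimal sketch of that idea, assuming nothing about TokenMizer's real schema: the class names, the `depends_on` relation label, and the JSON checkpoint format are all made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Node:
    id: str
    kind: str  # "decision", "constraint", or "question"
    text: str


@dataclass
class MemoryGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add(self, node: Node) -> None:
        self.nodes[node.id] = node

    def link(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((source, relation, target))

    def why(self, node_id: str) -> list:
        """Return the nodes a decision depends on, i.e. its recorded reasoning."""
        return [self.nodes[t] for s, r, t in self.edges
                if s == node_id and r == "depends_on"]

    def checkpoint(self) -> str:
        """Serialize the whole graph so a session can be resumed later."""
        return json.dumps({
            "nodes": [asdict(n) for n in self.nodes.values()],
            "edges": self.edges,
        })

    @classmethod
    def resume(cls, blob: str) -> "MemoryGraph":
        data = json.loads(blob)
        graph = cls()
        for raw in data["nodes"]:
            graph.add(Node(**raw))
        graph.edges = [tuple(edge) for edge in data["edges"]]
        return graph
```

With a structure like this, a "switched to Argon2" decision node can carry an edge to the constraint that ruled out bcrypt, so the reason survives a checkpoint instead of being summarized away.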

Where it's rough: there's no eval harness yet comparing retrieval quality against a naive flat-transcript baseline, so right now my evidence is anecdotal (my own sessions), not benchmarked. I also learned the hard way that benchmarking your own memory system by asking it questions only it can answer is circular, so I'm holding off on publishing numbers until I have an honest comparison.

Repo: github.com/Shweta-Mishra-ai/tokenmizer (I'm the author). It's a Python project, MIT licensed. If you've hit the same summarization-loses-reasoning problem, I'd be interested in how you're handling it, and PRs/issues on the eval-harness gap would genuinely help.

u/Feisty-Cranberry2902 — 8 hours ago

▲ 19 r/OpenSourceAI+1 crossposts

How bad is it for a new player to start playing online now?

I guess all of the players are beasts, right?

u/HallucinatedPhoenix — 13 hours ago

▲ 25 r/OpenSourceAI+4 crossposts

Open source Visual and NDT Inspection Software

Hey everyone! After months of building, I'm releasing Open3DInspection – a browser-based 3D inspection and annotation platform for the oil & gas and NDT industries.

What it does:

Multi-format 3D viewer: OBJ, FBX, PLY, LAS/LAZ, Gaussian splats all in the browser
Georeferenced annotations: Pin inspections and annotations directly on your models with lat/lon coordinates
Local-first processing: Upload and process 3D data locally—nothing leaves your computer
Self-hosted: No SaaS subscriptions, no vendor lock-in. Run it yourself.

Why I built it:

I previously worked in oil & gas, and inspection workflows are painful. Teams use disconnected tools: CAD viewers, spreadsheets, marked-up PDFs. I wanted something modern that actually works with the data people already have: drone imagery, LAS point clouds, photogrammetry outputs.

Tech stack:

Frontend: React + TypeScript + Vite (dev experience is 🔥)
3D: react-three-fiber + loaders.gl (handles everything from OBJ to Gaussian splats)
Architecture: Fully client-side, no Node backend required

Currently working on:

Photogrammetry MVP pipeline (browser-side image processing)
Drone integration (evaluating DJI Neo 2)
NDT data overlay (UT, RT, MT, PT annotations)
API 510/570/653 compliance workflows

What I need:

Feedback from practitioners: Oil & gas, NDT, aviation maintenance folks—what's broken in your inspection process?
Contributors: Especially interested in photogrammetry, WebGL optimization, or domain expertise
Ideas: If you work with 3D industrial data, let me know what would actually be useful

Repo: https://github.com/zawawiAI/Open3DInspection

This is very much an early-stage open project, so expect some rough edges. But the core viewer is solid and the foundation is there for something really useful.

Would love to hear thoughts, especially from anyone doing real inspection work.

u/Ill-Equivalent7859 — 10 hours ago

▲ 172 r/OpenSourceAI+20 crossposts

I would like to share my latest open-source local LLM inference tool, implemented in C#. It supports models such as Gemma4 and Qwen3.6 with multimodal input (image, vision, audio), reasoning, and function/tool calling. It runs on Windows, macOS, and Linux and makes full use of the GPU. The API is fully compatible with the OpenAI and Ollama interfaces.

I'd really appreciate it if you could try it and give me some feedback. If you like it, a star would be a big thank-you. Thank you very much!

u/fuzhongkai — 21 hours ago

▲ 41 r/OpenSourceAI+5 crossposts

If your GPU can run inference, it should be able to fine-tune too.

I spent the last few months building a new sparse fine-tuning method for MoE models called USAF.

The goal was simple: if your GPU can run inference on an MoE model, it should also be able to fine-tune it.

On my AMD RX 6750 XT (12 GB), I can fine-tune Qwen3-30B-A3B by training sparse expert weights and the router instead of adapters.

The project is completely open source under the Apache 2.0 license. I'm not trying to build a business, sell anything, or monetize it in any way—I just wanted to share something I built that I think is genuinely interesting.

GitHub: https://github.com/tsuyu122/usaf

u/tsuyu122 — 21 hours ago

▲ 17 r/OpenSourceAI+2 crossposts

I released Orion4D MetaPrompt — a ComfyUI prompt engineering suite with local Ollama support and a standalone List Constructor

Hi everyone,

I’ve been working on a cleaner way to manage prompt-building inside ComfyUI, and I just released the first public version of Orion4D MetaPrompt.

It’s a custom node suite designed to make prompt creation cleaner, faster, and much more flexible, especially when working with reusable prompt lists, local LLMs, and more complex generation workflows.

The repo currently includes:

MetaPrompt Node — a dynamic prompt builder with list loading, block chaining, drag-and-drop organization, seed modes, and random selection.
MetaPrompt Ollama Node — takes the assembled prompt and sends it to a local Ollama model for automatic prompt enhancement.
ImageToPrompt Ollama Node — local vision captioning from a connected ComfyUI image input or a batch folder scan.
List Constructor — a standalone browser utility to create, clean, label, sort, copy, import, and export prompt lists before using them in ComfyUI.

It is especially useful if you work with large prompt libraries, reusable style lists, subject/background combinations, local LLMs, or more complex generative AI workflows.

GitHub repository:
https://github.com/orion4d/Orion4D_MetaPrompt

Live List Constructor utility:
https://orion4d.github.io/Orion4D_MetaPrompt/List_Constructor/

Feedback, bug reports, tests, and ideas are very welcome — especially from people using local LLMs or large prompt libraries inside ComfyUI.

u/boulettoxx — 18 hours ago

▲ 198 r/OpenSourceAI+8 crossposts

Wait... what!? 12 AI applications running entirely on a $5 ESP32. No cloud, no internet. Universal installer + open-source GitHub repo + Hugging Face available. Test it yourself.

For years, edge AI has promised intelligence everywhere. In practice, most "edge AI" still means sending data to the cloud, relying on large Linux systems, or requiring expensive accelerator hardware.

SuperESP changes that.

Built on Atome LM v2, SuperESP transforms a standard ESP32 into a tiny AI appliance capable of running twelve practical applications entirely offline.

No GPUs.

No subscriptions.

No datacenter.

Just a microcontroller that costs less than a cup of coffee.

Every claim is verifiable and tied to a script.

What SuperESP Actually Is

SuperESP is not another chatbot squeezed onto a microcontroller.

It is a collection of specialized ternary AI models designed to classify events, patterns, behaviors, and anomalies directly on the device.

The current release includes:

Agriculture monitoring

Voice commands

Motion recognition

Gesture detection

Sound event classification

Machine anomaly detection

Air quality analysis

Energy monitoring

Occupancy estimation

Wearable activity tracking

Water leak detection

Predictive maintenance

It also comes with:

+ ESP32 OS

+ Universal Installer

Check out everything:

https://github.com/TilelliLab/atome-lm

u/themoroccanship — 2 days ago

▲ 326 r/OpenSourceAI+69 crossposts

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)

Builders-welcome post with the substance up front (disclosure: I'm the maintainer). OmniRoute is a free, MIT, self-hosted AI gateway — one OpenAI-compatible endpoint over 237 providers — built around two problems: runs dying on a provider 429, and tokens bleeding on tool/log output.

One endpoint, 237 providers — 90+ of them free. You point any tool or agent at a single OpenAI-compatible endpoint (localhost:20128/v1) and it can reach 237 LLM providers without you rewriting anything. 90+ have free tiers and 11 are free forever (no card), which aggregates to ~1.6B documented free tokens/month — and that's honest, pool-deduped math (we count each shared pool once instead of inflating it; the methodology is public in the repo). There's a one-command setup-* for 13+ coding tools (Claude Code, Codex, Cursor, Cline, Roo, Kilo, Gemini CLI…), so switching your existing setup over takes seconds.

Fallback combos — so it never stops mid-task. A "combo" is a ladder of models the router walks automatically: your subscription first, then API keys, then cheap models, then free ones. When a provider returns a 500 or you hit a rate limit, it slides to the next target in milliseconds, mid-request, and your tool never even sees the error. There are 17 routing strategies (priority, weighted, round-robin, cost-optimized, auto/coding:fast…) plus three resilience layers — a per-provider circuit breaker, a per-key cooldown, and a per-model lockout — so one dead key can't take down a whole provider.
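
The ladder-walking behaviour can be sketched as below. This is an illustration under simplifying assumptions, not OmniRoute's implementation: `ProviderError`, `CircuitBreaker`, and `call_with_fallback` are hypothetical names, and the three resilience layers are collapsed into a single per-target cooldown.

```python
import time


class ProviderError(Exception):
    """Stand-in for a 429, a 500, or any other provider failure."""


class CircuitBreaker:
    """Skip a target for cooldown_s seconds after it fails."""

    def __init__(self, cooldown_s: float = 30.0):
        self.cooldown_s = cooldown_s
        self.open_until = {}

    def available(self, target: str, now: float) -> bool:
        return now >= self.open_until.get(target, 0.0)

    def trip(self, target: str, now: float) -> None:
        self.open_until[target] = now + self.cooldown_s


def call_with_fallback(ladder, prompt, breaker, now=None):
    """Walk the ladder in order and return (target, reply) from the first success."""
    now = time.monotonic() if now is None else now
    errors = []
    for target, call in ladder:
        if not breaker.available(target, now):
            continue  # still cooling down from an earlier failure
        try:
            return target, call(prompt)
        except ProviderError as exc:
            breaker.trip(target, now)
            errors.append(f"{target}: {exc}")
    raise ProviderError("all targets failed: " + "; ".join(errors))
```

The caller only ever sees the first successful reply, which is the property the post describes: the failing rung is recorded and skipped, not surfaced.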

Fusion — an ensemble mode for the hard steps. Beyond simple routing, there's a fusion strategy that fans a single prompt out to a panel of different models in parallel and then has a judge model synthesize one best answer (mixture-of-agents, built in). It's cost-aware, so easy turns stay on one fast model and it only fuses when the step is worth it.

A 10-engine compression pipeline — the part most routers don't have. Every request flows through a transparent compression pass you can toggle/stack per combo. Instead of one trick, it stacks the best of the open-source ecosystem: RTK filters command/tool output (git diffs, test logs, builds) at 60–90%, Microsoft's LLMLingua-2 does ML semantic pruning, Caveman handles prose, session-dedup strips repeats across turns. Critically, code, URLs and JSON are preserved byte-perfect, and a default-on inflation guard throws the compressed version away and sends the original if compressing would actually grow the prompt — it never makes things worse. On tool-heavy sessions that's ~89% average input-token reduction (an 8k-token git diff becomes a few hundred). Full credit to every upstream project (RTK, Caveman, LLMLingua-2, Troglodita) is in the README.
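
The inflation guard is the easiest piece to illustrate. In this sketch the whitespace tokenizer and the `dedupe_lines` compressor are toy stand-ins (assumptions, not any of the engines named above); only the keep-the-original-if-it-grew rule comes from the description.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def dedupe_lines(text: str) -> str:
    """Toy compressor: drop lines that already appeared earlier."""
    seen, kept = set(), []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def guarded(compress, text: str) -> str:
    """Apply compress, but fall back to the original if the result is larger."""
    candidate = compress(text)
    if count_tokens(candidate) > count_tokens(text):
        return text
    return candidate
```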

Agent-native — the agent can drive the router itself. There's a built-in MCP server (95 tools across 30 audited scopes, over stdio / SSE / streamable-HTTP), plus A2A (v0.3, JSON-RPC 2.0) support. That means an agent can query providers, switch combos, read its own remaining quota and manage memory through the gateway — not just consume tokens through it.

It's 100% local (zero telemetry, AES-256-GCM at rest), MIT-licensed, has a prompt-injection guard on every LLM route, opt-in memory, and runs on npm, Docker, desktop or your phone via Termux.

For context on whether it's worth your time: it's grown to ~9.8K GitHub stars, 1,490+ forks and 280+ contributors in ~4.5 months, with 21,000+ automated tests and 1,830+ issues closed — so it's a battle-tested project, not a brand-new experiment.

npm install -g omniroute

GitHub: https://github.com/diegosouzapw/OmniRoute · Site: https://omniroute.online

Would value a critique of the routing/compression architecture from this crowd.

u/ZombieGold5145 — 2 days ago

▲ 1 r/OpenSourceAI+4 crossposts

What If Joker Became a Teacher? (AI Video)

Full video (YouTube Shorts): https://youtube.com/shorts/JgNpvPC3U5I

I created this transformation using Google Flow. The idea was to imagine Joker leaving crime behind to become a school teacher. I focused on smooth facial transformation and cinematic motion.

I'm looking for feedback on the animation quality and prompt design. If anyone is interested, I can also share the prompt I used.

u/LegPowerful7 — 1 day ago

▲ 15 r/OpenSourceAI+3 crossposts

H64LM: A 249M-parameter Mixture-of-Experts Transformer built from scratch in PyTorch

Hi everyone,

I built H64LM, a research project to better understand modern LLMs by implementing one from scratch in PyTorch.

Instead of relying on high-level training frameworks, I implemented the core components myself: attention, MoE routing, normalization, and the training loop.

Features

249M-parameter Transformer
Grouped Query Attention (GQA)
Sparse Mixture-of-Experts (8 experts, Top-2 routing) with 3 auxiliary routing losses
SwiGLU, RoPE, RMSNorm
Sliding-window attention
Mixed-precision training, gradient accumulation
Custom training loop (no Trainer abstractions)
Checkpointing and resume support
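
For readers unfamiliar with Top-2 routing, here is a dependency-free sketch for a single token. It is not taken from the H64LM repo: it uses plain Python lists instead of PyTorch tensors, omits the auxiliary losses, and assumes the two selected gate weights are renormalized to sum to 1, which is one common convention.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]  # renormalization is an assumption


def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token vector."""
    out = [0.0] * len(x)
    for index, weight in top2_route(router_logits):
        expert_out = experts[index](x)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out
```

Only two of the eight experts run per token, which is where the sparsity saving comes from.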

The included checkpoint was trained on a subset of WikiText-103 to validate the pipeline end-to-end, not to be a strong model; it's visibly overfit past epoch 10 (best val PPL ~40.5).

Known limitations are documented in the README, including batch-size-1-only generation and no true DDP (falls back to DataParallel).

GitHub: https://github.com/Haiderkhan64/H64LM

Feedback on the implementation or architecture is very welcome.

u/Loose_Literature6090 — 1 day ago

▲ 18 r/OpenSourceAI+9 crossposts

Mastyf.ai

🚀 From MCP Guardian to Mastyf.ai

What started as an open-source experiment in securing and governing AI agents through the Model Context Protocol (MCP) has evolved into something much bigger.

Today, I'm excited to share a glimpse of that journey.

The video below showcases MCP Guardian — the project that laid the foundation for what is now Mastyf.ai: a security-first platform for AI agent governance, runtime policy enforcement, observability, approval workflows, and enterprise trust.

As AI agents gain access to tools, data sources, APIs, and autonomous workflows, the challenge is no longer just building agents—it's governing them safely, transparently, and at scale.

That's the problem we're working on at Mastyf.ai.

🔹 Runtime governance for AI agents

🔹 Policy enforcement and approval workflows

🔹 Security controls for MCP ecosystems

🔹 Auditability, observability, and compliance readiness

🔹 Enterprise-grade AI control planes

We would love feedback from developers, security researchers, platform engineers, AI engineers, and enterprise architects.

Try it. Break it. Stress-test it. Tell us what we're missing.

Special thanks to everyone who contributed ideas, bug reports, feature requests, testing, and feedback along the way. Building secure AI infrastructure is a community effort, and we're just getting started.

If you're interested in AI security, agent governance, MCP, enterprise AI infrastructure, or would like to collaborate, comment below or reach out directly.

u/Puzzleheaded-Cow2725 — 1 day ago

▲ 14 r/OpenSourceAI+2 crossposts

basemind: an MCP server that indexes your repo so agents answer from signatures, not full file reads

I kept watching coding agents answer "what calls this function" by grepping, opening three files, and reading them top to bottom to find four call sites. On a big repo that eats the context window fast.

basemind indexes a repo once and answers structurally. The MCP tools return paths, line numbers, and signatures instead of file bodies, so a lookup costs a fraction of reading the source. What it exposes:

Code map (300+ languages): outline, search_symbols, find_references, find_callers, call_graph, find_implementations. An expand escape hatch pulls a single function's full body when the agent actually needs it.
Git at symbol resolution: blame_symbol, symbol_history (when a symbol's body changed), recent_changes, diff_outline.
Document RAG over 90+ formats with text extraction and OCR built in, plus semantic and full-text search.
Shared memory and an agent-to-agent comms channel (rooms, DMs, inbox) for running more than one agent on the same repo.

Runs three ways over one local index: a Claude Code plugin, a plain MCP server, or a CLI. Works with Claude Code, Codex, Cursor, Gemini CLI, Copilot CLI, OpenCode and a few others. Rust, MIT.

On token savings: it ships a heuristic counter (an outline is modelled at about 1/5 of reading the file, a caller lookup about 1/3 of grep plus read). It's an honest estimate, not a benchmark, and tools with no fair baseline (memory, git wrappers) count zero.
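
As a rough illustration of how such a counter could work (the function and table names here are hypothetical, not basemind's code), the two ratios quoted above translate into estimated savings like this:

```python
# Modelled cost of a structural lookup relative to the naive baseline,
# using the two ratios quoted in the post.
COST_RATIO = {
    "outline": 1 / 5,       # vs. reading the whole file
    "find_callers": 1 / 3,  # vs. grep plus read
}


def estimated_savings(tool: str, baseline_tokens: int) -> int:
    """Tokens saved versus the baseline; tools without a fair baseline count zero."""
    ratio = COST_RATIO.get(tool)
    if ratio is None:
        return 0
    return round(baseline_tokens * (1 - ratio))


print(estimated_savings("outline", 1000))
print(estimated_savings("find_callers", 900))
print(estimated_savings("blame_symbol", 900))
```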

Honest limitations: it's an index, so it lags edits between scans. serve watches by default and there's a rescan, but a cold first scan is slower (worst case in my tests is the TypeScript compiler, 81k files, about 18s), and the git-history index costs 6 to 22% of your .git on disk.

https://github.com/Goldziher/basemind

Curious how others here are feeding repo structure to agents over MCP.

u/Goldziher — 1 day ago

▲ 2 r/OpenSourceAI+2 crossposts

Would you use a self-hosted CI/CD platform where an AI sets up your whole pipeline — and refuses to ship your leaked API keys? (idea validation, nothing to sell)

Hey everyone — before I sink into building this, I want to know if it's something you'd actually use or if I'm solving a problem only I have.

The idea: an open-source, self-hostable DevOps platform (think Dokploy-style dashboard) with its own CI/CD engine — not a wrapper around GitHub Actions — where an AI agent acts as your DevOps engineer.

The flow:

Login → connect GitHub (or GitLab), create a project
The AI scans your repo, asks you 2–3 questions (deploy target? env vars?), and builds the full pipeline in the dashboard — build, test, deploy stages. Nothing is pushed to your repo at this point.
Every pipeline includes mandatory security stages: secret scanning across all files (yes, including that API key you pasted into a .md file), dependency CVE checks, container image scanning. If it finds a leaked key, the pipeline halts and the AI opens a fix PR — removes the secret, moves it to the secret store, and reminds you it's still in Git history and needs rotating.
Only after the first pipeline run passes does it open one PR to your repo with all the generated files — Dockerfile, the pipeline config, deploy files — with the green run and staging URL linked as proof it actually works. Merge it, and from then on everything lives in your Git. Delete the platform tomorrow and you keep working configs.
After that it keeps working: failed builds get diagnosed in plain English with a fix PR instead of a red X and 4,000 log lines. Production incidents get a timeline, probable cause, and one-click rollback.
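
The mandatory secret-scanning stage from the flow above could start as simply as this. It is a sketch with two illustrative rules (an AWS-style access key ID and a generic `api_key = ...` assignment); the rule set, function names, and halt/continue gate are assumptions, and a real scanner would need many more patterns plus a Git-history scan.

```python
import re
from pathlib import Path

# Illustrative rules only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*['\"]?([A-Za-z0-9_\-]{20,})"
    ),
}


def scan_text(text: str) -> list:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


def scan_repo(root: str) -> dict:
    """Scan every regular file under root, Markdown included."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and ".git" not in path.parts:
            hits = scan_text(path.read_text(errors="ignore"))
            if hits:
                report[str(path)] = hits
    return report


def pipeline_gate(report: dict) -> str:
    """Halt the pipeline as soon as anything was found."""
    return "halt" if report else "continue"
```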

The parts I think this sub will care about:

Fully self-hostable, single docker compose up, targeting a cheap VPS. Own runners — no GitHub Actions minutes, no Git-host lock-in
BYO LLM key — Anthropic/OpenAI, or point it at local Ollama. No hidden inference bill; your code never leaves your box with a local model
Zero lock-in by design: after the first successful run, every config (including the CI/CD definition) is committed to your repo
The AI never touches prod without approval — everything is a PR or a gated action with a full audit log

I know Coolify and Dokploy exist (I use and like them) — they give you a dashboard and templates. This gives you an agent that reasons about your specific repo, enforces security by default, and maintains the setup over time. Closer to "a DevOps engineer that works for you" than "a deploy panel."

My questions for you:

Would the "proof-first PR" (config PR arrives only after a passing run) be enough for you to trust merging AI-generated configs?
Is the built-in secret/CVE scanning with auto-fix PRs genuinely valuable to you, or is that already covered in your setup?
Own CI/CD engine vs. wrapping GitHub Actions — do you care? Would you prefer your CI not depend on GitHub at all?
What's the first thing you'd be afraid it would break?

Brutal honesty welcome. If the answer is "nobody wants this," better to hear it now.

u/nobod____y — 1 day ago

▲ 9 r/OpenSourceAI+1 crossposts

We'll benchmark an open-weight LLM on any GPU you choose: drop your model + hardware and we'll run it.

We run HexGrid Cloud, a platform for deploying open-source models on GPUs, and we're heads-down optimizing our serving/deployment layer.

To pressure-test it we're benchmarking real models under real concurrency — and instead of guessing, we'd rather run what you actually want to see.

---

Models available for benchmarking:

Nemotron-3 Super 120B-A12B (only NVFP4)
Nemotron-3 Nano 30B A3B
Qwen-3.6 27B
Llama 3.3 70B Instruct
Gemma-4 31B
Devstral-Small-2-24B-Instruct-2512
?? (you suggest a model to us)

We're focused on chat/instruct models for now (that's what most of our users deploy), so pick one from the list above — or suggest another open-weight chat model that fits on a single H200 (141GB).

---

Hardware & quant choices:

GPU (up to H200 for this round): RTX PRO 6000 · L40S · H100 · H200
Quant: FP8 / AWQ / BF16
Context length: 8K / 32K / 64K / 128K
What you want measured: max throughput? single-stream speed? long-context prefill?

---

We'll run the top picks and post full results — tokens/sec, TTFT, TPOT, throughput under concurrency, and cost-per-million-tokens — config and flags included so it's reproducible.

Let us know in the comments.

u/Temporary-Owl1725 — 1 day ago

▲ 28 r/OpenSourceAI+13 crossposts

Deterministic folding for LLM agents: continuity without LLM compaction

I just open-sourced Context Warp Drive, a continuity engine for LLM agents.

Repo: https://github.com/dogtorjonah/context-warp-drive

Right now, the industry has two bad ways of dealing with long agent horizons:

Just ride the 1M-2M context window.
Use an LLM to summarize older messages ("compaction").

LLM summaries are inconsistent, they burn an extra model round-trip, they quietly drop the exact identifiers your agent needs (UUIDs, paths, hashes), and worst of all, they constantly rewrite the prefix—which trashes your provider prompt cache.

This library takes a different approach: deterministic folding.

As the agent works, older context is folded into deterministic skeletons. Instead of linearly bloating to the ceiling, the active context sawtooths—building up efficiently, then dropping back down to a clean floor without losing continuity.

Why not just use the 1M token window?

Because 95% of what an agent carries with it on a long task isn't needed right now. It's looking for the needle in the haystack, but massive context windows force it to carry all the hay.

A larger window raises the ceiling, but it doesn't move the floor where models reason best. Long-context evals keep showing the same thing—models do not use giant contexts as cleanly as the marketing numbers imply:

Lost in the Middle — models degrade when needed information is buried in the middle of long context.
RULER — large drops as context length and task complexity increase, even for models advertised as long-context.
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval — length itself hurts performance even when retrieval succeeds.
Intelligence Degradation in Long-Context LLMs — models can collapse past critical context thresholds even when input remains relevant.

By keeping the agent deterministically folding with a warm cache and a low context band, you keep it snappy, cheap, and focused. You leave the hay behind until it's actually needed.

How Context Warp Drive works:

The Rebirth Seed: The continuity package that makes the full reset possible. It carries the recent user and AI messages, what the agent was actively working on and editing, its execution plan state, preserved exact identifiers from the full trace, and episodic context from earlier work. It is not a vague summary—it is a structured, deterministic snapshot the agent can wake up from and continue seamlessly.
Cache-Hot Appending: As the agent works, older turns fold into compact bands that append onto the rebirth seed. The context builds up over time, but because the seed stays byte-identical, you pay for cheap cache reads turn after turn instead of expensive fresh inputs.
The Sawtooth Reset: You can't append forever. When measured input pressure hits your configured ceiling, the engine performs the full sawtooth—the context drops back to a fresh rebirth seed and the cycle continues from a low-context floor.
Zero-LLM Folding: Raw chat history stays preserved as the source of truth, but the model sees a deterministic compact view. Tool calls, paths, receipts, retained reasoning, and exact identifiers are all preserved without asking another model to summarize anything.
Episodic Recall: When the agent re-touches a path or concept from before the reset, the engine pages the relevant folded detail back in. The agent doesn't carry all the hay—it pulls it back when it matters.
Task Rail: I also included a portable execution primitive called TaskRail. It keeps long-horizon plan state outside the prompt: steps, progress, acceptance criteria, and serializable checkpoints. Combined with folding and rebirth seeds, the agent stays low-context while still knowing exactly where it is in a multi-step workflow.
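
The fold, append, and reset cycle can be sketched in miniature. This is a toy model of the idea, not the library's API: the class and function names, the word-count budget, and the identifier regex (UUIDs and slash-separated paths) are all assumptions.

```python
import re

# Exact identifiers to carry through folds and resets: UUIDs and paths.
ID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    r"|(?:/[\w.-]+)+"
)


def fold(turn: str, max_words: int = 8) -> str:
    """Deterministic skeleton: leading words plus every exact identifier."""
    head = " ".join(turn.split()[:max_words])
    ids = sorted(set(ID_PATTERN.findall(turn)))
    return head + (" | ids: " + ", ".join(ids) if ids else "")


class SawtoothContext:
    def __init__(self, ceiling_words: int, keep_recent: int = 2):
        self.ceiling_words = ceiling_words
        self.keep_recent = keep_recent
        self.raw = []     # full trace, never rewritten
        self.seed = ""    # unchanged between resets, so the prefix stays cacheable
        self.bands = []   # folded turns appended after the seed
        self.recent = []  # most recent turns, kept verbatim

    def render(self) -> str:
        return "\n".join([self.seed, *self.bands, *self.recent]).strip()

    def size(self) -> int:
        return len(self.render().split())

    def append(self, turn: str) -> None:
        self.raw.append(turn)
        self.recent.append(turn)
        while len(self.recent) > self.keep_recent:
            self.bands.append(fold(self.recent.pop(0)))
        if self.size() > self.ceiling_words:
            self._reset()

    def _reset(self) -> None:
        """Drop to the floor: rebuild the seed from the raw trace, clear the bands."""
        ids = sorted({i for t in self.raw for i in ID_PATTERN.findall(t)})
        self.seed = "SEED ids: " + ", ".join(ids)
        self.bands = []
```

No model call is involved at any step, so the same trace always produces the same view, and the rendered size climbs and then drops instead of growing without bound.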

What's in the repo:

Core folding engine, provider-agnostic across Anthropic content blocks, OpenAI-style tool_calls, and Gemini parts.
Anthropic prompt-cache breakpoint helpers to maximize read-hits.
Raw rebirth seed renderer.
Model-aware context budget resolver.
Fold recall and episodic recall (with an optional SQLite episode store).
Portable Task Rail state machine.
Gemini CLI and Codex CLI folding adapters.

There are a lot of knobs you can tune, but the core philosophy is the same: use the 1M window as safety headroom, not as the operating band.

(Not on npm yet—install from source for now.)

I've been running this in my own multi-agent orchestration stack for months and completely dropped LLM compaction. The difference is fundamental: the agent stops treating context as a giant backpack and starts treating it like a paged working set—small, hot, recoverable, and always grounded in the raw trace.

u/MusicToThyEars — 2 days ago

▲ 61 r/OpenSourceAI+40 crossposts

Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.
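
As an illustration of the triple-extraction step, here is a simplified sketch. It is not Kwipu's code: the `links_to` relation name is made up, and the frontmatter handling only covers flat `key: value` lines where a real implementation would use a YAML parser.

```python
import re

# Matches [[Target]], [[Target|alias]] and [[Target#heading]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def split_frontmatter(text: str):
    """Return (frontmatter, body); frontmatter is empty if the note has none."""
    if text.startswith("---\n"):
        end = text.find("\n---", 3)
        if end != -1:
            return text[4:end], text[end + 4:]
    return "", text


def extract_triples(note_name: str, text: str) -> list:
    """Build (subject, relation, object) triples from one Markdown note."""
    front, body = split_frontmatter(text)
    triples = []
    for line in front.splitlines():
        # Only flat "key: value" pairs; nested YAML is skipped in this sketch.
        if ":" in line and not line.startswith((" ", "-")):
            key, value = line.split(":", 1)
            if value.strip():
                triples.append((note_name, key.strip(), value.strip()))
    for target in WIKILINK.findall(body):
        triples.append((note_name, "links_to", target.strip()))
    return triples
```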

https://github.com/benmaster82/Kwipu

u/WritHerAI — 2 days ago

▲ 11 r/OpenSourceAI

Anyone using something other than codex/claude for coding and system tasks?

Hey everybody!

I am looking for a good alternative to codex/claude models for coding and system tasks.

Is anyone else here using something other than codex/claude for those sorts of tasks?

Would love to hear your opinions.

Thanks

u/apunker — 2 days ago