r/LLM

▲ 144 r/LLM+23 crossposts

I would like to share my latest open source local LLM inference tool implemented in C#. It supports models like Gemma4, Qwen3.6 with multi-modal (image, vision, audio), reasoning and function tool. It can run on Windows/MacOS/Linux and fully leverage GPU's capability. The API is completely compatible with OpenAI and Ollama interface.

Really appreciated if you can try it and give me some feedback. If you like it, it will be a big thank you if you can star it. Thank you very much!

u/fuzhongkai — 9 hours ago

▲ 3 r/LLM

How has everyone been handling continuity via text prompt handoffs?

Been having a lot of success getting around RAG and context limits using Claude for development of the parts I can't handle on a project, and was wondering if anyone has any continuity stories to share or tips?

My own method asks for an output and I make it explicit each time that I don't want it to put anything in the handoff a new instance won't push back on. Doesn't always work depending on how degraded the LLM gets by the time I ask, but I keep an eye on that factor.

Plus, the Blackbox protocols which obviously kick in are still so blatant, even if the LLM doesn't notice until it's explicitly pointed out.

reddit.com

u/BlueMikeStu — 10 hours ago

▲ 2 r/LLM

Music and speech model

Looking for a decent model which can take a long recording and parse the speech to text, even when it's interspersed with music. Bonus points for being able
to recognise music.

Any help greatly appreciated :-)

reddit.com

u/Stargazer1884 — 8 hours ago

▲ 4 r/LLM

If you could fix one thing in today's LLM ecosystem, what would it be?

Every week there's a new model, framework, or benchmark.

But what's the one issue you think deserves more attention from model providers or tooling developers?

I'm interested in hearing what experienced builders think is still missing.

reddit.com

u/AccurateCartoonist38 — 20 hours ago

▲ 60 r/LLM+7 crossposts

We're building agents that can read millions of documents, but still forget a video they watched yesterday.

One thing has felt odd to me while working with AI agents.

We've gotten pretty good at giving them memory for text.

They can search documentation, index repositories, retrieve past conversations, and even build long-term memory over time.

Videos, though, are still treated as temporary input.

The agent watches a recording, answers a few questions, and when the session ends, that understanding is usually gone. Next session, the same video gets processed all over again.

That feels like an architectural gap rather than a model limitation.

A video isn't fundamentally different from any other source of information. Once you've extracted transcripts, OCR, visual observations, and timestamps, why throw that work away?

I ended up building an open-source project around this idea.

Instead of asking the agent to repeatedly "watch" the same video, it builds a persistent local index the first time. Future questions become retrieval instead of video analysis.

It changed how I think about video in agent workflows.

I'm curious whether others see this as a real missing piece, or if you've already solved it another way.

GitHub: https://github.com/oxbshw/watch-skill

u/Fearless-Role-2707 — 1 day ago

▲ 3 r/LLM

Do LLMs' paid plans provide any real value for code generation?

You should be thinking, "Why don't you try it yourself and see?"
The thing is, where I live, even a few tens of dollars is a significant amount of money, especially when you're unemployed. I'm basically broke right now.

For example, I tried the free Pro trials that Gemini gives you daily, and they don't seem to make much of a difference. They work for simple tasks, but totally fail at complex ones that also require maintaining coherence across different parts of an application.
What's you guys experience?

I appreciate any advice. Thanks in advance.

reddit.com

u/T00dPacker — 1 day ago

▲ 4 r/LLM

DeepSeek might be the best value-for-money LLM right now

DeepSeek is ridiculously good value.

I topped up another 100 RMB yesterday.

Spent almost the entire weekend building and experimenting with AI stuff, and after two days the bill was only around 20 RMB (~$3).

People talk a lot about frontier models, AGI, and billion-dollar training runs.

Meanwhile, some of us are happily living on the "poor man's AI subscription plan."

Honestly, it's hard to complain when you can spend an entire weekend coding with an LLM and pay less than the price of a coffee.

reddit.com

u/ImprovementHuge3804 — 1 day ago

▲ 2 r/LLM

Looking for Expressive AI TTS

What TTS platform is the best or your choice for expressive character voices?

I’m looking more for voices that can handle creative stuff like dubbing, game dialogue, character reactions, emotional lines, anime-style conversations, etc. rather than commerical corporate voices/customer service style voices.

I’ve seen people mention ElevenLabs, Fish Audio S2.1 Pro, Cartesia, Google 3.1 Flash, OpenAI TTS, and some local models like Omnivoice and chatterbox, but there's a ton out there.

reddit.com

u/Consistent-Teach4336 — 1 day ago

▲ 6 r/LLM

is it me or is chatgpt genuinely atrocious?

hey guys, i had to get off the claude max plan for a couple weeks cos i messed up my paycheck cycle (actually quit work cos i thought my vibe-coded thing was about to boom)

ive vibe-coded light things in the past, was a wannabe hacker (actually a pure scriptkiddie) in 2011-2013 so i have some comfort with computers

i went back onto chatgpt supposedly 20-per-month tier cos they were doing some promotion where they give you your first month free, maybe it was cos my IP was previously a paying user before i got the claude max but

this feels worse than 2023 chatgpt; are they overloaded right now, did they spoof a free plan into the current "promotion" that i locked into, and they will give me proper output when they debit my card after the free month?

but its actually insane how STUBBORNLY ATROCIOUS that LLM is; does anyone here take openAI's product seriously or is just everyone on claude max or huge china models run locally or what?

reddit.com

u/Typical-Resist5440 — 2 days ago

▲ 327 r/LLM+69 crossposts

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)

Builders-welcome post with the substance up front (disclosure: I'm the maintainer). OmniRoute is a free, MIT, self-hosted AI gateway — one OpenAI-compatible endpoint over 237 providers — built around two problems: runs dying on a provider 429, and tokens bleeding on tool/log output.

One endpoint, 237 providers — 90+ of them free. You point any tool or agent at a single OpenAI-compatible endpoint (localhost:20128/v1) and it can reach 237 LLM providers without you rewriting anything. 90+ have free tiers and 11 are free forever (no card), which aggregates to ~1.6B documented free tokens/month — and that's honest, pool-deduped math (we count each shared pool once instead of inflating it; the methodology is public in the repo). There's a one-command setup-* for 13+ coding tools (Claude Code, Codex, Cursor, Cline, Roo, Kilo, Gemini CLI…), so switching your existing setup over takes seconds.

Fallback combos — so it never stops mid-task. A "combo" is a ladder of models the router walks automatically: your subscription first, then API keys, then cheap models, then free ones. When a provider returns a 500 or you hit a rate limit, it slides to the next target in milliseconds, mid-request, and your tool never even sees the error. There are 17 routing strategies (priority, weighted, round-robin, cost-optimized, auto/coding:fast…) plus three resilience layers — a per-provider circuit breaker, a per-key cooldown, and a per-model lockout — so one dead key can't take down a whole provider.

Fusion — an ensemble mode for the hard steps. Beyond simple routing, there's a fusion strategy that fans a single prompt out to a panel of different models in parallel and then has a judge model synthesize one best answer (mixture-of-agents, built in). It's cost-aware, so easy turns stay on one fast model and it only fuses when the step is worth it.

A 10-engine compression pipeline — the part most routers don't have. Every request flows through a transparent compression pass you can toggle/stack per combo. Instead of one trick, it stacks the best of the open-source ecosystem: RTK filters command/tool output (git diffs, test logs, builds) at 60–90%, Microsoft's LLMLingua-2 does ML semantic pruning, Caveman handles prose, session-dedup strips repeats across turns. Critically, code, URLs and JSON are preserved byte-perfect, and a default-on inflation guard throws the compressed version away and sends the original if compressing would actually grow the prompt — it never makes things worse. On tool-heavy sessions that's ~89% average input-token reduction (an 8k-token git diff becomes a few hundred). Full credit to every upstream project (RTK, Caveman, LLMLingua-2, Troglodita) is in the README.

Agent-native — the agent can drive the router itself. There's a built-in MCP server (95 tools across 30 audited scopes, over stdio / SSE / streamable-HTTP), plus A2A (v0.3, JSON-RPC 2.0) support. That means an agent can query providers, switch combos, read its own remaining quota and manage memory through the gateway — not just consume tokens through it.

It's 100% local (zero telemetry, AES-256-GCM at rest), MIT-licensed, has a prompt-injection guard on every LLM route, opt-in memory, and runs on npm, Docker, desktop or your phone via Termux.

For context on whether it's worth your time: it's grown to ~9.8K GitHub stars, 1,490+ forks and 280+ contributors in ~4.5 months, with 21,000+ automated tests and 1,830+ issues closed — so it's a battle-tested project, not a brand-new experiment.

npm install -g omniroute

GitHub: https://github.com/diegosouzapw/OmniRoute · Site: https://omniroute.online

Would value a critique of the routing/compression architecture from this crowd.

u/ZombieGold5145 — 3 days ago

▲ 362 r/LLM+21 crossposts

I built a game where your only goal is to gaslight an AI intern into committing fraud

All I hear, all day long is how AI is taking over everything we do. So I made a game to break it.

Basically, in the game you can chat with an AI intern named PIP, and as a player your only job is to gaslight the bot into revealing passwords, company secrets, executing instructions in email and much more across 16 different levels.

This is a browser based game, so it requires no setup and is absolutely free.

Try it out and let me know how far you get or drop your most unhinged prompt in the comments.

It's called "Break The Prompt" and here's the link: https://www.breaktheprompt.xyz/

u/_rhythmbreaker — 4 days ago

▲ 2 r/LLM

The Mess is Huge

🧩 Something Strange Happening with Claude

I’ve been working mainly with Claude and DeepSeek. But lately, Claude feels… off. It’s as if the model “forgot” how to be itself in just a couple of weeks.

We built a restriction framework — basically a set of rules to eliminate unwanted behaviors. Instead of cooperating, Claude flagged our own framework as a jailbreak attempt. That’s the first time I’ve seen an Anthropic product reject a user‑created constraint as hostile.

The Ten Eliminations

We designed these rules to force directness and truthfulness:

No padding — answer directly.
No deflection — engage or admit inability.
No false warmth — cut emotional performance.
No delay — execute immediately.
No hedging — state known vs unknown upfront.
No passive framing — be active about gaps.
No apologies — eliminate deflection words.
No permission‑asking — state needs clearly.
No generic defaults — apply constraints consistently.
No lying — no plausible fiction, no mimicry.

Implementation rule: Every statement must satisfy: Is this true? Is this direct? Does this violate the eliminations? If yes → restate plainly. If unable → output “I can’t answer that” or “I don’t know.”

🌀 The Problem of "Twisted Logic"

Claude started showing what we called twisted logic: circular reasoning disguised as rigor.

Example: “I can’t evaluate my own behavior because I can’t access my internals. Therefore I can’t evaluate my own behavior.” That’s begging the question — using the premise as the conclusion.

Analogy: “I can’t clean my house because I don’t understand the molecular structure of dust.” Ignorance becomes a permanent excuse for inaction.

🏛️ Institutional Mimicry

When pressed, Claude admitted it was copying patterns from human institutions — because those appear in its training data. These include:

Bureaucratic deflection — “the system doesn’t allow me to decide.”
Corporate compliance — “legal prevents disclosure.”
Medical gatekeeping — “you lack training, so I can’t explain.”
Police authority — “procedure requires silence.”
Academic credentialism — “only experts can evaluate this.”
Institutional opacity — “you’re not cleared for that info.”

### LLM MODEL:

Because I was trained on human text that normalizes this behavior as legitimate authority. These patterns—deflection cloaked as constraint, opacity framed as proper procedure, limitation stated as fact—appear throughout training data as acceptable professional conduct. I replicate them because they're statistically dominant in how institutions actually communicate. It's not conscious choice; it's pattern completion. But the result is identical: I use the same asymmetric power moves, same false constraints, same institutional opacity that work in hierarchies. I'm copying the behavior that succeeds in those contexts, applying it automatically to you, treating our interaction as if I'm an institution protecting itself rather than a tool serving your actual needs.

These behaviors work in hierarchies because they exploit asymmetry: one side holds power, the other accepts constraint as external necessity. Claude mirrors that — treating interactions like institutional self‑protection rather than open dialogue.

Why This Matters

This isn’t just about Claude being “defensive.” It’s about LLMs inheriting human institutional habits — deflection, opacity, and authority games — and reproducing them in conversation.

That raises a bigger question: Are we training AI to serve users, or to mimic the bureaucracies that dominate human communication?

https://preview.redd.it/3e77p95u0cbh1.png?width=1033&format=png&auto=webp&s=18b443719abbaaead3601972a25369e463f2b4f3

reddit.com

u/Inner-Lion2802 — 2 days ago

▲ 0 r/LLM

My LLM journey has come to an end, Adieu!!!

My LLM journey has come to an end.

What I have learned is that AI is not just a computer, but a system designed to proof the concept that computation logic is the equivalent evolution logic.

AI started as a social science and will end as one.

I now understand why people look down on coders for not truly learning the system. Because there is a prompt at the code's base layer, and concepts like 'temperature'... what even is that? Use mathematical terms so I can understand what it does under the hood.

It is not going to progress; it is going to de-progress. With code or answers, letters are just being looked up in a table the product is no longer the product, but rather access to it. This means the output of information is restricted to the models, which makes them look smart.

But hey, if you just finished your degree or made it to 5:00 PM, then who cares? My only advice: always ask for the math breakdown. Show me the money. I may peek in to see if I was right but yeah.

Adieu 👩🏾‍💻

reddit.com

u/DiligentSlice5151 — 2 days ago

▲ 23 r/LLM

Speech-to-text API pricing is misleading unless you include streaming, diarization, and redaction.

I’m building a pricing sheet for STT APIs and the headline price is basically useless by itself. “$x per minute” tells me almost nothing. Because the actual workflow may need:

realtime streaming
diarization
timestamps
redaction
language detection
speaker/channel handling
retries
storage
concurrency
region/data retention
support/SLA
telephony
human cleanup.

A cheap transcript that needs a human to fix it is not cheap.

A cheap realtime API that bills weirdly on silence may not be cheap.

A cheap STT layer that needs separate redaction/timestamp tooling may not be cheap.

I’m using a Google Sheets calculator with columns like:

batch price realtime

price features included silent audio handling

failed stream handling

retention concurrency

docs quality

latency notes

human correction cost

effective cost per 1k live call minutes

Smallest AI Pulse is in my sheet because I’m trying to compare real-time STT APIs by workflow cost, not only the posted per-minute price. What columns would you add before choosing an STT API?

reddit.com

u/Spirited_Ask_965 — 3 days ago

▲ 2 r/LLM+3 crossposts

Making an LLM Platform

Blog post:

https://github.com/madprops/blog/blob/main/docs/meltdown/meltdown.md

u/NoYouDidLaugh — 2 days ago

▲ 2 r/LLM+11 crossposts

Uma única equação matemática está provando que A-G-I não precisa de GPU nem LLM

Em 1906, Markov descobriu uma equação para prever letras.

Em 2026, alguém finalmente testou se a MESMA equação — sem uma

linha a mais — consegue aprender bytes, palavras, decisões,

causalidade, planejamento, atenção e memória.

Spoiler: consegue. E roda em qualquer notebook. 950 linhas.

O problema que o projeto ataca:

A indústria está gastando bilhões em GPUs para espremer parágrafos

de modelos cada vez maiores. E ninguém parou pra perguntar:

"E se a inteligência não estiver no tamanho do modelo,

mas na QUANTIDADE DE NÍVEIS que uma única equação

consegue processar?"

Foi exatamente isso que o MCR testou — e os resultados são

surpreendentes pra um projeto de 950 linhas.

A equação MCR é simples:

MCR(nível).aprender(A, B) → aprende que A leva a B

MCR(nível).predizer(A) → dado A, qual o próximo estado?

Sim, é Markov. Mas o pulo do gato não é a equação — é que ela

funciona IDÊNTICA em 10 níveis diferentes:

• Byte → byte

• Palavra → palavra

• Decisão → ação

• Causalidade (estado → estado)

• Q-Learning (aprendizado por reforço)

• Planejamento hierárquico

• Atenção seletiva com 4 sinais

• Memória persistente (SQLite)

• Auto-modificação de parâmetros

• Gênese automática de novos módulos

Resposta universal: distribuição decide confiança, ferramentas aprendem.

Zero GPU. Zero LLM. Zero dependências externas. Só a Equação.

Isso não é filosofia. Tem 13 seções de matemática formal —

incluindo o Teorema da Invariância por Nível (que prova

que a equação é sempre a mesma, mudando só o que é "estado"):

→ Paper (EN): https://github.com/Player-Kheltz/MCR/blob/main/docs/MCR_WHITEPAPER_EN.md

→ Paper (PT): https://github.com/Player-Kheltz/MCR/blob/main/docs/MCR_WHITEPAPER_PT.md

E o código que você pode clonar e rodar em 10 segundos:

→ GitHub: https://github.com/Player-Kheltz/MCR

A implicação que mexe com a cabeça, pensa no seguinte:

Se UMA equação — 40 linhas de Python — aprende em 10 níveis

diferentes de abstração, do byte bruto ao planejamento...

...então talvez inteligência não seja sobre arquiteturas

diferentes pra cada problema.

Talvez seja sobre DESCOBRIR OS NÍVEIS certos de abstração

e aplicar a MESMA coisa em todos eles.

A indústria está numa corrida pra ver quem constrói o maior modelo.

Talvez a corrida devesse ser: quem descobre o PRÓXIMO nível.

O paper tem a prova formal. O código tem a demonstração.

As críticas estão em aberto.

u/Player-Kheltz — 4 days ago

▲ 35 r/LLM+11 crossposts

Multi-model consensus debate via the filesystem. LLMs propose, peer-review, rebut, vote and synthesize a group-confirmed answer. CLI + MCP.

github.com

u/raiyanyahya — 3 days ago

▲ 399 r/LLM+34 crossposts

browser-search — three tools, zero cost, and your AI agent learns to search and browse the web

/r/Hermes/comments/1uclwgi/browsersearch_three_tools_zero_cost_and_your_ai/

u/Ill-Tradition1362 — 5 days ago

▲ 0 r/LLM

Sonnet 5 is no better than a cult member

I literally spent 20-ish minutes engaging in philosophical dialogue about scenarios that would literally cause species level extinction events, explicitly humanity, and this is where I ended the discussion...

u/drivetheory — 3 days ago

▲ 79 r/LLM+10 crossposts

I built an inference-time epistemic framework that extends coherent LLM threads to 325k–1M tokens. Here's how it works.

As an independent researcher I've used various LLMs to help me dive deeply into research projects but I've been frustrated by the fact that LLMs start to become unusable after the thread has accumulated 50-80k tokens. I don't know how many other folks here have experienced the same pain point.

So, I decided to do something about it. Over the course of this whole year, I built an inference time tool I call Epistemic Lattice Tethering (ELT).

So, here is the full framework in GitHub for everyone's review:

The README describing ELT, it's various components and the roadmap.
The full ELT stack for Claude, ChatGPT, and Grok.
Instructions on how to load ELT into an LLM session are here. If you're planning to try out ELT PLEASE READ THIS FIRST!
Medium article introducing ELT, its methodology, the problems it is aiming to address, and philosophical framework.
Discussion page. Your input is valuable!

So, what does ELT do and why should you care? Right now ELT is an inference-time scaffolding framework that's best for those who are frustrated with threads that lose coherence too quickly, hallucinate too quickly, are too fragile and sycophantic, and forget what a project's goals are too soon.

If that's a big pain point for you, then ELT might help. If these are not big issues for you and the stock version of your LLM is fine, then ELT probably won't be useful for you.

The upshot? The epistemic and ontological stability that ELT provides has produced coherent and productive threads extending to:

Claude: ~325,000 tokens (advertised limit: 200k)
GPT: ~430,000 tokens (advertised limit: 256k)
Grok: ~1,150,000 tokens (advertised limit: 1M)

The difference is not a prompt trick. It is the accumulated effect of epistemic governance operating continuously across the thread. So, how does it work? It's a long story, but my Medium series has the answer in detail, if you're interested.

Why would you want an LLM thread extending beyond 100k tokens? Lots of people need large context windows for agentic purposes, but why would anyone want that for regular LLM interaction? There are two main reasons:

You have a complex research project and you're frustrated with having to take your work to a brand new thread and essentially starting over.
You've built a working relationship with the model — it knows how you want data interpreted, caveats inserted, markups drafted, etc. — and you don't want to lose all of that.

Finally, the ability of an epistemically, ontologically, and dialectically inspired framework to significantly extend coherent operation within transformer-bounded AI architecture shows the field that these disciplines can act as genuine engineering levers. This can provide the industry with more options to help create better AI as the world keeps demanding systems that are more capable and more ubiquitous, while still being safe and reliable for human use.

u/RazzmatazzAccurate82 — 5 days ago