After months of experimentation, I think I finally reached a setup that feels genuinely stable for local production-style RAG workflows.

A real operational pipeline for:

editorial drafting
legal/environmental research
structured retrieval
HTML generation
document synthesis
controlled outputs
anti-hallucination workflows

Hardware

Mac Studio M2 Max
32 GB unified memory

Stack

Running in Docker, sharing the same network:

https://preview.redd.it/7q6x49hzfa1h1.png?width=1852&format=png&auto=webp&s=0211e4ff3a9ebe65071f2d973106ac8db933d4dc

Open WebUI
PostgreSQL
Qdrant (6 GB of data)
Apache Tika
Open Terminal
Nginx Proxy Manager

Inference:

LM Studio (latest beta at the moment, 0.4.13)
Qwen3.5-9B (Unsloth 4-bit, 7-9 GB)
Qwen3.6-35B-A3B Q2_XXS (~12 GB)
Qwen3.6-35B-A3B Q3_K_S (~17 GB)

Common parameters: 16k context, temp 0.2, top-k 20, penalty 0.95, unified KV cache on GPU

The biggest thing I learned is why local models fail: routing drifts, retrieval gets noisy, prompts get bloated, formatting breaks, tool calls loop 🤯, or the model just starts narrating its own reasoning.

The breakthrough for me was separating the workflow into stages instead of relying on one giant "do everything" system prompt.

My pipeline now looks roughly like this:

User query
↓
GBNF classification
↓
Routing decision
↓
Tool / retrieval
↓
Guardrails
↓
Editorial synthesis
↓
Final formatting
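
To make the stage separation concrete, here is a minimal sketch of the flow above as a chain of small functions that pass one state object along. Every function body is a hypothetical stand-in (no model or vector store is called), and the names `PipelineState`, `run_pipeline` and the route labels are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineState:
    """Everything one request accumulates as it moves through the stages."""
    query: str
    classification: Dict[str, str] = field(default_factory=dict)
    route: str = ""
    context: List[str] = field(default_factory=list)
    draft: str = ""
    output: str = ""
    trace: List[str] = field(default_factory=list)


Stage = Callable[[PipelineState], PipelineState]


def classify(state: PipelineState) -> PipelineState:
    # Stand-in for the grammar-constrained classification call.
    is_compliance = "compliance" in state.query.lower()
    state.classification = {
        "OPERATION": "compliance_check" if is_compliance else "general"
    }
    return state


def route(state: PipelineState) -> PipelineState:
    # Routing reads only the classification, never the raw query.
    needs_sources = state.classification["OPERATION"] == "compliance_check"
    state.route = "retrieval" if needs_sources else "direct"
    return state


def retrieve(state: PipelineState) -> PipelineState:
    # Stand-in for the vector search; the query is forwarded unchanged.
    if state.route == "retrieval":
        state.context = [f"[chunk matching: {state.query}]"]
    return state


def guard(state: PipelineState) -> PipelineState:
    # Switch to fallback when retrieval was required but returned nothing.
    if state.route == "retrieval" and not state.context:
        state.route = "fallback"
    return state


def synthesize(state: PipelineState) -> PipelineState:
    # Stand-in for the generation call.
    if state.route == "fallback":
        state.draft = "No verified sources found."
    else:
        state.draft = f"Answer based on {len(state.context)} source(s)."
    return state


def format_output(state: PipelineState) -> PipelineState:
    state.output = f"<p>{state.draft}</p>"
    return state


STAGES: List[Stage] = [classify, route, retrieve, guard, synthesize, format_output]


def run_pipeline(query: str, stages: List[Stage] = STAGES) -> PipelineState:
    state = PipelineState(query=query)
    for stage in stages:
        state = stage(state)
        state.trace.append(stage.__name__)  # record which stages ran, in order
    return state
```

Because each stage only reads and writes the shared state, any single stage can be swapped (for example, a different model for `classify`) without touching the others.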

The most important architectural decision

I use GBNF only for the fragile parts:

intent classification
routing
workflow decisions
fallback handling
output mode selection

NOT for final article generation.

That was a massive improvement.

Before this, the model would often:

over-explain
invent process narration
repeat tool calls
drift stylistically
produce inconsistent formatting
ignore operational constraints

Now the grammar forces highly structured outputs like:

<classification>
OPERATION=compliance_check
DOMAIN=waste_management
REQUEST_TYPE=specific_case
SOURCE_STATUS=sources_found
VERIFICATION=verified_content
CONFIDENCE=high
</classification>
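
A rough sketch of how a grammar for a block like this could be generated and how the result could be parsed afterwards. The field names come from the example above; the value lists beyond the ones shown there are invented for illustration, and the grammar text follows llama.cpp's GBNF syntax as I understand it, so treat it as an assumption rather than the exact grammar used here.

```python
import re

# Field names are from the example block; extra values are hypothetical.
ALLOWED = {
    "OPERATION": ["compliance_check", "drafting", "summary"],
    "DOMAIN": ["waste_management", "other"],
    "REQUEST_TYPE": ["specific_case", "general_question"],
    "SOURCE_STATUS": ["sources_found", "no_sources"],
    "VERIFICATION": ["verified_content", "unverified_content"],
    "CONFIDENCE": ["high", "medium", "low"],
}


def rule_name(key):
    # GBNF rule names are conventionally lowercase words joined by dashes.
    return key.lower().replace("_", "-")


def build_gbnf(allowed=ALLOWED):
    """Build a grammar that admits exactly one classification block."""
    fields = " ".join(f'"{key}=" {rule_name(key)} "\\n"' for key in allowed)
    lines = [f'root ::= "<classification>\\n" {fields} "</classification>"']
    for key, values in allowed.items():
        options = " | ".join(f'"{value}"' for value in values)
        lines.append(f"{rule_name(key)} ::= {options}")
    return "\n".join(lines)


BLOCK_RE = re.compile(r"<classification>\n(.*?)\n</classification>", re.DOTALL)


def parse_classification(text, allowed=ALLOWED):
    """Extract the block into a dict, rejecting unknown keys or values."""
    match = BLOCK_RE.search(text)
    if match is None:
        raise ValueError("no classification block found")
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition("=")
        if not sep or key not in allowed or value not in allowed[key]:
            raise ValueError(f"unexpected line: {line!r}")
        fields[key] = value
    missing = set(allowed) - set(fields)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return fields
```

Even with the grammar enforcing the shape at generation time, validating again on the application side is cheap and catches truncated outputs.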

Then another block decides:

whether tool usage is mandatory
which workflow to activate
which output format to use
whether fallback mode is required
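
As an illustration of that second decision step, here is a small sketch that maps a parsed classification to the four decisions listed above. The rules, the workflow names (`regulatory_lookup`, `editorial`, `fallback`) and the output format names are all hypothetical; in the real setup this decision may itself be produced by a grammar-constrained model call rather than by plain code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    tool_required: bool
    workflow: str
    output_format: str
    fallback: bool


def decide(classification):
    """Turn a classification dict into routing decisions (illustrative rules)."""
    operation = classification.get("OPERATION", "")

    # Anything short of found, verified, non-low-confidence sources goes to fallback.
    fallback = (
        classification.get("SOURCE_STATUS") != "sources_found"
        or classification.get("VERIFICATION") != "verified_content"
        or classification.get("CONFIDENCE") == "low"
    )

    # Specific cases and compliance checks must be grounded in retrieved text.
    tool_required = (
        operation == "compliance_check"
        or classification.get("REQUEST_TYPE") == "specific_case"
    )

    if fallback:
        workflow = "fallback"
    elif operation == "compliance_check":
        workflow = "regulatory_lookup"
    else:
        workflow = "editorial"

    output_format = "summary_table" if operation == "compliance_check" else "html_article"
    return RoutingDecision(tool_required, workflow, output_format, fallback)
```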

This made even aggressively quantized models dramatically more reliable.

Optimize narration flow

I explicitly banned outputs like:

"I will now search..."
"I need to verify..."
"Proceeding with analysis..."

Removing process narration improved output quality far more than expected.

Especially with quantized local models.
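
The ban itself can live in the system prompt, but a post-generation check is a cheap second line of defense. Below is a minimal sketch of such a filter, assuming narration shows up as whole lines that start with phrases like the three quoted above; the pattern list and the `strip_narration` name are illustrative, not part of the setup described here.

```python
import re

# Line-start patterns modeled on the three banned examples.
NARRATION_PATTERNS = [
    re.compile(r"^\s*I(?: will|'ll) now\b", re.IGNORECASE),
    re.compile(r"^\s*I need to\b", re.IGNORECASE),
    re.compile(r"^\s*Proceeding with\b", re.IGNORECASE),
]


def strip_narration(text):
    """Return (cleaned_text, removed_lines) with narration lines dropped."""
    kept, removed = [], []
    for line in text.splitlines():
        if any(pattern.search(line) for pattern in NARRATION_PATTERNS):
            removed.append(line)
        else:
            kept.append(line)
    return "\n".join(kept), removed
```

Logging the removed lines instead of silently discarding them makes it easy to see which model or quantization level narrates the most.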

Shorter prompts > smarter prompts

Golden rules:

short hard rules
operational constraints
anti-loop logic
explicit fallback behavior
slim output structures
deterministic formatting
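
For the anti-loop rule, one simple option is to enforce it outside the prompt as well. The sketch below rejects any tool call that repeats an earlier call with identical arguments, and caps the total number of calls per request. `ToolCallGuard` and the default budget of 3 are assumptions for illustration.

```python
import json


class ToolCallGuard:
    """Per-request guard: no identical repeats, and a hard cap on total calls."""

    def __init__(self, max_calls=3):
        self.max_calls = max_calls
        self.seen = set()
        self.calls = 0

    def allow(self, tool, arguments):
        # Serialize with sorted keys so argument order does not matter.
        key = (tool, json.dumps(arguments, sort_keys=True))
        if key in self.seen or self.calls >= self.max_calls:
            return False
        self.seen.add(key)
        self.calls += 1
        return True
```

When `allow` returns False, the orchestration code can skip the call and switch to the explicit fallback behavior instead of letting the model retry.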

The system prompt became less "literary" and much more procedural.

Less:

personality
motivational language
verbose instructions
pseudo-chain-of-thought

More:

routing
execution constraints
output contracts
retrieval discipline

RAG quality improved when I REDUCED context noise

Another thing that surprised me:

Reducing retrieval noise improved quality much more than increasing context size.

I now heavily prioritize:

exact query forwarding
minimal paraphrasing
retrieval discipline
compact chunks
avoiding semantic duplication
limiting repeated tool calls
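
A minimal sketch of the "compact chunks, no duplication" idea: walk the retrieved chunks in ranked order, skip any chunk that overlaps too much with one already kept, and stop at a small fixed count. Word-set Jaccard overlap is used here only as a simple stand-in for semantic similarity (embedding distance would be the closer match), and the thresholds are arbitrary assumptions.

```python
import re


def _tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_chunks(ranked_chunks, max_chunks=4, max_similarity=0.8):
    """Keep the best-ranked chunks, dropping near-duplicates of kept ones."""
    selected = []
    for chunk in ranked_chunks:
        if len(selected) >= max_chunks:
            break
        if all(jaccard(chunk, kept) < max_similarity for kept in selected):
            selected.append(chunk)
    return selected
```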

Qwen family on Apple Silicon

Both Qwen3.6-35B-A3B Q2_XXS and Qwen3.6-35B-A3B Q3_K_S perform much better than I expected on the M2 Max.

The quantization mostly hurts:

instruction precision
formatting discipline
operational consistency

Not raw reasoning ability.

So strong workflow constraints compensate extremely well.

The smaller Qwen3.5-9B is also extremely useful as:

classifier
router
lightweight editor
HTML formatter
fast operational assistant

I use it as a task tool in the Open WebUI interface.

What I learned

The biggest improvement wasn't the model itself, but how the system was designed.

Separating:

classification
routing
retrieval
generation
formatting
fallback handling

made the entire system feel significantly more stable than my previous "single giant prompt" approach.

On this setup I'm getting roughly:

40–45 tokens/sec on low-to-medium complexity tasks (document retrieval, summary tables, document synthesis, lightweight editorial work)
around 30 tokens/sec on more complex workflows (regulatory comparisons, deeper analysis, long-form generation, structured drafting)

under a medium query load.

I'd love to hear from others running similar local workflows. I'm still experimenting and refining the entire stack, so suggestions or examples from other setups are very welcome.

u/liuc0j

Why separating classification from generation made my local Qwen workflow far more stable