
Why separating classification from generation made my local Qwen workflow far more stable
After months of experimentation, I think I've finally reached a setup that feels genuinely stable for local production-style RAG workflows.
It's a real operational pipeline for:
- editorial drafting
- legal/environmental research
- structured retrieval
- HTML generation
- document synthesis
- controlled outputs
- anti-hallucination workflows
Hardware
- Mac Studio M2 Max
- 32 GB unified memory
Stack
Running in Docker, sharing the same network:
- Open WebUI
- PostgreSQL
- Qdrant (6 GB of data)
- Apache Tika
- Open Terminal
- Nginx Proxy Manager
Inference:
- LM Studio (latest beta at the moment, 0.4.13)
- Qwen3.5-9B (Unsloth 4-bit, 7-9 GB)
- Qwen3.6-35B-A3B Q2_XXS (~12 GB)
- Qwen3.6-35B-A3B Q3_K_S (~17 GB)
Common parameters: 16k context, temp 0.2, top-k 20, penalty 0.95, unified KV cache on GPU
The biggest thing I learned is that local models fail because routing drifts, retrieval gets noisy, prompts get bloated, formatting breaks, tool calls loop 🤯, or the model just starts narrating its own reasoning.
The breakthrough for me was separating the workflow into stages instead of relying on one giant "do everything" system prompt.
My pipeline now looks roughly like this:
User query
↓
GBNF classification
↓
Routing decision
↓
Tool / retrieval
↓
Guardrails
↓
Editorial synthesis
↓
Final formatting
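To make the staged layout concrete, here is a minimal Python sketch of the same flow. Everything in it is an assumption for illustration: the PipelineState fields, the stage function names, and the stubbed bodies are hypothetical, and the real stages would be model, tool, or retrieval calls rather than hard-coded values. The only point is that each stage does one job and hands a small, explicit state to the next.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineState:
    """Everything one request accumulates as it moves through the stages."""
    query: str
    classification: Dict[str, str] = field(default_factory=dict)
    route: str = ""
    evidence: List[str] = field(default_factory=list)
    draft: str = ""
    output: str = ""
    trace: List[str] = field(default_factory=list)


def classify(state: PipelineState) -> None:
    # Stub: the real stage would be a grammar-constrained model call.
    state.classification = {"OPERATION": "compliance_check", "CONFIDENCE": "high"}


def route(state: PipelineState) -> None:
    # Routing is plain code over the classification, not free-form model text.
    state.route = "retrieval" if state.classification.get("OPERATION") else "fallback"


def retrieve(state: PipelineState) -> None:
    # Stub for the tool / vector-store call; the query is forwarded unchanged.
    if state.route == "retrieval":
        state.evidence = [f"(stub chunk for: {state.query})"]


def guard(state: PipelineState) -> None:
    # Guardrail: with no evidence, force the fallback route.
    if not state.evidence:
        state.route = "fallback"


def synthesize(state: PipelineState) -> None:
    # Stub for the unconstrained generation call.
    if state.route == "fallback":
        state.draft = "No verified sources found."
    else:
        state.draft = "Draft based on: " + "; ".join(state.evidence)


def format_output(state: PipelineState) -> None:
    # Stub for final formatting (for example, HTML templating).
    state.output = state.draft.strip()


STAGES: List[Callable[[PipelineState], None]] = [
    classify, route, retrieve, guard, synthesize, format_output,
]


def run_pipeline(query: str) -> PipelineState:
    state = PipelineState(query=query)
    for stage in STAGES:
        stage(state)
        state.trace.append(stage.__name__)  # record which stages ran, in order
    return state
```

Calling run_pipeline("some question") returns the final state, and state.trace shows the order in which the stages ran, which is handy when debugging routing drift.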
The most important architectural decision
I use GBNF only for the fragile parts:
- intent classification
- routing
- workflow decisions
- fallback handling
- output mode selection
NOT for final article generation.
That was a massive improvement.
Before this, the model would often:
- over-explain
- invent process narration
- repeat tool calls
- drift stylistically
- produce inconsistent formatting
- ignore operational constraints
Now the grammar forces highly structured outputs like:
<classification>
OPERATION=compliance_check
DOMAIN=waste_management
REQUEST_TYPE=specific_case
SOURCE_STATUS=sources_found
VERIFICATION=verified_content
CONFIDENCE=high
</classification>
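For readers who have not written a grammar like this, here is a rough sketch of what one could look like, plus a small parser for the resulting block. This is not my exact grammar: the rule names and every enum value that does not appear in the example above (document_synthesis, other, general_question, no_sources, unverified_content, medium, low) are made up for illustration. The grammar is kept as a Python string so that any runtime accepting GBNF (llama.cpp does) could receive it; how you pass it depends on your inference server.

```python
import re
from typing import Dict

# Illustrative GBNF grammar. Values not shown in the example block are hypothetical.
CLASSIFICATION_GBNF = r'''
root ::= "<classification>\n" operation domain request-type source-status verification confidence "</classification>"
operation ::= "OPERATION=" ("compliance_check" | "document_synthesis") "\n"
domain ::= "DOMAIN=" ("waste_management" | "other") "\n"
request-type ::= "REQUEST_TYPE=" ("specific_case" | "general_question") "\n"
source-status ::= "SOURCE_STATUS=" ("sources_found" | "no_sources") "\n"
verification ::= "VERIFICATION=" ("verified_content" | "unverified_content") "\n"
confidence ::= "CONFIDENCE=" ("high" | "medium" | "low") "\n"
'''

EXPECTED_KEYS = (
    "OPERATION", "DOMAIN", "REQUEST_TYPE",
    "SOURCE_STATUS", "VERIFICATION", "CONFIDENCE",
)


def parse_classification(text: str) -> Dict[str, str]:
    """Extract KEY=value pairs from a <classification> block, failing loudly."""
    match = re.search(r"<classification>\n(.*?)\n</classification>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <classification> block found")
    fields: Dict[str, str] = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        fields[key.strip()] = value.strip()
    missing = [k for k in EXPECTED_KEYS if k not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return fields
```

Because the grammar already guarantees the shape, the parser can stay this simple. It still raises on anything unexpected, so a misconfigured grammar shows up immediately instead of silently misrouting.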
Then another block decides:
- whether tool usage is mandatory
- which workflow to activate
- which output format to use
- whether fallback mode is required
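Those four decisions can live in ordinary code once the classification is parsed. The sketch below assumes the classification arrives as a dict of the fields shown earlier. The workflow names, output format names, and the fallback rule are hypothetical placeholders, not the ones from my setup.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RoutingDecision:
    tool_required: bool
    workflow: str
    output_format: str
    fallback: bool


# Hypothetical table: operation -> (workflow, output format, tool required).
WORKFLOWS: Dict[str, Tuple[str, str, bool]] = {
    "compliance_check": ("regulatory_review", "structured_report", True),
    "editorial_draft": ("editorial", "html", False),
}


def decide(classification: Dict[str, str]) -> RoutingDecision:
    """Turn a parsed classification into an explicit routing decision."""
    operation = classification.get("OPERATION", "")
    if operation not in WORKFLOWS:
        # Unknown operation: never guess, go straight to fallback.
        return RoutingDecision(False, "fallback", "plain_text", True)

    workflow, output_format, tool_required = WORKFLOWS[operation]
    low_confidence = classification.get("CONFIDENCE") == "low"
    sources_missing = (
        tool_required and classification.get("SOURCE_STATUS") != "sources_found"
    )
    fallback = low_confidence or sources_missing
    return RoutingDecision(tool_required, workflow, output_format, fallback)
```

Keeping this as a lookup table rather than prompt text means the model never gets to improvise a route. It only fills in the fields, and the code decides what happens next.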
This made even aggressively quantized models dramatically more reliable.
Optimize narration flow
I explicitly banned outputs like:
"I will now search..."
"I need to verify..."
"Proceeding with analysis..."
Removing process narration improved output quality far more than expected.
Especially with quantized local models.
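The ban itself lives in the system prompt, but a cheap post-check can catch whatever slips through. The sketch below is an assumption about how such a guardrail might look, not the filter I run: the pattern list is seeded from the three phrases above and would need extending for real use.

```python
import re

# Line-start patterns for process narration, seeded from the banned phrases.
NARRATION_PATTERNS = [
    r"^\s*I will now\b",
    r"^\s*I need to\b",
    r"^\s*Proceeding with\b",
]
_NARRATION_RE = re.compile("|".join(NARRATION_PATTERNS), re.IGNORECASE)


def has_narration(text: str) -> bool:
    """Return True if any line starts with a known narration phrase."""
    return any(_NARRATION_RE.search(line) for line in text.splitlines())


def strip_narration(text: str) -> str:
    """Drop lines that start with a known narration phrase."""
    kept = [line for line in text.splitlines() if not _NARRATION_RE.search(line)]
    return "\n".join(kept).strip()
```

For example, has_narration can gate a retry with a stricter prompt, while strip_narration is the blunter option when a retry is too expensive.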
Shorter prompts > smarter prompts
Golden rules:
- short hard rules
- operational constraints
- anti-loop logic
- explicit fallback behavior
- slim output structures
- deterministic formatting
The system prompt became less "literary" and much more procedural.
Less:
- personality
- motivational language
- verbose instructions
- pseudo-chain-of-thought
More:
- routing
- execution constraints
- output contracts
- retrieval discipline
RAG quality improved when I REDUCED context noise
Another thing that surprised me:
Reducing retrieval noise improved quality much more than increasing context size.
I now heavily prioritize:
- exact query forwarding
- minimal paraphrasing
- retrieval discipline
- compact chunks
- avoiding semantic duplication
- limiting repeated tool calls
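Two of these points are easy to enforce in code: dropping near-duplicate chunks and capping repeated tool calls. The sketch below is a rough illustration under stated assumptions. It uses word-level Jaccard overlap as a cheap stand-in for semantic similarity (a real setup would more likely compare embeddings), and the function names, class name, and thresholds are all hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a lexical proxy, not true semantics."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_chunks(
    ranked_chunks: List[str], max_chunks: int = 4, max_similarity: float = 0.8
) -> List[str]:
    """Keep the best-ranked chunks, skipping ones too similar to a kept chunk."""
    kept: List[str] = []
    for chunk in ranked_chunks:
        if len(kept) >= max_chunks:
            break
        if all(jaccard(chunk, k) < max_similarity for k in kept):
            kept.append(chunk)
    return kept


class ToolCallBudget:
    """Refuse identical (tool, query) calls beyond a small limit."""

    def __init__(self, max_repeats: int = 1) -> None:
        self.max_repeats = max_repeats
        self._seen: Dict[Tuple[str, str], int] = {}

    def allow(self, tool: str, query: str) -> bool:
        key = (tool, query.strip().lower())
        count = self._seen.get(key, 0)
        if count >= self.max_repeats:
            return False
        self._seen[key] = count + 1
        return True
```

select_chunks expects its input already sorted by retrieval score, so the first copy of any duplicated passage is the one that survives.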
Qwen family on Apple Silicon
Both Qwen3.6-35B-A3B Q2_XXS and Qwen3.6-35B-A3B Q3_K_S perform much better than I expected on the M2 Max.
The quantization mostly hurts:
- instruction precision
- formatting discipline
- operational consistency
Not raw reasoning ability.
So strong workflow constraints compensate extremely well.
The smaller Qwen3.5-9B is also extremely useful as:
- classifier
- router
- lightweight editor
- HTML formatter
- fast operational assistant
I use it as a task tool in the Open WebUI interface.
What I learned
The biggest improvement wasn't the model itself, but how the system was designed.
Separating:
- classification
- routing
- retrieval
- generation
- formatting
- fallback handling
made the entire system feel significantly more stable than my previous "single giant prompt" approach.
On this setup I'm getting roughly:
- 40–45 tokens/sec on low-to-medium complexity tasks (document retrieval, summary tables, document synthesis, lightweight editorial work)
- around 30 tokens/sec on more complex workflows (regulatory comparisons, deeper analysis, long-form generation, structured drafting)
I'd love to hear from others running similar local workflows. I'm still experimenting and refining the entire stack, so suggestions or examples from other setups are very welcome.