u/liuc0j

Why separating classification from generation made my local Qwen workflow far more stable

After months of experimentation, I think I finally reached a setup that feels genuinely stable for local production-style RAG workflows.

A real operational pipeline for:

  • editorial drafting
  • legal/environmental research
  • structured retrieval
  • HTML generation
  • document synthesis
  • controlled outputs
  • anti-hallucination workflows

Hardware

  • Mac Studio M2 Max
  • 32 GB unified memory

Stack

Running in Docker, sharing the same network:

https://preview.redd.it/7q6x49hzfa1h1.png?width=1852&format=png&auto=webp&s=0211e4ff3a9ebe65071f2d973106ac8db933d4dc

  • Open WebUI
  • PostgreSQL
  • Qdrant (6 GB of data)
  • Apache Tika
  • Open Terminal
  • Nginx Proxy Manager

Inference:

  • LM Studio (latest beta at the moment, 0.4.13)
  • Qwen3.5-9B (Unsloth 4-bit, 7–9 GB)
  • Qwen3.6-35B-A3B Q2_XXS (~12 GB)
  • Qwen3.6-35B-A3B Q3_K_S (~17 GB)

Common parameters: 16k context, temp 0.2, top-k 20, penalty 0.95, unified KV cache on GPU

The biggest thing I learned is how local models actually fail: routing drifts, retrieval gets noisy, prompts bloat, formatting breaks, tool calls loop 🤯, or the model just starts narrating its own reasoning.

The breakthrough for me was separating the workflow into stages instead of relying on one giant "do everything" system prompt.

My pipeline now looks roughly like this:

User query
↓
GBNF classification
↓
Routing decision
↓
Tool / retrieval
↓
Guardrails
↓
Editorial synthesis
↓
Final formatting
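
As a rough illustration of the staged layout above, here is a minimal Python sketch. All function and field names are hypothetical, and every stage body is a stub; in a real setup each stage would call the model or a tool. The point is only that each stage does one job and hands a small state object to the next.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Data carried from one stage to the next (hypothetical shape)."""
    query: str
    classification: dict = field(default_factory=dict)
    route: str = ""
    context: list = field(default_factory=list)
    draft: str = ""
    trace: list = field(default_factory=list)  # names of the stages that ran


def classify(s: State) -> State:
    # Stub: stands in for a grammar-constrained model call.
    s.classification = {"OPERATION": "compliance_check", "CONFIDENCE": "high"}
    return s


def route(s: State) -> State:
    high = s.classification.get("CONFIDENCE") == "high"
    s.route = "retrieval" if high else "fallback"
    return s


def retrieve(s: State) -> State:
    # Stub: stands in for a vector search; the query is forwarded unchanged.
    if s.route == "retrieval":
        s.context = [f"chunk for: {s.query}"]
    return s


def guardrails(s: State) -> State:
    # No retrieved context means the pipeline must not write an article.
    if not s.context:
        s.route = "fallback"
    return s


def synthesize(s: State) -> State:
    if s.route == "fallback":
        s.draft = "No verified sources found."
    else:
        s.draft = f"Draft based on {len(s.context)} chunk(s)."
    return s


def format_output(s: State) -> State:
    s.draft = s.draft.strip()
    return s


STAGES = [classify, route, retrieve, guardrails, synthesize, format_output]


def run_pipeline(query: str) -> State:
    state = State(query=query)
    for stage in STAGES:
        state = stage(state)
        state.trace.append(stage.__name__)
    return state
```

Because each stage is a plain function, a single stage can be swapped or tested in isolation without touching the rest.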

The most important architectural decision

I use GBNF only for the fragile parts:

  • intent classification
  • routing
  • workflow decisions
  • fallback handling
  • output mode selection

NOT for final article generation.

That was a massive improvement.

Before this, the model would often:

  • over-explain
  • invent process narration
  • repeat tool calls
  • drift stylistically
  • produce inconsistent formatting
  • ignore operational constraints

Now the grammar forces highly structured outputs like:

<classification>
OPERATION=compliance_check
DOMAIN=waste_management
REQUEST_TYPE=specific_case
SOURCE_STATUS=sources_found
VERIFICATION=verified_content
CONFIDENCE=high
</classification>
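
A grammar producing a block like the one above could look roughly like the following sketch, written in llama.cpp's GBNF syntax and held in a Python string. The field names come from the example; the alternative values after the first one in each rule are invented for illustration, and how the grammar gets passed to the inference server depends on the runtime and is not shown. The small parser is likewise a hypothetical helper for the downstream stages.

```python
import re

# Hypothetical GBNF grammar. Only the first value in each rule appears in the
# example output; the other alternatives are made up for this sketch.
CLASSIFICATION_GBNF = r'''
root ::= "<classification>\n" operation domain reqtype source verification confidence "</classification>"
operation ::= "OPERATION=" ("compliance_check" | "document_summary" | "drafting") "\n"
domain ::= "DOMAIN=" ("waste_management" | "water" | "other") "\n"
reqtype ::= "REQUEST_TYPE=" ("specific_case" | "general_question") "\n"
source ::= "SOURCE_STATUS=" ("sources_found" | "no_sources") "\n"
verification ::= "VERIFICATION=" ("verified_content" | "unverified_content") "\n"
confidence ::= "CONFIDENCE=" ("high" | "medium" | "low") "\n"
'''

FIELDS = ("OPERATION", "DOMAIN", "REQUEST_TYPE",
          "SOURCE_STATUS", "VERIFICATION", "CONFIDENCE")


def parse_classification(text: str) -> dict:
    """Turn a classification block into a dict, failing loudly if it is malformed."""
    match = re.search(r"<classification>\n(.*?)</classification>", text, re.DOTALL)
    if match is None:
        raise ValueError("no classification block found")
    pairs = dict(
        line.split("=", 1)
        for line in match.group(1).splitlines()
        if "=" in line
    )
    missing = [name for name in FIELDS if name not in pairs]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return pairs
```

With the grammar enforcing the shape, the parser should never hit its error paths in normal operation; they are there to catch a misconfigured run where the grammar was not applied.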

Then another block decides:

  • whether tool usage is mandatory
  • which workflow to activate
  • which output format to use
  • whether fallback mode is required
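
The four decisions listed above can be sketched as a small pure function over the parsed classification. This is an assumption about how such a block might be implemented in code, not the actual setup: the rule thresholds, the `TOOL_OPERATIONS` set, and the format names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    tool_required: bool
    workflow: str
    output_format: str
    fallback: bool


# Hypothetical lookup tables; real values depend on the workflows in use.
TOOL_OPERATIONS = {"compliance_check"}
OUTPUT_FORMATS = {"compliance_check": "html_report"}


def decide(classification: dict) -> RoutingDecision:
    """Map a parsed classification to the four routing decisions."""
    operation = classification["OPERATION"]
    # Assumed rule: fall back when nothing was retrieved or confidence is low.
    fallback = (
        classification["SOURCE_STATUS"] != "sources_found"
        or classification["CONFIDENCE"] == "low"
    )
    return RoutingDecision(
        tool_required=operation in TOOL_OPERATIONS,
        workflow="fallback" if fallback else operation,
        output_format="plain_notice" if fallback else OUTPUT_FORMATS.get(operation, "plain_text"),
        fallback=fallback,
    )
```

Keeping this step as deterministic code (or as a second grammar-constrained call) means the generation model never has to reason about which workflow it is in.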

This made even aggressively quantized models dramatically more reliable.

Optimize narration flow

I explicitly banned outputs like:

"I will now search..."
"I need to verify..."
"Proceeding with analysis..."

Removing process narration improved output quality far more than expected.

Especially with quantized local models.
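
A ban like this is usually written into the prompt, but it can also be checked after generation. Below is a minimal sketch of such a post-generation guardrail, assuming a simple line-based regex filter; the pattern list only covers the three phrases quoted above, and the function names are hypothetical.

```python
import re

# Patterns derived from the three banned phrases; extend as needed.
NARRATION_PATTERNS = [
    r"^\s*I will now\b",
    r"^\s*I need to\b",
    r"^\s*Proceeding with\b",
]
_NARRATION_RE = re.compile("|".join(NARRATION_PATTERNS), re.IGNORECASE)


def has_narration(text: str) -> bool:
    """Return True if any line starts with a banned narration phrase."""
    return any(_NARRATION_RE.search(line) for line in text.splitlines())


def strip_narration(text: str) -> str:
    """Drop lines that start with a banned narration phrase."""
    kept = [line for line in text.splitlines() if not _NARRATION_RE.search(line)]
    return "\n".join(kept).strip()
```

`has_narration` is useful as a retry trigger, while `strip_narration` is the cheaper option when a regeneration is not worth the latency.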

Shorter prompts > smarter prompts

Golden rules:

  • short hard rules
  • operational constraints
  • anti-loop logic
  • explicit fallback behavior
  • slim output structures
  • deterministic formatting

The system prompt became less "literary" and much more procedural.

Less:

  • personality
  • motivational language
  • verbose instructions
  • pseudo-chain-of-thought

More:

  • routing
  • execution constraints
  • output contracts
  • retrieval discipline
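
The anti-loop logic and explicit fallback behavior from the golden rules can also be enforced outside the prompt. Here is a minimal sketch, assuming the orchestration code sees every tool call before it runs; the class name, the default budget of three calls, and the fallback message are all hypothetical.

```python
class ToolCallGuard:
    """Rejects repeated tool calls and enforces a per-request call budget."""

    def __init__(self, max_calls: int = 3):
        self.max_calls = max_calls
        self.seen = set()
        self.calls = 0

    def allow(self, tool: str, arguments: str) -> bool:
        # Normalize case and whitespace so trivial rephrasings count as repeats.
        key = (tool, " ".join(arguments.lower().split()))
        if key in self.seen or self.calls >= self.max_calls:
            return False
        self.seen.add(key)
        self.calls += 1
        return True


# Hypothetical message handed back to the model when a call is rejected.
FALLBACK_MESSAGE = "No further sources available; answer from the verified context only."
```

When `allow` returns False, the caller skips the tool and returns `FALLBACK_MESSAGE` instead, so the model gets an explicit instruction rather than another chance to loop.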

RAG quality improved when I REDUCED context noise

Another thing that surprised me:

Reducing retrieval noise improved quality much more than increasing context size.

I now heavily prioritize:

  • exact query forwarding
  • minimal paraphrasing
  • retrieval discipline
  • compact chunks
  • avoiding semantic duplication
  • limiting repeated tool calls
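
One way to cut duplicated context before it reaches the model is to drop near-identical chunks after retrieval. The sketch below uses plain word-overlap (Jaccard) similarity as a cheap stand-in for semantic deduplication; it is not necessarily what this setup does, and the threshold and chunk cap are hypothetical values.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Word-set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe_chunks(chunks, threshold: float = 0.8, max_chunks: int = 5):
    """Keep chunks in ranked order, skipping near-duplicates of earlier ones."""
    kept, kept_tokens = [], []
    for chunk in chunks:
        if len(kept) >= max_chunks:
            break
        toks = _tokens(chunk)
        if any(jaccard(toks, seen) >= threshold for seen in kept_tokens):
            continue
        kept.append(chunk)
        kept_tokens.append(toks)
    return kept
```

Since the input order is preserved, the highest-ranked version of each duplicated passage is the one that survives.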

Qwen family on Apple Silicon

Both Qwen3.6-35B-A3B Q2_XXS and Qwen3.6-35B-A3B Q3_K_S perform much better than I expected on the M2 Max.

The quantization mostly hurts:

  • instruction precision
  • formatting discipline
  • operational consistency

Not raw reasoning ability.

So strong workflow constraints compensate extremely well.

The smaller Qwen3.5-9B is also extremely useful as:

  • classifier
  • router
  • lightweight editor
  • HTML formatter
  • fast operational assistant

I use it as a task tool in the Open WebUI interface.

What I learned

The biggest improvement wasn't the model itself, but how the system was designed.

Separating:

  • classification
  • routing
  • retrieval
  • generation
  • formatting
  • fallback handling

made the entire system feel significantly more stable than my previous "single giant prompt" approach.

On this setup I'm getting roughly:

  • 40–45 tokens/sec on low-to-medium complexity tasks (document retrieval, summary tables, document synthesis, lightweight editorial work)
  • around 30 tokens/sec on more complex workflows (regulatory comparisons, deeper analysis, long-form generation, structured drafting)

That's at a medium query load.

I'd love to hear from others running similar local workflows. I'm still experimenting and refining the entire stack, so suggestions or examples from other setups are very welcome.

reddit.com
u/liuc0j — 6 days ago