u/hazyhaar

**TL;DR**: Running Qwen3.6-27B AWQ-INT4 on a single RTX 5090 (32 GB) for legal-claim extraction in a Go pipeline. Hit two non-obvious walls that cost me half a day: (1) BF16 KV cache caps you at 16K max-model-len, but FP8 KV gets you to 24K with the same VRAM footprint; (2) `temperature=0.2` under guided JSON schema triggers infinite repetition loops on this model, and the loop is not only on text: it also shows up on a numeric field as a single integer with 5000+ digits. Sharing a 42-run sampling benchmark, exact configs, and what actually works.

Posted to corroborate the [vLLM #40080 Gemma observation](https://github.com/vllm-project/vllm/issues/40080) and the [Qwen3.5/3.6 issue #145](https://github.com/QwenLM/Qwen3.6/issues/145) with concrete numbers on a Blackwell SM_120 setup.

---

## Hardware and stack

- GPU: NVIDIA RTX 5090, 32 GB VRAM, Blackwell SM_120

- CUDA 12.8, cuDNN 9.6

- vLLM 0.19.0 via `nvcr.io/nvidia/vllm:26.04-py3`

- llama-swap v216 orchestrating three model slots:

  - Vision: Qwen2-VL-7B-Instruct (16K context, BF16 KV, swap)

  - Reason: **Qwen3.6-27B AWQ-INT4** (this is the one I'm writing about)

  - Embed: BGE-M3 (resident, ~2.3 GB)

- Workload: legal-claim extraction of structured output via JSON Schema, ~5W1H decomposition per claim

The reasoning slot uses the [cyankiwi/Qwen3.6-27B-AWQ-INT4](https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-INT4) build. Internal architecture is `Qwen3_5ForConditionalGeneration` (GDN hybrid + Mamba) — needs vLLM ≥ 0.17 to run at all.

---

## Wall #1: VRAM math for max-model-len on a single 32 GB card

Initial config: `--gpu-memory-utilization 0.85 --max-model-len 12288 --dtype auto`. 26.5 GB VRAM, working fine for short docs. But 37 % of my email corpus exceeds 8K tokens, and the chain-of-thought prompt I use needs ~8K output tokens for the scratchpad. So `12288 - 8192 = 4096` input budget, which overflows on most non-trivial emails.

Measured KV cache scaling with BF16:

| max-model-len | KV cache (BF16) | Total VRAM | Fits on 32 GB? |
|---|---|---|---|
| 12288 (start) | ~13 GB | 27 GB | ✓ comfortable margin |
| 16384 | ~17 GB | 31 GB | ⚠ 1 GB free, kills multi-slot co-tenancy |
| 24576 | ~26 GB | 40 GB | ✗ overflow |
| 32768 | ~35 GB | 49 GB | ✗ physically impossible |

The bench-tool crowd on r/LocalLLaMA kept telling me to "just bump to 32K", but that is not feasible on a 32 GB card without quantizing the KV cache. So I tried FP8 KV.

### FP8 KV cache changes the picture

Adding `--kv-cache-dtype fp8` halves the KV memory:

| max-model-len + KV dtype | KV cache | Total VRAM | Fits on 32 GB? |
|---|---|---|---|
| 16384 + FP8 | ~8.5 GB | 22.5 GB | ✓ huge margin |
| 24576 + FP8 | ~13 GB | 27 GB | ✓ same footprint as 12K BF16 start |
| 32768 + FP8 | ~17 GB | 31 GB | ⚠ tight |
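
The two estimate tables above follow a simple linear model, sketched here in Python as a sanity check. The constants are read off the BF16 table (about 13 GB of KV cache at 12288 tokens, plus a fixed ~14 GB for everything else); the function names are made up for illustration, and the output is a back-of-envelope estimate, not what vLLM reports.

```python
# Back-of-envelope VRAM model fitted to the estimate tables in this post.
# Assumptions (not taken from vLLM): KV cache grows linearly with
# max-model-len, FP8 halves it, and the non-KV share is a fixed ~14 GB.
NON_KV_GB = 14.0                     # 27 GB total - 13 GB KV at 12288 (BF16 row)
KV_GB_PER_TOKEN_BF16 = 13.0 / 12288  # roughly 1 MB per token of context
CARD_GB = 32.0
KV_SCALE = {"bf16": 1.0, "fp8": 0.5}


def kv_cache_gb(max_model_len: int, kv_dtype: str = "bf16") -> float:
    """Estimated KV cache size for one full-length context."""
    return KV_GB_PER_TOKEN_BF16 * max_model_len * KV_SCALE[kv_dtype]


def total_vram_gb(max_model_len: int, kv_dtype: str = "bf16") -> float:
    """Estimated total footprint: fixed share plus KV cache."""
    return NON_KV_GB + kv_cache_gb(max_model_len, kv_dtype)


def input_budget(max_model_len: int, max_tokens: int = 8192) -> int:
    """Tokens left for the prompt once the output budget is reserved."""
    return max_model_len - max_tokens


if __name__ == "__main__":
    for length, dtype in [(12288, "bf16"), (16384, "bf16"),
                          (24576, "fp8"), (32768, "fp8")]:
        total = total_vram_gb(length, dtype)
        print(f"{length:>6} {dtype:<4} ~{total:4.1f} GB total, "
              f"{CARD_GB - total:+5.1f} GB headroom, "
              f"{input_budget(length):>6} input tokens")
```

The useful number is the last column: at `max_tokens=8192`, going from 12288 to 24576 takes the input budget from 4096 to 16384 tokens for the same estimated footprint.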

Empirical measurement on the live server, after killing the container and warm-up:

| Config | VRAM used | Cold start | Free VRAM |
|---|---|---|---|
| 12288 BF16 (start) | 26.5 GB | 96 s | 5.5 GB |
| 16384 BF16 | 28.0 GB | not retested | 4.0 GB |
| **24576 FP8 (chosen)** | **28.4 GB** | **131 s** (+35 s vs BF16) | **3.6 GB** |

Counterintuitive: 24K FP8 consumes nearly the same VRAM as 16K BF16 (28.4 GB vs 28.0 GB), because vLLM pre-allocates the KV pool to `gpu-memory-utilization=0.85` regardless of effective dtype/length. You don't see VRAM savings on the gauge; you cash the saving in as *input capacity*. Net gain: input budget moves from 4K → 16K tokens at `max_tokens=8192`.

FP8 KV quality cost on AWQ-INT4 weights: theoretical 2–3 % degradation, in practice noise-level on AWQ-INT4 (the 4-bit weight quantization dominates). Validated empirically — see end of post.

### Production llama-swap config for Reason slot

```yaml
qwen3.6-27b:
  cmd: >
    docker run --rm --name vllm-reason
    --gpus all --ipc=host
    -v /inference/models:/models
    -v vllm-cache:/root/.cache/vllm
    -p 127.0.0.1:8003:8000
    nvcr.io/nvidia/vllm:26.04-py3
    vllm serve /models/qwen3.6-27b-awq-int4
    --served-model-name qwen3.6-27b
    --gpu-memory-utilization 0.85
    --max-model-len 24576 --kv-cache-dtype fp8
    --max-num-seqs 4
  ttl: 300
```

---

## Wall #2: guided JSON + low temperature = infinite repetition

First smoke test of the pipeline with `max-model-len 24576` plus the corresponding client-side `MaxTokens: 8192`: one document (`04546`, a short 953-char .md) generated **68 claims, of which 67 had `text=""` and identical `char_start=107, char_end=238`**. Pure loop fail mode.

Initial hypothesis: model-level repetition bias. Looked at the literature:

- vLLM bug [#40080 (Gemma)](https://github.com/vllm-project/vllm/issues/40080): "When grammar restricts the token space to valid JSON tokens, the model's slight repetition bias becomes a strong loop because the grammar prevents the model from generating an EOS or breaking out of the pattern."

- [Qwen3.5/3.6 issue #145](https://github.com/QwenLM/Qwen3.6/issues/145): official sampling recommendation, **explicitly states "greedy decoding should not be used as it can lead to performance degradation and endless repetitions."** The pipeline was running at `T=0.2`, which is quasi-greedy.

So the bug is exactly what the vLLM ticket describes: the model has a baseline repetition tendency, guided JSON masks every token outside the schema, model can't emit EOS in the middle of an array, so it fills the array with whatever fits. On this corpus, sometimes that's `text=""` repeated, sometimes (as I found later in benchmarking) it's a single `char_start` integer with 5000+ digits.

### Bench protocol

7 sampling configs × 3 prototype documents (short, medium, complex) × 2 runs each = 42 calls against the live `:8156/v1/chat/completions` proxy (which forwards to llama-swap → vLLM Reason). Same JSON Schema, same prompt, same `max_tokens=8192`. Configs:

| Label | Sampling params |
|---|---|
| baseline_T02 | T=0.2 |
| hardened_T02 | T=0.2 + schema `minLength=1` on text + `maxItems=30` on claims |
| qwen_instruct | T=0.7, top_p=0.8, top_k=20, presence_penalty=1.5 (official Qwen instruct mode) |
| qwen_reasoning | T=0.6, top_p=0.95, top_k=20, presence_penalty=0.0 (official Qwen reasoning mode) |
| intermediate_T04 | T=0.4, top_p=0.9, presence_penalty=0.3 |
| reppen_only | T=0.2, repetition_penalty=1.1 |
| conservative_T03 | T=0.3, top_p=0.9, presence_penalty=0.5 |
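
For anyone who wants to rebuild the grid, here is a minimal Python sketch of how the 42 request payloads could be enumerated. It is not the actual bench script: `build_jobs`, the schema name `"claims"`, and the job dict layout are hypothetical, the `messages` field is left out, and the only difference modelled for `hardened_T02` is that it gets the stricter schema.

```python
import itertools

# Sampling parameters per config label, copied from the table above.
SAMPLING_CONFIGS = {
    "baseline_T02": {"temperature": 0.2},
    "hardened_T02": {"temperature": 0.2},  # same sampling, stricter schema
    "qwen_instruct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                      "presence_penalty": 1.5},
    "qwen_reasoning": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                       "presence_penalty": 0.0},
    "intermediate_T04": {"temperature": 0.4, "top_p": 0.9,
                         "presence_penalty": 0.3},
    "reppen_only": {"temperature": 0.2, "repetition_penalty": 1.1},
    "conservative_T03": {"temperature": 0.3, "top_p": 0.9,
                         "presence_penalty": 0.5},
}
DOC_IDS = ["04546", "04547", "19958"]
RUNS_PER_CELL = 2


def build_jobs(schema: dict, hardened_schema: dict, max_tokens: int = 8192) -> list:
    """Enumerate config x document x run and attach a request payload to each.

    The prompt (`messages`) is deliberately omitted; add it before sending.
    """
    jobs = []
    for (label, sampling), doc_id, run in itertools.product(
        SAMPLING_CONFIGS.items(), DOC_IDS, range(1, RUNS_PER_CELL + 1)
    ):
        chosen = hardened_schema if label == "hardened_T02" else schema
        payload = {
            "model": "qwen3.6-27b",
            "max_tokens": max_tokens,
            "response_format": {
                "type": "json_schema",
                # "claims" is a placeholder schema name.
                "json_schema": {"name": "claims", "schema": chosen},
            },
            **sampling,
        }
        jobs.append({"config": label, "doc_id": doc_id, "run": run,
                     "payload": payload})
    return jobs
```

Keeping the grid as data makes it easy to add a config later without touching the loop, and `len(build_jobs(...))` doubles as a check that the run count is still 7 × 3 × 2.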

### Bench results

Unique claims persisted per run, two runs per cell:

| Config | doc 04546 | doc 04547 | doc 19958 | Avg total | Loop fails |
|---|---|---|---|---|---|
| baseline_T02 | 11 / 12 | 5 / 3 | 11 / 11 | 26.5 | 0 |
| hardened_T02 | 11 / **FAIL** | 2 / 5 | 11 / 10 | 25.0 | **1** |
| qwen_instruct | 10 / 8 | 4 / 2 | 10 / 10 | 22.0 | 0 |
| **qwen_reasoning** | **11 / 12** | **6 / 4** | **10 / 18** | **30.5** | **0** |
| intermediate_T04 | 9 / 12 | 4 / 4 | 13 / 11 | 26.5 | 0 |
| reppen_only | 6 / 6 | 3 / 5 | 10 / 10 | 20.0 | 0 |
| conservative_T03 | **FAIL** / 12 | 6 / 3 | 9 / 9 | 25.5 | **1** |

Aggregate: 42 runs, 40 successes, **2 loop failures**. Both fails were on document 04546 (the short one), both at `T ≤ 0.3`. Failure mode confirmed by Python `int()` overflow: model emitted a 5000+ digit integer in a `char_start` or `char_end` field — pure numeric loop, not a text loop. A more permissive parser (which is what I had in Go originally) would silently truncate and accept garbage.
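
A strict parser does not have to rely on the interpreter's own limits to catch this. The sketch below, using only the standard `json` module, rejects any integer literal longer than a fixed number of digits while decoding. The 6-digit cap is an assumption chosen to match a `maximum` of 100000 on the offset fields, and `bounded_int` / `parse_claims` are illustrative names.

```python
import json

MAX_INT_DIGITS = 6  # assumption: enough for offsets up to 100000


def bounded_int(digits: str) -> int:
    """parse_int hook: refuse absurdly long integer literals before converting."""
    if len(digits.lstrip("-")) > MAX_INT_DIGITS:
        raise ValueError(
            f"integer literal with {len(digits)} characters, likely a numeric loop"
        )
    return int(digits)


def parse_claims(raw: str) -> dict:
    """Decode the model output, failing loudly on runaway integers.

    json.loads hands the literal text of every JSON integer to parse_int,
    so digits that appear inside string values are not affected.
    """
    return json.loads(raw, parse_int=bounded_int)
```

A failed parse then becomes an explicit retry or `needs_review` case instead of a silently truncated offset.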

Average successful run latency: 63.8 s. Range 23.8–111.4 s on this prompt size (~6 KB system + 1 KB user).

### Findings

- **`qwen_reasoning` is the winner**: +15 % unique claim coverage over baseline, zero loop fails on the pathological doc, and it matches the official Qwen3.6 recommendation. Variance is higher on complex docs (19958: 10 vs 18 unique claims between runs), which defensive dedup on the consumer side has to absorb.

- **`T=0.2` (quasi-greedy) is the actual bug source.** On the pathological doc, 2 of the 8 runs at T ≤ 0.3 looped (25 %, or 14 % of all 14 runs on that doc), versus 0 of 6 at T ≥ 0.4. The official Qwen advice is empirically correct.

- **`repetition_penalty=1.1` strangles**: −25 % coverage. Not the right knob for structured generation.

- **`presence_penalty=1.5`** (official Qwen instruct mode value) is meant for short conversational replies, not multi-page JSON. Strangles too (−17 %).

- **`frequency_penalty=0.5`** (a desperate fix I tried earlier in the day) is catastrophic on structured output: −77 % coverage measured in a production smoke test. Avoid.

- **Schema hardening (`minLength=1` on text, `minimum/maximum` on integer fields, `maxItems`) is complementary**, not a replacement for the sampling fix. The hardened schema still failed once at T=0.2: the loop just shifted to another field (numeric instead of text).
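
The coverage percentages can be recomputed from the results table. A small Python sketch, assuming "Avg total" means the sum over the three documents of the per-document mean, with failed runs excluded from that mean (this reading reproduces every value in the table):

```python
FAIL = None  # marker for a loop-failed run

# (run 1, run 2) unique claims for docs 04546, 04547, 19958, from the table.
RESULTS = {
    "baseline_T02": [(11, 12), (5, 3), (11, 11)],
    "hardened_T02": [(11, FAIL), (2, 5), (11, 10)],
    "qwen_instruct": [(10, 8), (4, 2), (10, 10)],
    "qwen_reasoning": [(11, 12), (6, 4), (10, 18)],
    "intermediate_T04": [(9, 12), (4, 4), (13, 11)],
    "reppen_only": [(6, 6), (3, 5), (10, 10)],
    "conservative_T03": [(FAIL, 12), (6, 3), (9, 9)],
}


def avg_total(cells) -> float:
    """Sum of per-document means, skipping failed runs.

    Assumes every document has at least one successful run.
    """
    total = 0.0
    for runs in cells:
        ok = [r for r in runs if r is not FAIL]
        total += sum(ok) / len(ok)
    return total


def delta_vs_baseline(label: str) -> float:
    """Relative coverage change against baseline_T02, in percent."""
    base = avg_total(RESULTS["baseline_T02"])
    return (avg_total(RESULTS[label]) / base - 1) * 100


if __name__ == "__main__":
    for label in RESULTS:
        print(f"{label:<18} avg={avg_total(RESULTS[label]):5.1f} "
              f"delta={delta_vs_baseline(label):+6.1f} %")
```

With only two runs per cell these deltas are indicative rather than statistically tight, which is worth keeping in mind when comparing configs that sit within a claim or two of each other.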

### Final production config

Three coordinated changes, none of them sufficient alone:

**Server (vLLM)** — already shown above, the `24576 FP8` config.

**Client sampling** (Go pipeline payload):

```json
{
  "model": "qwen3.6-27b",
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "presence_penalty": 0.0,
  "max_tokens": 8192,
  "response_format": {"type": "json_schema", "json_schema": {...}}
}
```

**Client schema** (in addition to the domain fields):

```json
{
  "type": "object",
  "properties": {
    "claims": {
      "type": "array",
      "maxItems": 30,
      "items": {
        "properties": {
          "text": {"type": "string", "minLength": 1},
          "char_start": {"type": ["integer", "null"], "minimum": 0, "maximum": 100000},
          "char_end": {"type": ["integer", "null"], "minimum": 0, "maximum": 100000}
        }
      }
    }
  },
  "required": ["claims"]
}
```

**Client post-LLM**: defensive dedup on `(lowercased_stripped_text, char_start, char_end)` before INSERT, with a `needs_review` flag when `unique_count / total_count < 0.5` or `total > 30`. Catches the residual variance.
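
The dedup rule reads roughly like this in Python. It is a sketch of the logic described above, not the Go code from the pipeline; `dedup_claims` and its parameter names are illustrative.

```python
def dedup_claims(claims, max_claims=30, min_unique_ratio=0.5):
    """Drop repeated claims and decide whether the document needs review.

    Two claims are duplicates when their lowercased, stripped text and both
    character offsets match. Returns (unique_claims, needs_review).
    """
    seen = set()
    unique = []
    for claim in claims:
        key = (
            (claim.get("text") or "").strip().lower(),
            claim.get("char_start"),
            claim.get("char_end"),
        )
        if key not in seen:
            seen.add(key)
            unique.append(claim)

    total = len(claims)
    needs_review = total > max_claims or (
        total > 0 and len(unique) / total < min_unique_ratio
    )
    return unique, needs_review
```

On the smoke-test failure (68 claims, 67 of them identical) this keeps 2 claims and raises the flag on both conditions.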

### Cost on the full run

Estimated for 1402 .md files:

| Metric | Baseline (T=0.2) | qwen_reasoning |
|---|---|---|
| Avg claims latency per doc | 35–80 s | 60–110 s (+30 %) |
| Unique claims per doc | n | n × 1.15 |
| Loop-failed docs | ~2–5 % expected | 0 measured in 42 runs |
| Docs flagged `needs_review` | n/a | est. 5–15 / 1402 |

---

## What I'd hammer if anyone is doing the same setup

- **Don't trust `T=0.2` for any non-trivial JSON-schema-constrained generation on the Qwen3 family.** The official Qwen team flagged it, the vLLM Gemma ticket confirms it's a grammar+repetition interaction, and my 42-run bench reproduces it. Use T=0.6 minimum.

- **Don't use `repetition_penalty` or `frequency_penalty` to fight JSON loops**: they punish lexical variation in legitimate paraphrases. Wrong knob.

- **Schema fields that accept integers need bounded ranges.** A `char_start: integer` without `maximum` is an invitation to a numeric loop.

- **FP8 KV cache is the single best knob to push context length on a 32 GB consumer card.** Same VRAM footprint, ~2x effective context. Quality impact is negligible on top of an already-INT4-quantized model.

- **Always log `usage.completion_tokens`** when calling `/v1/chat/completions` with structured output. If your call routinely hits the max, you've got a silent failure mode.

- **Cold start on Qwen3.6-27B AWQ-INT4** with the `torch.compile` cache persisted to a Docker volume: ~96 s BF16, ~131 s with FP8 KV (extra calibration step). Without persisted cache: 141 s. Worth the volume mount.
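
For the `usage.completion_tokens` point, a minimal Python sketch of the check, assuming the response has already been decoded into a dict with the usual OpenAI-style `usage` and `choices[0].finish_reason` fields. The logger name and `hit_token_ceiling` are hypothetical.

```python
import logging

log = logging.getLogger("claim_pipeline")  # hypothetical logger name


def hit_token_ceiling(response: dict, max_tokens: int) -> bool:
    """Log completion usage and report whether the output was cut off.

    A run that stops with finish_reason "length", or that used the whole
    max_tokens budget, is treated as a suspected loop or truncation.
    """
    usage = response.get("usage") or {}
    completion_tokens = usage.get("completion_tokens", 0)
    choices = response.get("choices") or [{}]
    finish_reason = choices[0].get("finish_reason")
    log.info("completion_tokens=%d finish_reason=%s",
             completion_tokens, finish_reason)
    return finish_reason == "length" or completion_tokens >= max_tokens
```

Routing every `True` here to the same review queue as the dedup flag keeps truncated JSON from ever reaching the INSERT step.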

### Reproducibility

42-run bench script, results JSON, and exact prompt assets are kept on the server side under `/tmp/claim_bench/`. Happy to share if anyone wants to repro on their own Qwen3.6 quant variant — I expect the loop behavior to generalize across AWQ-INT4 / NVFP4 / GGUF, since the root cause is the model-level repetition bias × grammar masking, not the quantization.

If anyone has a clean explanation for why the loop on `char_start` produces a *single* 5000-digit integer rather than a stream of normal integers, I'd love to hear it. My hypothesis is that once the model commits to a digit token after `"char_start": `, the only grammar-valid next tokens are more digits or `,` / `}` — and if the digit-token transition probability beats the closing-token probability, it never closes.
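
That hypothesis can be put into a toy model. Suppose that inside the number only two options survive the grammar mask, "another digit token" and "close the number", and that the logit gap between them stays constant at every step. Both are assumptions for illustration, and the gap of 2.0 used in the usage note is invented, not measured on the model. The run length is then geometric, and temperature controls how sharply it blows up:

```python
import math


def p_digit(logit_gap: float, temperature: float) -> float:
    """Chance of sampling 'another digit' over 'close the number'.

    Two-way softmax: logit_gap is (digit logit - closing logit).
    """
    return 1.0 / (1.0 + math.exp(-logit_gap / temperature))


def expected_extra_digit_tokens(logit_gap: float, temperature: float) -> float:
    """Mean number of digit tokens emitted before the number closes.

    For a geometric run this is p / (1 - p), which equals exp(gap / T).
    """
    p = p_digit(logit_gap, temperature)
    return p / (1.0 - p)


if __name__ == "__main__":
    for temperature in (0.2, 0.3, 0.4, 0.6):
        print(f"T={temperature}: "
              f"~{expected_extra_digit_tokens(2.0, temperature):,.0f} "
              f"digit tokens expected")
```

With the invented gap of 2.0, the expected run is in the tens of thousands of digit tokens at T=0.2 and a few dozen at T=0.6. That would be consistent with one runaway integer rather than a stream of normal ones, but it is only a sketch of the mechanism, not evidence for it.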

---

## References

- Qwen3.5/3.6 sampling recommendations: [QwenLM/Qwen3.6 issue #145](https://github.com/QwenLM/Qwen3.6/issues/145)

- Grammar-amplified repetition (vLLM): [vllm-project/vllm issue #40080](https://github.com/vllm-project/vllm/issues/40080)

- Empty-array bug under guided JSON: [vllm-project/vllm issue #13821](https://github.com/vllm-project/vllm/issues/13821)

- vLLM Quantized KV Cache doc: [docs.vllm.ai: quantized_kvcache](https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/)

- vLLM Structured Outputs: [docs.vllm.ai: structured_outputs](https://docs.vllm.ai/en/v0.8.2/features/structured_outputs.html)

- Qwen3 official model card with sampling guidance: [Qwen/Qwen3-0.6B on HF](https://huggingface.co/Qwen/Qwen3-0.6B)

Setup date: 2026-05-19. Environment: `nvcr.io/nvidia/vllm:26.04-py3` (vLLM 0.19.0), Blackwell SM_120, RTX 5090 32 GB.

Qwen3.6-27B AWQ-INT4 on RTX 5090: KV cache FP8 at 24K context, and why low-temperature guided JSON loops on you