u/live4evrr

Qwen 3.6 27B - VLLM Performance Benchmark Results (BF16, FP8, NVFP4)

Sharing some testing of Qwen 3.6 27B using VLLM across the popular quants on my development system. I used llama benchy to generate the results, then fed it into an LLM to format it the tables for readibility.

While NVFP4 is blazing fast, have had looping issues in copilot that I don't get with BF16, and the responses in general when used in agent mode seem to be less thorough than the higher quants. Based on these results, FP8 seems to be the right choice. Some of the parameters can be further tuned I'm sure to get better performance but these are were all plenty fast enough for coding purposes.

I used to use llama.cpp, but have found that VLLM is in practice is faster (due to paged attention), as well as more stable (llama.cpp would give me random errors that happen frequently, requiring me to reset the prompt or restart the service).

If you have any comments or suggestions to improve let me know.

Test System:

Motherboard: Asus Proart Z890

CPU: Intel 270K plus

RAM: 96GB DDR5 (6000MHZ)

GPU: RTX 6000 Pro Blackwell 96GB (Max-Q, ECC enabled)

Software:

OS : Ubuntu 26.04 LTS (x86_64)

Python version : 3.12.13

vLLM Version : 0.24.0

NVIDIA-SMI 595.71.05

CUDA Version: 13.2

Models:

Qwen 3.6 27B - BF16 and FP8 (HF Qwen)

Qwen 3.6 27B - NVFP4 (HF Nvidia)

* replaced the delivered jinja scripts with the fixed chat template

VLLM Parameters:

GPU_COUNT="1"

MAX_LEN="262144"

export VLLM_USE_DEEP_GEMM=0

export FLASHINFER_MAX_NUM_TOKENS=8192

export TORCH_CUDA_ARCH_LIST="12.0f"

export TORCH_FLOAT32_MATMUL_PRECISION=high

export PYTORCH_ALLOC_CONF=expandable_segments:True

export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve "$MODEL_PATH" \

--port "$PORT" \

--tensor-parallel-size "$GPU_COUNT" \

--max-model-len "$MAX_LEN" \

--performance-mode interactivity \

--attention-backend FLASHINFER \

--gpu-memory-utilization 0.88 \

--max-num-seqs 2 \

--enable-chunked-prefill \

--max-num-batched-tokens 8192 \

--kv-cache-dtype fp8 \

--reasoning-parser qwen3 \

--enable-auto-tool-choice \

--tool-call-parser qwen3_coder \

--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \

--enable-prefix-caching \

--trust-remote-code

Key Performance Takeaways

  • NVFP4 dominates token generation speed (~2.6x faster than BF16): Because token decoding is strictly memory-bandwidth bound, compressing weights to 4-bit dramatically slashes PCIe/VRAM data transfers, allowing generation throughput to jump from ~61 t/s (BF16) up to ~163 t/s (NVFP4).
  • FP8 wins on prompt processing & prefill speed (~20% faster than BF16): Prompt prefill is compute-bound (heavy matrix math). FP8 leverages native Tensor Core acceleration with zero dequantization overhead, beating both BF16 and NVFP4 during ingestion.
  • NVFP4 has a slight prefill penalty vs. FP8: Because NVFP4 must dequantize weights on the fly during large compute-heavy prefill batches, it trails FP8 by ~10–15% in prompt processing speed, though it still outperforms baseline BF16.

1. Token Generation Speed (tg32 Throughput)

Higher is better. Measures decoding speed when generating 32 new tokens across increasing context depths.

Context Depth BF16 (t/s) FP8 (t/s) NVFP4 (t/s) Speedup (NVFP4 vs BF16)
Base (0k) 59.10 ± 1.67 97.49 ± 4.08 169.23 ± 9.02 2.86x
4k Context 63.01 ± 3.63 103.03 ± 4.46 157.90 ± 14.55 2.51x
8k Context 67.55 ± 2.70 96.88 ± 5.11 166.52 ± 9.93 2.47x
16k Context 64.57 ± 2.99 101.51 ± 7.11 171.12 ± 0.50 2.65x
32k Context 59.46 ± 3.68 100.48 ± 4.33 158.04 ± 16.51 2.66x
65k Context 61.55 ± 2.81 98.99 ± 5.06 159.91 ± 7.52 2.60x

2. Prompt Processing Speed (pp2048 Throughput)

Higher is better. Measures ingestion speed when prefilling 2048 prompt tokens across existing context depths.

Context Depth BF16 (t/s) FP8 (t/s) NVFP4 (t/s) Speedup (FP8 vs BF16)
Base (0k) 4359.28 ± 66.84 4747.78 ± 9.40 4732.42 ± 17.77 1.09x
4k Context 1856.76 ± 9.93 2250.71 ± 0.84 2010.97 ± 3.54 1.21x
8k Context 2095.89 ± 6.85 2479.30 ± 16.20 2191.59 ± 2.93 1.18x
16k Context 1765.10 ± 13.83 2029.02 ± 13.96 1832.65 ± 3.78 1.15x
32k Context 1317.16 ± 21.52 1503.80 ± 6.42 1388.85 ± 8.14 1.14x
65k Context 880.40 ± 6.51 1058.40 ± 33.99 902.65 ± 3.01 1.20x

3. Full Context Prefill Latency (ctx_pp End-to-End TTFT)

Lower is better. Measures total Time-To-First-Token (in milliseconds) required to ingest and evaluate the entire context window.

Context Depth BF16 (ms) FP8 (ms) NVFP4 (ms) FP8 Latency Reduction
4k Context 1023.29 ± 6.08 833.65 ± 14.57 927.45 ± 1.68 -18.5%
8k Context 1974.69 ± 1.80 1415.69 ± 11.07 1869.70 ± 4.42 -28.3%
16k Context 4122.54 ± 18.20 2926.47 ± 6.89 3927.95 ± 4.72 -29.0%
32k Context 9179.91 ± 58.16 6572.61 ± 8.87 8692.01 ± 30.53 -28.4%
65k Context 21760.57 ± 85.68 16425.60 ± 137.66 20613.26 ± 18.28 -24.5%

4. Standalone Peak & First-Token Metrics

Measures peak recorded generation speed and baseline TTFT without context saturation.

Quantization Format Peak Generation Throughput (peak t/s) Baseline TTFT (pp2048 ttfr) Estimated PPT (pp2048 est_ppt)
BF16 61.01 ± 1.72 t/s 525.03 ± 7.29 ms 470.14 ± 7.29 ms
FP8 100.63 ± 4.21 t/s 469.82 ± 0.85 ms 431.57 ± 0.85 ms
NVFP4 174.69 ± 9.31 t/s 467.40 ± 1.62 ms 432.98 ± 1.62 ms
reddit.com
u/live4evrr — 8 hours ago