Qwen 3.6 27B - vLLM Performance Benchmark Results (BF16, FP8, NVFP4)
Sharing some tests of Qwen 3.6 27B on vLLM across the popular quants on my development system. I used llama-benchy to generate the results, then fed them into an LLM to format the tables for readability.
While NVFP4 is blazing fast, I have had looping issues in Copilot that I don't get with BF16, and its responses in agent mode generally seem less thorough than those of the higher-precision quants. Based on these results, FP8 seems to be the right choice. I'm sure some of the parameters can be tuned further for better performance, but all of these were plenty fast enough for coding purposes.
I used to use llama.cpp, but have found that vLLM is faster in practice (due to paged attention) as well as more stable (llama.cpp would frequently give me random errors, requiring me to reset the prompt or restart the service).
If you have any comments or suggestions for improvement, let me know.
Test System:
Motherboard: Asus Proart Z890
CPU: Intel 270K plus
RAM: 96GB DDR5 (6000 MT/s)
GPU: RTX 6000 Pro Blackwell 96GB (Max-Q, ECC enabled)
Software:
OS : Ubuntu 26.04 LTS (x86_64)
Python version : 3.12.13
vLLM Version : 0.24.0
NVIDIA-SMI 595.71.05
CUDA Version: 13.2
Models:
Qwen 3.6 27B - BF16 and FP8 (HF Qwen)
Qwen 3.6 27B - NVFP4 (HF Nvidia)
* Replaced the bundled Jinja chat templates with the fixed chat template
vLLM Parameters:
GPU_COUNT="1"
MAX_LEN="262144"
export VLLM_USE_DEEP_GEMM=0
export FLASHINFER_MAX_NUM_TOKENS=8192
export TORCH_CUDA_ARCH_LIST="12.0f"
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True
export VLLM_USE_FLASHINFER_SAMPLER=1
vllm serve "$MODEL_PATH" \
--port "$PORT" \
--tensor-parallel-size "$GPU_COUNT" \
--max-model-len "$MAX_LEN" \
--performance-mode interactivity \
--attention-backend FLASHINFER \
--gpu-memory-utilization 0.88 \
--max-num-seqs 2 \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--enable-prefix-caching \
--trust-remote-code
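
To get a feel for how `--gpu-memory-utilization 0.88` splits the card between weights and KV cache, here is a back-of-envelope sketch. It assumes a dense model with about 27 billion parameters (taken from the model name), 96 GB of usable VRAM, and nominal bytes per weight for each format. The function name and the byte counts are my own assumptions, and real checkpoints usually keep some layers in higher precision, so treat the output as a rough guide only.

```python
# Rough VRAM budget for one GPU. Assumptions (not measured in this post):
# ~27e9 parameters, 96 GB of usable VRAM, nominal bytes per weight.
GB = 1e9

BYTES_PER_WEIGHT = {
    "BF16": 2.0,
    "FP8": 1.0,
    "NVFP4": 0.5625,  # 4-bit values plus one FP8 scale per 16 weights (4.5 bits)
}

def memory_budget_gb(fmt, vram_gb=96.0, utilization=0.88, params=27e9):
    """Return (budget, weights, remainder) in GB for a quantization format."""
    budget = vram_gb * utilization
    weights = params * BYTES_PER_WEIGHT[fmt] / GB
    return budget, weights, budget - weights

for fmt in BYTES_PER_WEIGHT:
    budget, weights, rest = memory_budget_gb(fmt)
    print(f"{fmt:6s} budget {budget:5.1f} GB  weights ~{weights:5.1f} GB  "
          f"left for KV cache/activations ~{rest:5.1f} GB")
```

The remainder is what the 262144-token `MAX_LEN` and the FP8 KV cache have to fit into, which is one reason the lower-precision checkpoints leave more headroom for long contexts.
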
Key Performance Takeaways
- NVFP4 dominates token generation speed (~2.6x faster than BF16): Because token decoding is largely memory-bandwidth bound, compressing weights to 4-bit sharply reduces the amount of weight data read from VRAM per token, allowing generation throughput to jump from ~63 t/s (BF16) to ~164 t/s (NVFP4), averaged across the context depths tested.
- FP8 wins on prompt processing and prefill speed (~9–21% faster than BF16 in the pp2048 runs): Prompt prefill is compute-bound (heavy matrix math). FP8 runs natively on the Tensor Cores, which likely explains why it beats both BF16 and NVFP4 during ingestion.
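- NVFP4 has a slight prefill penalty vs. FP8: NVFP4 trails FP8 by ~8–15% in prompt processing speed at 4k context and beyond (the two are effectively tied at 0k), though it still outperforms baseline BF16. A plausible cause is extra quantization and dequantization work in the NVFP4 path during large, compute-heavy prefill batches, but this was not profiled here.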
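
The bandwidth argument in the first takeaway can be turned into a rough ceiling: if each decode step has to read every weight once, the number of steps per second cannot exceed memory bandwidth divided by the weight size. The sketch below assumes ~27e9 parameters and a memory bandwidth of 1792 GB/s; both are my assumptions, so substitute the figures for your own model and GPU. It ignores KV cache reads and any layers kept in higher precision.

```python
# Back-of-envelope decode ceiling. All inputs are assumptions, not measurements:
# params ~27e9 (from the model name), bandwidth 1792 GB/s (assumed spec figure).
def decode_ceiling_tps(bytes_per_weight, params=27e9, bandwidth_gbps=1792.0):
    """Upper bound on decode steps/s if every weight is read once per step."""
    weight_bytes = params * bytes_per_weight
    return bandwidth_gbps * 1e9 / weight_bytes

def implied_tokens_per_step(measured_tps, bytes_per_weight):
    """Tokens per step needed to reach the measured rate at the ceiling."""
    return measured_tps / decode_ceiling_tps(bytes_per_weight)

# Measured tg32 rates at 0k context, copied from the table in section 1.
for fmt, bpw, measured in [("BF16", 2.0, 59.10),
                           ("FP8", 1.0, 97.49),
                           ("NVFP4", 0.5625, 169.23)]:
    ceiling = decode_ceiling_tps(bpw)
    print(f"{fmt:6s} ceiling ~{ceiling:6.1f} steps/s, "
          f"measured {measured:6.2f} t/s, "
          f"implies >= {implied_tokens_per_step(measured, bpw):.2f} tokens/step")
```

Under these assumptions the measured rates sit above the one-token-per-step ceiling, which is consistent with MTP speculative decoding being enabled: with `num_speculative_tokens` set to 2, a single step can emit up to 3 tokens.
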
1. Token Generation Speed (tg32 Throughput)
Higher is better. Measures decoding speed when generating 32 new tokens across increasing context depths.
| Context Depth | BF16 (t/s) | FP8 (t/s) | NVFP4 (t/s) | Speedup (NVFP4 vs BF16) |
|---|---|---|---|---|
| Base (0k) | 59.10 ± 1.67 | 97.49 ± 4.08 | 169.23 ± 9.02 | 2.86x |
| 4k Context | 63.01 ± 3.63 | 103.03 ± 4.46 | 157.90 ± 14.55 | 2.51x |
| 8k Context | 67.55 ± 2.70 | 96.88 ± 5.11 | 166.52 ± 9.93 | 2.47x |
| 16k Context | 64.57 ± 2.99 | 101.51 ± 7.11 | 171.12 ± 0.50 | 2.65x |
| 32k Context | 59.46 ± 3.68 | 100.48 ± 4.33 | 158.04 ± 16.51 | 2.66x |
| 65k Context | 61.55 ± 2.81 | 98.99 ± 5.06 | 159.91 ± 7.52 | 2.60x |
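
The speedup column is a plain ratio of the two means, so it hides the run-to-run spread. A minimal sketch of how to attach an uncertainty to it, assuming the ± values are standard deviations and that the BF16 and NVFP4 runs are independent (neither is confirmed by the benchmark output):

```python
import math

def ratio_with_error(a, a_err, b, b_err):
    """Return (a/b, propagated error), assuming independent errors."""
    ratio = a / b
    # Relative errors of a quotient add in quadrature.
    rel = math.sqrt((a_err / a) ** 2 + (b_err / b) ** 2)
    return ratio, ratio * rel

# (NVFP4 mean, NVFP4 ±, BF16 mean, BF16 ±) copied from the table above.
rows = {
    "0k": (169.23, 9.02, 59.10, 1.67),
    "4k": (157.90, 14.55, 63.01, 3.63),
    "32k": (158.04, 16.51, 59.46, 3.68),
}
for depth, (nv, nv_err, bf, bf_err) in rows.items():
    speedup, err = ratio_with_error(nv, nv_err, bf, bf_err)
    print(f"{depth:>3s}: {speedup:.2f}x ± {err:.2f}")
```

With error bars of this size, the differences in speedup between context depths are mostly within noise, so the ~2.6x average is the more meaningful number.
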
2. Prompt Processing Speed (pp2048 Throughput)
Higher is better. Measures ingestion speed when prefilling 2048 prompt tokens across existing context depths.
| Context Depth | BF16 (t/s) | FP8 (t/s) | NVFP4 (t/s) | Speedup (FP8 vs BF16) |
|---|---|---|---|---|
| Base (0k) | 4359.28 ± 66.84 | 4747.78 ± 9.40 | 4732.42 ± 17.77 | 1.09x |
| 4k Context | 1856.76 ± 9.93 | 2250.71 ± 0.84 | 2010.97 ± 3.54 | 1.21x |
| 8k Context | 2095.89 ± 6.85 | 2479.30 ± 16.20 | 2191.59 ± 2.93 | 1.18x |
| 16k Context | 1765.10 ± 13.83 | 2029.02 ± 13.96 | 1832.65 ± 3.78 | 1.15x |
| 32k Context | 1317.16 ± 21.52 | 1503.80 ± 6.42 | 1388.85 ± 8.14 | 1.14x |
| 65k Context | 880.40 ± 6.51 | 1058.40 ± 33.99 | 902.65 ± 3.01 | 1.20x |
3. Full Context Prefill Latency (ctx_pp End-to-End TTFT)
Lower is better. Measures total Time-To-First-Token (in milliseconds) required to ingest and evaluate the full context at each depth.
| Context Depth | BF16 (ms) | FP8 (ms) | NVFP4 (ms) | FP8 Latency Reduction |
|---|---|---|---|---|
| 4k Context | 1023.29 ± 6.08 | 833.65 ± 14.57 | 927.45 ± 1.68 | -18.5% |
| 8k Context | 1974.69 ± 1.80 | 1415.69 ± 11.07 | 1869.70 ± 4.42 | -28.3% |
| 16k Context | 4122.54 ± 18.20 | 2926.47 ± 6.89 | 3927.95 ± 4.72 | -29.0% |
| 32k Context | 9179.91 ± 58.16 | 6572.61 ± 8.87 | 8692.01 ± 30.53 | -28.4% |
| 65k Context | 21760.57 ± 85.68 | 16425.60 ± 137.66 | 20613.26 ± 18.28 | -24.5% |
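
The last column is the percent change in TTFT relative to BF16. A short sketch that recomputes it from the table means and applies the same formula to NVFP4, which the table does not show (the helper name is mine):

```python
def latency_change_pct(baseline_ms, candidate_ms):
    """Percent change in latency relative to the baseline (negative = faster)."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0

# depth: (BF16 ms, FP8 ms, NVFP4 ms), means copied from the table above.
ttft_ms = {
    "4k": (1023.29, 833.65, 927.45),
    "8k": (1974.69, 1415.69, 1869.70),
    "16k": (4122.54, 2926.47, 3927.95),
    "32k": (9179.91, 6572.61, 8692.01),
    "65k": (21760.57, 16425.60, 20613.26),
}
for depth, (bf16, fp8, nvfp4) in ttft_ms.items():
    print(f"{depth:>3s}: FP8 {latency_change_pct(bf16, fp8):+.1f}%  "
          f"NVFP4 {latency_change_pct(bf16, nvfp4):+.1f}%")
```

By the same measure NVFP4 comes out roughly 5–9% faster than BF16 on full-context prefill, well short of the FP8 gain.
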
4. Standalone Peak & First-Token Metrics
Measures peak recorded generation speed and baseline TTFT without context saturation.
| Quantization Format | Peak Generation Throughput (peak t/s) | Baseline TTFT (pp2048 ttfr) | Estimated Prompt Processing Time (pp2048 est_ppt) |
|---|---|---|---|
| BF16 | 61.01 ± 1.72 t/s | 525.03 ± 7.29 ms | 470.14 ± 7.29 ms |
| FP8 | 100.63 ± 4.21 t/s | 469.82 ± 0.85 ms | 431.57 ± 0.85 ms |
| NVFP4 | 174.69 ± 9.31 t/s | 467.40 ± 1.62 ms | 432.98 ± 1.62 ms |
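
To weigh prefill against generation speed, the TTFT figures from section 3 and the tg32 rates from section 1 can be combined into a rough wall-clock estimate for one request. This is only a sketch: it assumes a cold 32k-token prompt with no prefix-cache hit, a constant decode rate, and hypothetical reply lengths of 200 and 1000 tokens.

```python
def response_time_s(ttft_ms, decode_tps, output_tokens):
    """Estimated wall time: prefill latency plus decode time."""
    return ttft_ms / 1000.0 + output_tokens / decode_tps

# 32k-context figures copied from sections 1 and 3: (TTFT ms, tg32 t/s).
at_32k = {
    "BF16": (9179.91, 59.46),
    "FP8": (6572.61, 100.48),
    "NVFP4": (8692.01, 158.04),
}
for output_tokens in (200, 1000):  # hypothetical reply lengths
    for fmt, (ttft, tps) in at_32k.items():
        total = response_time_s(ttft, tps, output_tokens)
        print(f"{output_tokens:5d} tokens  {fmt:6s} ~{total:5.1f} s")
```

Under these assumptions FP8 finishes first on short replies because of its faster prefill, while NVFP4 pulls ahead on long replies. With prefix caching enabled, repeated context is not re-prefilled, so the cold-prompt case here is closer to a worst case than a typical agent turn.
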