u/povedaaqui

▲ 9 r/Vllm

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with:

docker run --gpus all \
--name qwen36-aggressive \
--restart unless-stopped \
-p 8000:8000 \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--shm-size=32g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \
vllm/vllm-openai:cu130-nightly \
--model Qwen/Qwen3.6-35B-A3B-FP8 \
--served-model-name qwen36 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.75 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 4 \
--attention-backend flashinfer \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--reasoning-parser qwen3 \
--performance-mode throughput \
--default-chat-template-kwargs '{"preserve_thinking":true}' \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

It boots successfully and seems stable so far, but I’d love opinions from people running similar long-context / agentic setups.

Any feedback or suggestions are welcome.

reddit.com
u/povedaaqui — 1 day ago

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with:

docker run --gpus all \
--name qwen36-aggressive \
--restart unless-stopped \
-p 8000:8000 \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--shm-size=32g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \
vllm/vllm-openai:cu130-nightly \
--model Qwen/Qwen3.6-35B-A3B-FP8 \
--served-model-name qwen36 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.75 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 4 \
--attention-backend flashinfer \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--reasoning-parser qwen3 \
--performance-mode throughput \
--default-chat-template-kwargs '{"preserve_thinking":true}' \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

It boots successfully and seems stable so far, but I’d love opinions from people running similar long-context / agentic setups.

Any feedback or suggestions are welcome.

reddit.com
u/povedaaqui — 1 day ago