u/povedaaqui

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with:

docker run --gpus all \
--name qwen36-aggressive \
--restart unless-stopped \
-p 8000:8000 \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--shm-size=32g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \
vllm/vllm-openai:cu130-nightly \
--model Qwen/Qwen3.6-35B-A3B-FP8 \
--served-model-name qwen36 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.75 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 4 \
--attention-backend flashinfer \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--reasoning-parser qwen3 \
--performance-mode throughput \
--default-chat-template-kwargs '{"preserve_thinking":true}' \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

It boots successfully and seems stable so far, but I’d love opinions from people running similar long-context / agentic setups.

Any feedback or suggestions are welcome.

docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -e VLLM_ATTENTION_BACKEND=FLASHINFER \ -e FLASHINFER_DISABLE_VERSION_CHECK=1 \ -e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \ vllm/vllm-openai:cu130-nightly \ --model Qwen/Qwen3.6-35B-A3B-FP8 \ --served-model-name qwen36 \ --host 0.0.0.0 \ --port 8000 \ --tensor-parallel-size 1 \ --gpu-memory-utilization 0.75 \ --dtype auto \ --kv-cache-dtype fp8 \ --max-model-len 262144 \ --max-num-batched-tokens 32768 \ --max-num-seqs 4 \ --attention-backend flashinfer \ --enable-prefix-caching \ --enable-chunked-prefill \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --trust-remote-code \ --reasoning-parser qwen3 \ --performance-mode throughput \ --default-chat-template-kwargs '{"preserve_thinking":true}' \ --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \ --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?