Hey all. Just set up a workstation with two NVIDIA RTX PRO 6000 Blackwells (96GB VRAM each) for our design studio. Want to use Ollama as our main local inference layer.

What we want to do with it:

Internal copilot for a ~60 person team. research, writing, brief analysis, code assist
Backend for agentic tools we're building (API access is a big reason we picked Ollama)
Run the biggest, best models our hardware can handle

Specific questions:

How well does Ollama handle dual GPU setups out of the box? Any config needed for tensor parallelism across both cards?
What models would you recommend at this VRAM level? Thinking Llama 3.1 70B unquantized, maybe even 405B at Q4?
Anyone serving Ollama to a team via Open WebUI or similar? How's the experience at 10-15 concurrent users?
Any gotchas with large model loading times or memory management I should know about?

First time running Ollama beyond hobby experiments, so any production-ish tips are appreciated. Will report back with what works.

------

UPDATE FOR OTHERS & THANKS FOR THE HELP . THIS SUB WASN'T AS SNARKY AND IN FACT A LOT MORE HELPFUL THAN THE OTHER ONE.

For context: we're a design agency rendering 3D animations, VR/AR walkthroughs, and architectural visualizations. Not generating AI images or running Stable Diffusion farms. The dual RTX Pro 6000s (96 GB VRAM each) are a dedicated render node that processes overnight animation batches and path-traced scenes while our design team stays productive on their own workstations. Cloud rendering costs add up absurdly fast at our project volume. Owning the hardware pays for itself in months. OctaneRender and Redshift scale linearly across both GPUs, which turns 12+ hour VR renders into something we can actually deliver on client deadlines.

Key Technical Advice & Actionables

Infrastructure Stack (Overwhelming Consensus)

Switch from Ollama to vLLM or llama.cpp

169 upvotes on "Tip #1 don't use Ollama"
109 upvotes on criticism of using Ollama with $25k hardware
vLLM is the top recommendation for multi-user concurrency (your 10-15 concurrent users scenario)
llama.cpp is acceptable for single-user or simpler setups, but vLLM wins for parallelization

Use Linux instead of Windows

266 upvotes on "Tip #2 use Linux"
Ubuntu LTS 24.04 most recommended for NVIDIA driver support
Debian headless for maximum resource efficiency
Debate exists: some claim Windows CUDA drivers are 2-3% faster for pure VRAM inference, but Linux wins for stability and virtual memory handling

Model Recommendations

Stop using Llama 3.1 70B (described as "ancient" and "severely outdated")

Minimax M2.7 (230B MoE, 10B active) with NVFP4 quantization — perfect fit for your dual 96GB setup
Qwen 3.5/3.6 series (27B, 35B MoE, 122B) — excellent dense models, great for agentic tasks
Gemma 4 — recommended if you need "western" models (some companies ban Chinese models)
Mistral Medium 3.5 (119B MoE) or new Mistral 128B dense — good for massive context windows

Critical Configuration Settings

Use Tensor Parallelism (tp=2)

Splits model across both GPUs for unified inference
Doubles speed and allows models up to ~180-190GB total
Essential command: --tp 2 in vLLM or llama.cpp

Use NVFP4 Quantization

Hardware-accelerated 4-bit format specifically for Blackwell architecture
Minimax M2.7 NVFP4 fits in 130.6GB (down from 230GB)
Multiple users emphasized this is purpose-built for your cards

Optimize for Concurrency

Use litellm as a model router in front of vLLM for rate limiting and monitoring
Set --gpu-memory-utilization 0.9 or higher to maximize KV cache
SGLang recommended over vLLM if team works on same projects (prefix caching with RadixAttention)
For 60-person team: expect 5-8 simultaneous users per card on 70B Q4 before throughput drops

System Architecture

Cooling & Power Management

GPU spacing: minimum 2 slots apart for adequate airflow
Consider power limiting cards to reduce heat and increase stability
Script fixed clock times (10MHz below stock) to prevent PCIe bus spikes
Heat management is critical for sustained inference loads

RAM Requirements

Minimum 256GB system RAM
Recommendation: 2× VRAM = 384-512GB system RAM for optimal performance
Essential for virtual memory handling during large context operations

Frontend & User Access

Open WebUI is acceptable for team deployment (contrary to one dismissive comment)
Alternative: Set up litellm for monitoring, rate limiting, API key generation
Some debate about OpenWebUI in 2026, but no clear superior alternative mentioned for your use case

Specific Guides & Resources Mentioned

vLLM Blackwell guide: https://github.com/lastloop-ai/vllm-blackwell-guide (120+ t/s on Qwen 27B, 200+ t/s on 35B MoE)
Ollama agent configs: https://github.com/caliber-ai-org/ai-setup (888 stars, production patterns for team deployment)
llama-swap tool for dynamic model switching without container restarts

Hiring & Operational Advice

Top upvoted wisdom (113+ votes on original thread you referenced): "Storage, model management, permissions, and user access become more important than the GPUs after week one. Hire someone experienced with this stack."

Hi folks, I run a 60-person design agency (brand, UI/UX, motion, CGI) and we just invested in a high-end dual-GPU workstation. Two NVIDIA RTX PRO 6000 Blackwells.

Now I want to squeeze every bit of value out of this thing. Here's what we're looking to do:

Use cases:

Design workflows | AI-assisted ideation, image gen, upscaling, style transfer
Local inference | running open-weight LLMs for internal research, copywriting, code assist, client brief analysis
Fine-tuning | potentially training LoRAs or small domain-specific models on our design/brand data
Video & motion | AI-assisted animation, interpolation, video gen experiments

What I'd love advice on:

What models should I be running locally with this VRAM? (96GB × 2)
Best serving stack? (vLLM, Ollama, text-generation-webui, something else?)
Anyone running Stable Diffusion / ComfyUI / Flux on similar hardware. What's your workflow?
Any tips on multi-GPU setup for inference vs. keeping one GPU free for rendering?

Open to any "I wish I'd known this on day one" advice. Thanks!

u/AmanNonZero