u/rs38

r/LocalLLM r/Vllm r/llamacpp

▲ 3 r/llamacpp+2 crossposts

vllm vs llama.cpp vs ollama vs sglang

What's your take?

Do you manage to get single-developer workflows that spawn subagents to actually benefit from the parallel-optimized engines?

from:

https://github.com/murataslan1/local-ai-coding-guide/blob/main/guides/runner-comparison.md

Are you a single developer on desktop?
├─ Yes → Do you want simplicity? → Ollama
│        Want fine control? → llama.cpp
│
└─ No → Running a team server?
        ├─ High throughput needed → vLLM
        └─ Structured JSON outputs → SGLang
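For anyone who prefers to read the tree as code, here is a rough sketch of the same decision logic. The function name `pick_runner` and its parameters are made up for illustration and are not from the linked guide. The tree also doesn't say which team-server branch wins when both apply, so checking structured JSON first is my assumption.

```python
def pick_runner(
    single_developer: bool,
    wants_simplicity: bool = True,
    needs_structured_json: bool = False,
) -> str:
    """Walk the decision tree quoted above and return a runner name.

    All parameter names are hypothetical; the guide only gives the tree.
    """
    if single_developer:
        # Desktop branch: simplicity -> Ollama, fine control -> llama.cpp
        return "Ollama" if wants_simplicity else "llama.cpp"
    # Team-server branch. Assumption: structured JSON output is checked
    # first, and high throughput (vLLM) is the fallback.
    return "SGLang" if needs_structured_json else "vLLM"


if __name__ == "__main__":
    print(pick_runner(single_developer=True, wants_simplicity=False))
```

Note that the tree has no branch for a single developer who wants parallel throughput, which is exactly the case the question above is asking about.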

u/rs38 — 12 days ago

▲ 2 r/Vllm

Is there a way to run vLLM on Windows/WSL?

(This seems to be an issue only with newer vLLM versions (v0.22/23) on CUDA?)

I didn't expect vLLM to be that complicated compared to llama.cpp...

I wanted to try more parallelism and safetensors models, but failed to get it to run at all:

Initializing a V1 LLM engine (v0.23.0) with config:
model='Qwen/Qwen3-0.6B',

[...]

WARNING  Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
[...]
 Using V2 Model Runner
ERROR  [core.py:1195] EngineCore failed to start.

The same happens when running with Docker.

Or the error can be:

ERROR 05-22 10:02:39 [core.py:1159] RuntimeError: UVA is not available

https://github.com/vllm-project/vllm/issues/43381

I noticed there are at least 2 Windows forks, which I haven't tried so far.
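Since the log above shows vLLM switching behavior once WSL is detected, it can help to confirm what a Python process sees. Below is a minimal sketch of a common heuristic, which looks for "microsoft" in the kernel release string. The helper name `looks_like_wsl` is hypothetical, and this is not claimed to be the exact check vLLM performs.

```python
import platform
from typing import Optional


def looks_like_wsl(release: Optional[str] = None) -> bool:
    """Heuristic WSL check based on the kernel release string.

    WSL kernels typically report a release containing "microsoft",
    e.g. "5.15.90.1-microsoft-standard-WSL2". This is an assumption
    about typical setups, not a guaranteed interface.
    """
    if release is None:
        # Fall back to the release string of the running kernel.
        release = platform.uname().release
    return "microsoft" in release.lower()


if __name__ == "__main__":
    print("kernel release:", platform.uname().release)
    print("looks like WSL:", looks_like_wsl())
```

Running this inside the same environment (or container) that fails to start the engine shows whether the WSL code path is the one being taken.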

u/rs38 — 15 days ago