u/joorklee — reddlx

▲ 26 r/LocalLLaMA

Who Has The “Jankiest” Local LLM Setup? | Non-Official | Fun Contest | No Prizes

Had an idea for a fun no prize/non official competition to see who has the “Jankiest” local LLM setup.

NOTE: This is NOT an official competition. There are NO prizes. This is just for fun.

Rules:

One Submission via comment per person
Has to be your current setup or your previous setup.
Submission comment cannot be modified after posting to ensure no photo swapping occurs.
No prizes. This reduces the incentive to rig the competition, and this is not an official contest anyway.
The highest upvoted submission that doesn’t violate reddit TOS, r/LocalLLaMA rules, or the rules of this non official competition will be declared the winner 24 hours after this post goes up.

Requirements for submission:

Photo of the local llm setup,
Any explanation/benchmarks/etc (optional) that you want to include

reddit.com

u/joorklee — 4 hours ago

▲ 13 r/LocalLLaMA

Findings from troubleshooting p2p on 4x5060 ti bifurcation.

I spent the last week deep diving into this. For context, I’ve been using Linux for 14 years and am a cloud systems engineer with a focus on supported Linux infrastructure for a private cloud provider.

Essentially, if you are using a single 4x4 bifurcation pcie x16 card inserted into the x16 slot on your mobo with 4x gpus connected to it, then regardless of pcie generation, the card that does the bifurcation is the choke point for p2p communication. It acts as the pcie bridge that connects the gpus, and with TP=4 the bandwidth of the fabric that connects the 4 cards on that pcie bridge will become saturated and yield worse performance than with p2p off. The ways to deal with this would be to either:

Don’t run p2p. It’s only a 10 to 15% gain and may not justify the cost and effort of building a setup where p2p actually delivers that 10%.
Pick up a Chinese slimsas bifurcation bridge. Supposedly you might not hit this bottleneck with those. They run between $150 and $250.
Buy a $1200 gen 4 pcie bridge from Cpayne. These devices are specifically made for this use case, but a $1200 expense for a 10% performance gain probably isn’t worth it.
Don’t use tensor parallelism. Use pipeline parallelism. The downside is that in my benchmarks pipeline parallelism yielded worse performance at low concurrency than TP=4 + P2P off. PP=4 only yields better performance if you have enough concurrency that every gpu always has something to work on and none of them are waiting on another GPU to finish its work.
There are used PLX switches on eBay. But with these you run the risk of them not supporting a multi GPU setup with P2P, due to firmware restrictions that limit the use of non storage devices with them.
Have a motherboard and cpu combo that provides dedicated x16 lanes to both the primary and secondary x16 slots. You could run 8i bifurcation on both of these with 2 gpus on each. But if that setup requires a retimer to get gen 4 or gen 5, then you are talking $130+ for each of the two retimer bifurcation cards.
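For a rough sense of the bandwidth budget behind the options above, here is a minimal sketch of per-link PCIe throughput. The per-lane figures follow from the standard transfer rates and 128b/130b encoding for gen 3 and later. The helper names are made up for illustration, and the numbers are raw one-direction link rates that ignore protocol overhead and do not model the contention described in this post.

```python
# Raw PCIe transfer rate per lane in GT/s for each generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def lane_gb_s(gen: int) -> float:
    """Approximate one-direction bandwidth of a single lane in GB/s.

    Gen 3 and later use 128b/130b encoding, so 128 of every 130
    transferred bits are payload; divide by 8 to convert bits to bytes.
    """
    return GT_PER_S[gen] * (128 / 130) / 8


def link_gb_s(gen: int, lanes: int) -> float:
    """Approximate one-direction bandwidth of a link with `lanes` lanes."""
    return lane_gb_s(gen) * lanes


for gen in (3, 4, 5):
    per_gpu = link_gb_s(gen, 4)   # each GPU behind a 4x4 bifurcation card
    slot = link_gb_s(gen, 16)     # the full x16 slot feeding that card
    print(f"gen {gen}: x4 per GPU ~{per_gpu:.2f} GB/s, x16 slot ~{slot:.2f} GB/s")
```

Each gpu on an x4 link gets a quarter of what the full slot carries, which is why all-reduce traffic at TP=4 has so little headroom.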

If there is a solution to this that I didn’t list, please let me know and I’ll update this post.

reddit.com

u/joorklee — 9 days ago

▲ 6 r/LocalLLaMA

Idea for how to run GLM2 at a decent quant, need critique/feedback

I am currently running a 4x 5060 ti P2P rig (64 GB VRAM total) where each card is running at gen 3 with 4 pcie lanes per card.
My use case is inference only. During my benchmarking, the bottleneck for low concurrency inference tasks, such as a single user use case, was compute, not pcie bandwidth.

This gave me an idea. Since my cards are already running at gen 3 pcie, I could pick up 512 GB of DDR3 16 GB modules and a gen 3 server that has 16 dedicated pcie lanes to the x16 slot and supports 4x4 bifurcation. That might be the most economically viable setup for glm2 at a decent quant, without the 5 tokens per second that you get with unified memory clusters.

For example, the Supermicro X9DRi-F / X9DR3-F supports up to 16 DIMM slots and would support 512 GB of RAM.

512 gb of ddr3 server ram is 500 dollars roughly.

You can get a 5060 ti 16gb model for 425 usd if you hunt for a deal.
So 1700 in GPU costs plus 500 in ram cost plus whatever the mobo and cpu costs.
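The budget above can be tallied in a few lines. This is only a sketch using the prices quoted in this post. The motherboard and CPU price is left as a parameter because no figure is given for it, and `total_cost` is a name made up for illustration.

```python
# Prices and sizes quoted in the post (USD, GB).
GPU_PRICE_USD = 425
GPU_COUNT = 4
VRAM_PER_GPU_GB = 16
RAM_COST_USD = 500
RAM_GB = 512

gpu_cost = GPU_PRICE_USD * GPU_COUNT          # 4 cards at the deal price
total_vram_gb = VRAM_PER_GPU_GB * GPU_COUNT   # VRAM across all cards


def total_cost(mobo_cpu_usd: float) -> float:
    """Build cost given an assumed motherboard + CPU price (not stated in the post)."""
    return gpu_cost + RAM_COST_USD + mobo_cpu_usd


print(f"GPU cost: ${gpu_cost}, total VRAM: {total_vram_gb} GB, system RAM: {RAM_GB} GB")
print(f"Known costs before mobo/CPU: ${total_cost(0):.0f}")
```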

And with those gpus you would be able to run Qwen/Qwen3.6-27B-FP8 with bf16 kv cache at max context 262k at 72 tokens per second entirely in vram, as I mentioned in my previous post.

Am I missing something or would this be viable for running glm2?

reddit.com

u/joorklee — 13 days ago

▲ 80 r/LocalLLaMA

$1800 (in GPU cost) running with P2P: Qwen/Qwen3.6-27b-FP8 with 262K context and BF16 KV cache at 55 tok/s

Hey peeps, wanted to share what is possible for folks with an inference only single user use case with 1700 in GPU cost.

Setup: 4x 5060 ti (16GB) with P2P

If you are in the US and you keep an eye on facebook marketplace and places like slickdeals you can find some 5060 ti 16 GB models for 425 to 475 used.

A giant caveat is that this type of configuration is only viable if you’re strictly interested in inference.

The VLLM Command Used:

export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export NCCL_P2P_DISABLE=0
export NCCL_CUMEM_ENABLE=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True
# dropped: VLLM_USE_FLASHINFER_MOE_FP8 (dense model), VLLM_TEST_FORCE_FP8_MARLIN (test native FP8 first)

vllm serve /data/models/Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 --port 8080 \
  --tensor-parallel-size 4 \
  --performance-mode interactivity \
  --trust-remote-code \
  --language-model-only \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-model-len 262144 \
  --kv-cache-dtype bfloat16 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}' \
  --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
  --async-scheduling \
  --attention-backend flashinfer \
  --enable-prefix-caching
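Once the server is up, a quick smoke test against its OpenAI-compatible completions endpoint could look like the sketch below. The host, port, and model path are taken from the command above. The prompt, the `max_tokens` value, and the function names are assumptions made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"
MODEL = "/data/models/Qwen/Qwen3.6-27B-FP8"  # same path passed to `vllm serve`


def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Example, only meaningful while the server is running:
# print(complete("Explain tensor parallelism in one sentence."))
```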

Benchmark Command:
vllm bench serve --backend vllm --base-url http://localhost:8080 --endpoint /v1/completions --model /data/models/Qwen/Qwen3.6-27B-FP8 --dataset-name random --random-input-len 4096 --random-output-len 1024 --num-prompts 40 --max-concurrency 1 --num-warmups 5 --ignore-eos --seed 1234 --percentile-metrics ttft,tpot,itl,e2el --save-result --result-filename qwen36_c1_4k.json

============ Serving Benchmark Result ============
Successful requests:                     40        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  735.75    
Total input tokens:                      163840    
Total generated tokens:                  40960     
Request throughput (req/s):              0.05      
Output token throughput (tok/s):         55.67     
Peak output token throughput (tok/s):    25.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          278.36    
---------------Time to First Token----------------
Mean TTFT (ms):                          4226.91   
Median TTFT (ms):                        4315.47   
P99 TTFT (ms):                           4320.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.85     
Median TPOT (ms):                        13.44     
P99 TPOT (ms):                           25.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.91     
Median ITL (ms):                         40.84     
P99 ITL (ms):                            41.59     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18393.49  
Median E2EL (ms):                        17991.18  
P99 E2EL (ms):                           30508.70  
---------------Speculative Decoding---------------
Acceptance rate (%):                     65.25     
Acceptance length:                       2.96      
Drafts:                                  13853     
Draft tokens:                            41559     
Accepted tokens:                         27116     
Per-position acceptance (%):
  Position 0:                            78.29     
  Position 1:                            64.14     
  Position 2:                            53.31     
==================================================

note: I forgot I had --max-num-seqs at 4 but I benchmarked with 1 concurrency.
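As a sanity check, the headline numbers in the result table can be re-derived from the raw counts it reports. This is a sketch using only values copied from the output above. The formula for acceptance length (one token from the target model per draft, plus the accepted draft tokens) is my reading of how the figure is computed, and it does reproduce the reported value.

```python
# Raw counts copied from the benchmark output.
duration_s = 735.75
requests = 40
input_tokens = 163840
output_tokens = 40960
drafts = 13853
draft_tokens = 41559
accepted_tokens = 27116
per_position_pct = [78.29, 64.14, 53.31]

output_tok_s = output_tokens / duration_s
total_tok_s = (input_tokens + output_tokens) / duration_s
req_s = requests / duration_s
acceptance_rate_pct = 100 * accepted_tokens / draft_tokens
# Assumed definition: 1 verified token per draft step plus accepted draft tokens.
acceptance_length = 1 + accepted_tokens / drafts
mean_position_pct = sum(per_position_pct) / len(per_position_pct)

print(f"output tok/s:        {output_tok_s:.2f}")
print(f"total tok/s:         {total_tok_s:.2f}")
print(f"req/s:               {req_s:.2f}")
print(f"acceptance rate (%): {acceptance_rate_pct:.2f}")
print(f"acceptance length:   {acceptance_length:.2f}")
print(f"mean per-position:   {mean_position_pct:.2f}")
```

The derived throughput, acceptance rate, and acceptance length all match the reported lines, and 13853 drafts times 3 speculative tokens equals the 41559 draft tokens shown.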

reddit.com

u/joorklee — 16 days ago

▲ 9 r/CUDA+1 crossposts

P2P benchmarks on 2x 5060 ti (16GB each) - P2P Benchmark Project

Spent the last couple of days creating automation to be able to systematically benchmark and measure performance gains on P2P soft-locked nvidia cards using the patched kernel modules.

joorklee.github.io

u/joorklee — 18 days ago

▲ 1.2k r/HomeLabPorn+1 crossposts

40u rack acquired

u/joorklee — 1 month ago