r/LocalAIServers

I asked Codex to optimize DeepSeek V4 Flash 8-bit MLX on oMLX. Got ~1.6x prefill and ~3x decode speedup.

Follow-up to my earlier posts:

Should I sell my Mac Studio? https://www.reddit.com/r/MacStudio/s/GK7QP8Lg87
Kimi benchmark: https://www.reddit.com/r/LocalLLaMA/s/ujBsYLYmpd

Short version: my Mac Studio was sitting mostly idle, and from those Reddit threads I learned about DS4 and then oMLX. DS4 got me running DeepSeek V4 locally, but I wanted the 8-bit MLX version because I worry about accuracy loss in 4-bit variants.

So I tried mlx-community/deepseek-ai-DeepSeek-V4-Flash-8bit, the 302GB 8-bit affine MLX model, and asked Codex to optimize oMLX for the model.

I am not an oMLX/Metal kernel expert, so I am sharing this partly to sanity-check the work. Codex claims the changes should not reduce accuracy, and my Hermes/tool-calling runs look fine so far, but I would appreciate review from people who understand this stack better.

Base oMLX work

DeepSeek V4 support/tool calling came from oMLX DeepSeek V4 DSML/template/parser work, especially:

https://github.com/jundot/omlx/pull/2048

There were also follow-up fixes for DSML/tool-call stopping, parser-side stop behavior, prompt/prefix-cache determinism, shared expert SwiGLU clamp behavior, and native DeepSeek V4 2-bit/3-bit Metal paths.

The work below was separate: it focused on making the 8-bit affine model faster while keeping the same 302GB model format.

What changed for 8-bit affine

The issue was that DeepSeek V4 Flash 8-bit affine MoE was falling back to slower generic affine paths instead of using native DeepSeek MoE Metal kernels.

Codex changed:

Enabled native DeepSeek affine MoE kernels for bits=8, group_size=64
Added 8-bit affine Metal kernel instantiations
Replaced some generic route sorting with bucket/counting route paths
Set route_sort_min_routes=1 so the native route path is used earlier
Added route-indexed decode kernels to avoid route sort/materialization overhead during decode
Tuned affine8 dequant/load with a uint32 load specialization
Verified DeepSeek V4 parser/template and OpenAI-style tool calling still worked

Current config:

affine8_variant = 7
route_sort = bucket
route_sort_min_routes = 1
affine8_route_decode = 1

Results

Metric	Before patches	After patches	Notes
Prefill	~300-321 tok/s	~533 tok/s	12K prompt, salted uncached
Prefill	~300-321 tok/s	~528-530 tok/s	30K prompt, salted uncached
Decode	~7.31 tok/s	~20-22 tok/s	Controlled benchmark
Decode	~7.31 tok/s	~19.2-20.7 tok/s	Real Hermes runs, ~80K-120K context

Recent real Hermes/oMLX runs:

Prompt size	Output	Result	Notes
79K	1,175 tokens	19.8 tok/s	Long-context run
80K	Tool call	20.7 tok/s	Tool-calling run
95K	443 tokens	19.5 tok/s	Long-context run
117K	1,680 tokens	19.2 tok/s	Tool-calling run
119K	431 tokens	19.3 tok/s	Tool-calling run
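
As a quick check on the headline "~1.6x prefill and ~3x decode", the speedup ranges can be recomputed from the figures in the two tables above. This is a minimal sketch; the function name is just for illustration, and the decode range uses the lowest and highest "after" figures reported in this post.

```python
def speedup_range(before, after):
    """Return (min, max) speedup from (low, high) before/after tok/s ranges."""
    b_lo, b_hi = before
    a_lo, a_hi = after
    # Worst case: slowest "after" vs fastest "before"; best case is the reverse.
    return a_lo / b_hi, a_hi / b_lo

prefill = speedup_range((300, 321), (528, 533))
decode = speedup_range((7.31, 7.31), (19.2, 22))

print(f"prefill: {prefill[0]:.2f}x to {prefill[1]:.2f}x")
print(f"decode:  {decode[0]:.2f}x to {decode[1]:.2f}x")
```

So the prefill gain is a bit above 1.6x in every pairing, and decode lands between roughly 2.6x and 3.0x depending on which run is compared.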

Accuracy / correctness question

Codex says there should be no meaningful accuracy loss because:

weights and quantization format were unchanged
router/top-k/expert selection was not intentionally changed
no experts were dropped
optimized affine8 MoE outputs were compared against gather/qmm/native reference paths
focused affine8 tests, DeepSeek V4 parser/template tests, and live tool-call smoke tests passed

I understand tiny numerical drift may still happen because kernel/load/order changed, but Codex claims this is not the same as a model-level accuracy drop.

Is that reasoning sound? What evals/tests would you run to verify no meaningful accuracy or tool-calling regression?
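
One way to put a number on "tiny numerical drift" is to compare next-token logits from the reference path and the optimized path on the same prompts. The sketch below is only an illustration: how to dump logits from oMLX is not shown, the function names are hypothetical, and the sample values are toy numbers, not measurements.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compare_logits(ref, opt):
    """Compare one position's logits from a reference and an optimized kernel path."""
    p, q = softmax(ref), softmax(opt)
    return {
        "max_abs_diff": max(abs(a - b) for a, b in zip(ref, opt)),
        "top1_match": ref.index(max(ref)) == opt.index(max(opt)),
        # KL(p || q): how far the optimized distribution moves from the reference.
        "kl": sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0),
    }

# Toy logits standing in for real model output (not measured data).
ref = [2.0, 1.0, 0.5, -1.0]
opt = [2.001, 0.999, 0.5, -1.0]
report = compare_logits(ref, opt)
print(report)
```

Aggregating these three numbers over many prompts and positions, alongside a task-level check such as the share of tool calls that parse correctly, would separate kernel-order noise from a real regression.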

Next optimization direction?

Codex suggested these possible next directions:

More affine8 dequant/load tuning: better vectorized loads, memory coalescing, fewer scale/bias reloads, less threadgroup-memory pressure
Fewer kernel launches / less intermediate movement in routing and MoE buffers
More fused MoE work, although this seems harder and riskier
More single-token decode profiling, since real runs are still around ~20 tok/s
Better instrumentation around routing, bucket/sort, block-plan build, native kernel time, down-projection, affine8 dequant/load, and decode costs

Questions:

Is affine8 dequant/load tuning the right next direction for prefill?
Has anyone done similar DeepSeek MoE route-indexed/fused/affine8 work in MLX, oMLX, llama.cpp, vLLM, or another runtime?
Is ~530 tok/s prefill and ~20 tok/s decode on a Mac Studio M3 Ultra 512GB close to the ceiling for this 302GB 8-bit model, or is there obvious headroom?
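
For the ceiling question, a rough memory-bandwidth bound may help frame the answers. Single-token decode is usually limited by how many bytes must be read per token, so an upper bound is bandwidth divided by bytes per token. The sketch below uses Apple's published 819 GB/s peak for the M3 Ultra; the per-token byte counts are placeholders, since the post does not state the active weight size for this model.

```python
M3_ULTRA_BW_GBPS = 819.0  # Apple's published peak memory bandwidth for M3 Ultra

def decode_ceiling_tok_s(gb_per_token, bandwidth_gbps=M3_ULTRA_BW_GBPS):
    """Upper bound on decode tok/s if each token reads this many GB from memory."""
    return bandwidth_gbps / gb_per_token

def implied_gb_per_token(tok_s, bandwidth_gbps=M3_ULTRA_BW_GBPS):
    """GB of traffic per token that would saturate peak bandwidth at this speed."""
    return bandwidth_gbps / tok_s

# Measured decode from this post: ~20 tok/s.
print(f"{implied_gb_per_token(20):.2f} GB/token would saturate the bus at 20 tok/s")

# PLACEHOLDER per-token traffic (active weights + KV cache), not real model figures.
for gb in (10, 20, 40):
    print(f"{gb} GB/token -> ceiling {decode_ceiling_tok_s(gb):.1f} tok/s")
```

If the real per-token traffic is well below about 41 GB, the ~20 tok/s result is not bandwidth-bound yet and there is headroom; if it is close to that, decode is already near the limit.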

Again, I am mostly asking the community to verify whether the result and next direction make sense.

u/No_Run8812 — 13 hours ago

▲ 7 r/LocalAIServers

Anyone tested these?

https://www.ebay.com/itm/267162620511

u/Any_Praline_8178 — 21 hours ago

▲ 174 r/LocalAIServers+20 crossposts

I would like to share my latest open-source local LLM inference tool, implemented in C#. It supports models such as Gemma4 and Qwen3.6, with multimodal input (image, vision, audio), reasoning, and function/tool calling. It runs on Windows, macOS, and Linux and takes full advantage of the GPU. The API is fully compatible with the OpenAI and Ollama interfaces.

I would really appreciate it if you could try it and give me some feedback. If you like it, a star would mean a lot. Thank you very much!

u/fuzhongkai — 1 day ago

▲ 41 r/LocalAIServers

10x RTX 6000 PRO

Hi Guys,

I need a bit of advice: we're planning to procure a server with 10x RTX 6000 PRO for local inference tasks.

I've configured a machine; the config is here:

https://gpumachines.com/shared/asrock-20rack-204u10g-gnr2-2frf-2b-10x-6000-gpu-server-618320

Essentially it's 10x RTX 6000 Pro, but also with 2TB of RAM. I heard a rule of thumb of at least 2GB of RAM per 1GB of GPU VRAM. Now the question is: do I need that much RAM? We all know this eats up a lot of the budget, and I'd love to optimise the cost.

What do you think, guys? What's your experience? Am I right in saying that this rule of thumb is not entirely valid, since it all depends on the workload?
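
To make the rule of thumb concrete, here is the arithmetic as a small sketch. It assumes 96 GB of VRAM per card (the RTX PRO 6000 Blackwell figure); adjust that if the quoted SKU differs. The function name is just for illustration.

```python
def ram_for_rule(num_gpus, vram_gb_per_gpu, ram_per_vram=2.0):
    """Return (total VRAM in GB, system RAM in GB implied by the ratio)."""
    total_vram = num_gpus * vram_gb_per_gpu
    return total_vram, total_vram * ram_per_vram

# Assumption: 96 GB per card; the 2.0 ratio is the rule of thumb from this post.
vram, ram_2x = ram_for_rule(10, 96)
_, ram_1x = ram_for_rule(10, 96, ram_per_vram=1.0)

print(f"VRAM: {vram} GB, RAM at 2x: {ram_2x:.0f} GB, RAM at 1x: {ram_1x:.0f} GB")
```

Under that assumption the 2x rule gives 1,920 GB, which is where the 2TB figure comes from, while a 1x ratio would need about half of that.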

u/kumits-u — 3 days ago

▲ 35 r/LocalAIServers+11 crossposts

Multi-model consensus debate via the filesystem. LLMs propose, peer-review, rebut, vote and synthesize a group-confirmed answer. CLI + MCP.

github.com

u/raiyanyahya — 3 days ago

▲ 2 r/LocalAIServers+1 crossposts

Looking for Free/Low-Cost Server Resources to Host My Own LLM and Files

Hi everyone,

I'm a student and AI/ML enthusiast working on personal projects. I'm looking for ways to host my own local/open-source LLM (such as Llama, Mistral, or similar models) along with some project files and datasets.

My budget is very limited, so I'm interested in:

Free cloud credits or sponsorship programs

Student programs that provide compute resources

Community grants for open-source or educational projects

Free VPS, GPU servers, or hosting platforms

Any organizations or individuals willing to support student AI projects

My use case is mainly learning, experimentation, and building portfolio projects—not commercial usage.

If you've received free credits from cloud providers, know of any programs I should apply to, or have spare resources you'd be willing to share, I'd greatly appreciate your advice.

Thanks in advance!


u/yami_8809 — 2 days ago

▲ 5 r/LocalAIServers

Refurbished 64GB VRAM AI Server for Local AI: 4x NVIDIA V100/P100, AMD MI25

https://www.youtube.com/watch?v=zp8j4vO-wz0

u/Any_Praline_8178 — 3 days ago

▲ 26 r/LocalAIServers+3 crossposts

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode

I came across this interesting article (https://blog.exolabs.net/nvidia-dgx-spark/). I don't have a DGX Spark, but it made me curious: would this kind of architecture speed up my setup for LLMs?

A Mac can host large models, but the prefill speed sucks, so I tested this on my setup with Kimi K2.7.

Short answer: it helps prefill, but it does not meaningfully help decode on this setup. RPC is still mostly a capacity tool unless the network/interconnect and split mode are much better.

Setup

Host: Mac Studio M3 Ultra, 512GB unified memory, Metal
Worker: Linux box with NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96GB VRAM, CUDA
Network: direct Ethernet between Mac and Linux box, but only 1GbE in practice
Measured RPC transfer rate: about 112-113 MiB/s
Model: unsloth/Kimi-K2.7-Code-GGUF, UD-Q3_K_XL
Model size on disk: about 432GB across 11 GGUF shards
Runtime: llama.cpp server version 9827 (4c6e0ff3a), Unsloth build

Controlled test

Same synthetic prompt for both runs:

Prompt tokens: 7120
Generated tokens: 64
temperature: 0
ignore_eos: true
Prompt cache disabled
Prefill gain: about 14.8%
Decode gain: about 4.2%
Total request time improvement: about 12.3%

Split trend

The generation columns show "-" where I only ran prefill. The controlled generation rows used the exact same 7120-token synthetic prompt; the earlier split-sweep rows were around 7.1K prompt tokens but not always the exact same prompt.

Run	RTX share	Split	Prompt sec	Prefill tok/s	Decode	Total	RTX VRAM
Mac	0%	-	53.58	132.88	17.55 tok/s	57.23s	none
Mac + RTX	15%	15,85	51.48	138.3	-	-	69.4GB
Mac + RTX	19%	19,81	50.22	141.77	-	-	84.1GB
Mac + RTX	20%	20,80	49.54	143.72	-	-	93.2GB
Mac + RTX	20%	20,80	46.69	152.49	18.28 tok/s	50.19s	93.3GB
Mac + RTX	21%	21,79	-	failed	-	-	failed
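
The gains quoted in the controlled test follow from the two rows that include generation (Mac-only and the second 20,80 run). A minimal sketch of that arithmetic, with illustrative function names:

```python
def pct_gain(baseline, new):
    """Percentage increase of a rate (tok/s) over the baseline."""
    return (new - baseline) / baseline * 100

def pct_time_saved(baseline_s, new_s):
    """Percentage reduction in wall-clock time versus the baseline."""
    return (baseline_s - new_s) / baseline_s * 100

prefill_gain = pct_gain(132.88, 152.49)      # prefill tok/s
decode_gain = pct_gain(17.55, 18.28)         # decode tok/s
total_saved = pct_time_saved(57.23, 50.19)   # total request seconds

print(f"prefill +{prefill_gain:.1f}%, decode +{decode_gain:.1f}%, total -{total_saved:.1f}%")
```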

20,80 was the practical max on this card with 128K context.

21,79 failed even at 8K context:

RPC/network trace

For the 7120-token prefill-only 20,80 run:

Mac -> RTX: 251.59 MiB, 2.03s
RTX -> Mac: 194.69 MiB, 1.49s
Total RPC traffic: 446.28 MiB, 3.52s
RTX graph compute: 1.34s

The RPC traffic is mostly hidden activations, not text tokens. For prefill it is chunked/batched, so the network cost is noticeable but not fatal. For decode, the boundary is crossed every generated token, which is why I expected decode to suffer more. In this test decode was roughly the same as Mac-only: 18.28 tok/s vs 17.55 tok/s.

Learnings

I could knock off a few more seconds by using a better cable, but I'm not sure it's worth it.
It is useful for fitting models/splits that otherwise do not fit one device.

Question: As I increased the share of the model on the RTX, prefill time kept decreasing. Will this trend continue if I add one more GPU? People with multi-GPU setups, what's your take on this?

u/No_Run8812 — 4 days ago

▲ 3 r/LocalAIServers+1 crossposts

How can I make my AI project generate more natural responses and reduce hallucinations?

Hi everyone. I’m building my own AI assistant project called NERO. My goal is to make it feel more natural, reliable, and useful — not just a command-based chatbot.

Right now, I’m struggling with two main problems:

The responses still feel robotic or scripted sometimes.

It sometimes hallucinates or gives answers that are not based on my project files.

My current idea is to use:

A better intent/router system

RAG or project file retrieval

Memory for conversation context

Guardrails so it does not invent project facts

Testing with many normal questions and follow-up questions

For people who have built AI assistants, RAG systems, or local LLM projects:

What architecture or techniques actually helped you make responses more natural and less hallucinated?

Should I focus more on better prompts, better retrieval, better routing, evaluation tests, or something else?

Any advice, examples, or resources would really help. Thank you.

u/Mr_Kim__ — 4 days ago

▲ 3 r/LocalAIServers+1 crossposts

I turned a Linux box into a fully-offline, agent-native OS with the whole local-AI stack wired together out of the box. Roast the architecture.

Disclosure up front: I'm the dev, this is my project, and there's a paid version — I'll mention it at the end so it's not a stealth ad. I'm really here for this community's brutal technical feedback, because you'll find the holes faster than anyone.

What it is: a Debian-based OS built around local AI as a first-class citizen instead of a browser tab. Everything runs on your own hardware, fully offline — no cloud, no API keys, no token meter.

Under the hood (no magic — it's open models orchestrated into an OS):

LLMs via Ollama/llama.cpp (Qwen2.5 family + others), auto-tiered to your VRAM
Image: SDXL / Z-Image-Turbo · Video: Wan 2.2 i2v · Voice: Chatterbox TTS + Whisper STT — all local
An agent layer ("Omega") that can actually operate the machine: plan→act with a grounded verify step and a tamper-evident action log
Ships with a curated set of Apache/MIT-licensed models baked into the image, so it generates on first boot with zero downloads and no internet

The point isn't a new frontier model — it's that the whole sovereign stack is integrated, offline, and yours, instead of you gluing 8 repos together.

Honest limits: it's beta, and the local models are smaller than frontier cloud (I don't claim Midjourney/GPT parity — the trade is sovereignty + zero per-use cost, not raw quality).

Genuinely want to know: what would you want in a "local-AI-first OS" that nothing does well yet — and where do you think this approach breaks? (Paid founding beta link in a comment to respect the sub; the feedback is why I'm posting.)


u/New_Canary_9806 — 5 days ago

▲ 6 r/LocalAIServers+1 crossposts

How to build an AI on my local computer


u/Defiant-Standard-547 — 4 days ago

▲ 10 r/LocalAIServers

Catch Me If You Can: MI50/GFX906 -> 119.5 TPS MoE :: 70.2 TPS Dense

Catch me if you can :: Public Benchmark Challenge

New vNext release for LocalAIServers:

https://github.com/joe2gaan/localaiservers/releases/tag/vnext-gfx906-rocm72-gguf-hf-repro

As of 2026-07-01, we have not found a faster public, reproducible result for this exact stack:

Qwen3.6 35B-A3B F16/FP16 MoE or Qwen3.6 27B F16/FP16 Dense

ROCm
vLLM
MI50/GFX906
128K context
single-request decode only

Numbers are discussion. Reproducible packages are leaderboard entries.

Important: this leaderboard is single-request decode only.

No multi-request batching.
No concurrency throughput.
No aggregate multi-user TPS.
No MTP/speculative/draft-model decoding.
No screenshots-only submissions.

Benchmark ladder:

8 warmups -> c1_128 strict -> c1_2000 -> c1_10000

c1 means concurrency 1. The leaderboard metric is strict backend TPS from single-request decode.

Current targets:

Class	Strict TPS	c1_2000	c1_10000
GGUF F16 35B-A3B MoE TP4	119.33–119.52	120.46–120.57	113.26–113.37
GGUF F16 27B Dense TP8	69.85–69.91	70.76–70.96	66.32–66.44
HF FP16 35B-A3B MoE TP4	114.41–115.11	115.69–115.93	108.92–109.10
HF FP16 35B-A3B MoE TP8	114.70–115.04	115.53–115.55	108.67–108.81
HF FP16 27B Dense TP8	70.17	71.32	66.82

Main leaderboard rules:

MI50/GFX906 only
ROCm + vLLM only
HF FP16 or GGUF F16 only
single-request decode only
concurrency 1 only
backend TPS only
128K context required / MAX_MODEL_LEN=131072
same benchmark ladder required
3-run median required
c1_10000 run required
no Q4/Q5/Q6/Q8, FP8, AWQ, GPTQ, NVFP4, etc.
no MTP, EAGLE, DFlash, draft models, speculative decoding, or multi-token prediction
no aggregate throughput from multiple requests, multiple clients, or concurrent batches
screenshots alone do not count
public reproducible package required

TP4 and TP8 MoE are tracked as separate leaderboard lanes. The overall MoE crown goes to the fastest valid strict backend TPS across eligible MoE lanes.

Open lane:

GGUF F16 35B-A3B MoE TP8 currently has no vNext incumbent. Bring a public repro package and it can be added as a new leaderboard lane.

Verification package requirements:

To take a leaderboard slot, submit a public GitHub repo, tagged release, or archive containing:

README.md or REPRO.md with exact reproduction steps
benchmark commands
generated vllm serve artifacts
raw benchmark logs for all runs
model source, revision, and/or SHA256 hashes
GGUF manifests and SHA256 checks, if using GGUF
patch files or patch bundle hashes, if using patches
Docker image name and digest
ROCm version
vLLM version/commit
GPU count and TP size
dtype and max model length
BAR/P2P status
proof that the run is single-request decode / concurrency 1
host notes needed to reproduce the run
script or command sequence that stages inputs and runs the benchmark

The package does not need to redistribute model weights if licensing prevents that, but it must provide exact public fetch instructions, revisions, manifests, and hashes so another person can rebuild the same environment and verify the result.

To dethrone a target, submit a reproducible package with a 3-run median at least 3% higher than the current strict TPS target.

Minimum 3-run median required:

Class	Current best strict TPS	Required to dethrone
GGUF F16 35B-A3B MoE TP4	119.52	123.11+
HF FP16 35B-A3B MoE TP4	115.11	118.57+
HF FP16 35B-A3B MoE TP8	115.04	118.50+
GGUF F16 27B Dense TP8	69.91	72.02+
HF FP16 27B Dense TP8	70.17	72.28+
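
The dethrone rule above (3-run median at least 3% over the current strict TPS target) can be checked with a few lines. This is a sketch of the rule as stated in this post, not part of the release tooling; the function name and the sample run values are hypothetical.

```python
from statistics import median

def dethrones(run_tps, current_best, margin=0.03):
    """True if the 3-run median is at least `margin` above the current strict TPS."""
    if len(run_tps) != 3:
        raise ValueError("the rules call for a 3-run median")
    return median(run_tps) >= current_best * (1 + margin)

# Hypothetical challenger runs against the GGUF F16 35B-A3B MoE TP4 target (119.52).
strong = dethrones([123.5, 123.2, 122.9], 119.52)  # median 123.2
weak = dethrones([121.0, 120.5, 123.9], 119.52)    # median 121.0

print(strong, weak)
```

Using the median means one fast outlier run, as in the second example, does not carry a submission over the bar.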

Reference hardware used for vNext validation

The vNext validation evidence was recorded on two local validation lanes. Host labels are sanitized evidence labels only; they are not public access endpoints and are not required reproduction targets.

Notes:

The release profiles require full-BAR/P2P-on platform state. The live validation query confirmed full 32 GiB BAR0 visibility on all 8 GPUs on both validation lanes.
ROCm-SMI product-name strings may label some devices inconsistently, but the memory-total query and sysfs VRAM totals showed 34342961152 bytes visible per GPU on all 8 GPUs.
The InfiniBand/Ethernet devices are validation-site infrastructure and are not public reproduction requirements.
Users should choose their own local SSD/NVMe-backed LOCAL_MODEL_ROOT, LOCAL_HF_CACHE, and LOCAL_RUNTIME_ROOT values for reproduction.
Per-card unique IDs, GUIDs, MAC addresses, hostnames, private addresses, validation-local mount paths, and management endpoints are intentionally omitted.

Outlaw class is welcome too:

quantized GGUF, MTP, llama.cpp, Vulkan, FP8, NVIDIA, R9700, high-concurrency throughput, weird forks, anything-goes.

Outlaw results do not dethrone the exact-stack leaderboard, but they are still useful for comparison.

If we missed a faster public MI50/GFX906 + ROCm + vLLM + FP16/F16 Qwen3.6 single-request decode result, link it.

If you want to beat the leaderboard, bring a repro package.

u/Any_Praline_8178 — 5 days ago

▲ 5 r/LocalAIServers

Need advice on what hardware to use under $4k

Hey everyone, I'm looking to buy or build a local AI machine for various open-source models: coding models such as the Qwen 3.6 and Gemma 4 families, image-editing models, text-to-speech models, text-to-video models, and possibly some light training. I was looking at the Asus Ascent GX10 for 360,000 INR (around $3700) after the 18% GST discounts, but I'm truly unsure of what I need.

Also, power consumption and heat output are a concern for me. I don't want the room turning into a heat chamber, as I see with other multi-GPU builds.

I would be going for an M3 Ultra 96GB or M4 Max Mac Studio, but due to the recent price hikes and stock backorders, scalpers have been quoting me around 700,000 INR (roughly $7500) for a 96GB M3 Ultra.

I'd appreciate any suggestions. Please do let me know.


u/Late-Brother7489 — 5 days ago

▲ 80 r/LocalAIServers

12VHPWR danger!

So I bought a thermal camera a while back and decided to check out my server with it, and boy am I glad I did: one of the 12VHPWR connectors is only supplying power on two conductors. Each of the two wires is carrying 18 amps at my power cap of 450W. I've been running it like this for a while with no visible connector damage yet, but that doesn't mean it's OK. Here are a few photos of the cables.
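
For scale, here is the per-wire current at the 450W cap with two live conductors versus all six 12V conductors of a 12VHPWR connector sharing the load. This is a rough sketch that assumes a nominal 12.0 V rail and that all card power flows through the connector (ignoring whatever the PCIe slot supplies), which is why it lands slightly above the measured 18 A.

```python
def amps_per_conductor(watts, live_conductors, volts=12.0):
    """Current carried by each conductor if the load is shared evenly."""
    return watts / volts / live_conductors

two_wires = amps_per_conductor(450, 2)  # the faulty connector described here
six_wires = amps_per_conductor(450, 6)  # all six 12V conductors sharing the load

print(f"2 conductors: {two_wires:.2f} A each, 6 conductors: {six_wires:.2f} A each")
```

That is three times the current each wire would carry on a healthy connector.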

u/TwistedDiesel53 — 6 days ago

▲ 120 r/LocalAIServers+2 crossposts

I found every way to rent an NVIDIA DGX Spark (GB10) so you don't have to — cloud, hourly, and physical

Hello locals,

Kept seeing "where do I actually rent a DGX Spark" questions with no good answer, so I went and catalogued every option I could find. Posting it here in case it saves someone the search.

Remote access (cloud — you rent the GPU, connect over SSH)

Enverge — from $0.65/hr, 128GB, SSH + Docker, hourly pay-as-you-go, no commitment
gb10.studio - mostly for inference
VFX Now (US) — rudimentary cloud access; also offers physical
Primcast — dedicated/monthly hosting rather than hourly

Physical rental (the box ships to you — per week, UK)

HardSoft
Scan — per-week, includes a clunky cloud-access option too

Quick takeaways

For a weekend experiment, hourly cloud is the cheapest by a mile.
If you need it physically on your desk (data residency, air-gapped, privacy), the UK per-week physical rentals are the only real route right now.
Buying is ~$3–4k; rough breakeven vs $0.65/hr is ~5,000+ hours, so unless you're running it near-constantly, renting is the call.
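
The breakeven figure comes from dividing the purchase price by the hourly rate. A minimal sketch of that arithmetic, ignoring electricity, resale value, and any price changes:

```python
def breakeven_hours(purchase_usd, hourly_usd):
    """Hours of rental that cost as much as buying the box outright."""
    return purchase_usd / hourly_usd

low = breakeven_hours(3000, 0.65)
high = breakeven_hours(4000, 0.65)

print(f"breakeven: {low:.0f} to {high:.0f} hours")
print(f"at 8 h/day: {low / 8:.0f} to {high / 8:.0f} days")
```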

What did I miss? Will edit the list with anything good in the comments.

Anyone DIY-ing this?

u/big-in-jap — 7 days ago

▲ 14 r/LocalAIServers

What is the "best" selfhosted model in July 2026 for general use and coding with this hardware?

I'm looking for some model recommendations that fit well with my current setup:

Intel Core Ultra 7 155H
64GB 7466MHz LPDDR5 (please don't rob me)
Nvidia RTX 5060 Ti 16GB

I mainly plan to use it for daily use cases like message sentiment analysis, rewriting emails at different levels of technical depth, surface-level research, and related IT/hardware topics, but also as a coding assistant for PowerShell, .SCAD 3D files, Dockerfiles/Compose, and sometimes simple vibecoded tools I use in my homelab.

I would prefer a streamlined workflow where I don't need to swap between more than 2-3 models depending on the task. I just want a few solid "daily drivers." I'm used to Gemini Pro, so if it takes slightly longer to answer but the quality is way better, that's a tradeoff I'm willing to make.

I’ve dabbled with Ollama + Open WebUI before, but I'm completely open to other backend/frontend suggestions if there's a better way to utilize my hardware.

Thanks in advance for any tips!


u/Flying-T — 6 days ago

▲ 7 r/LocalAIServers+2 crossposts

AMD Radeon PRO V620 on Ubuntu bare-metal: PCI BAR / SR-IOV resource issue with multiple GPUs

TL;DR

What did you do to get your V620 GPUs to work?
How did you get over the cards trying to use ridiculous BARs of 384 GB per card for their SR-IOV/VF function?

Disclaimer: I used AI to help me gather all the data and present it in this post cleanly.

I wanted to share an issue I ran into while trying to use AMD Radeon PRO V620 GPUs on Ubuntu bare-metal for AI workloads, and I’m curious if anyone else has seen the same thing.

Setup

Ubuntu 24.04.4
ROCm 7.2.3
Mac Pro 2019
Cubix Xpander PCIe expansion chassis
AMD Radeon PRO V620 GPUs
Bare-metal Linux only
No virtualization
No passthrough
No MxGPU use case

The goal was simple: use the V620s as normal ROCm GPUs for AI inference.

The problem

The V620s were visible to the system through PCIe, but they did not initialize as usable ROCm GPUs.

lspci showed the cards correctly as:

AMD/ATI Navi 21 [Radeon Pro V620] [1002:73a1]

But they only showed:

Kernel modules: amdgpu

not:

Kernel driver in use: amdgpu

rocm-smi either showed no V620s or only the unrelated internal GPUs, depending on the configuration.

Resource allocation looked broken

The sysfs resource files for the V620s were all zeroed out:

/sys/bus/pci/devices/0000:xx:00.0/resource

0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
...

The V620s also exposed SR-IOV capability even though I was not using virtualization:

sriov_totalvfs=12
sriov_numvfs=0

The SR-IOV capability block showed:

Initial VFs: 12
Total VFs: 12
Number of VFs: 0
VF Device ID: 73ae

The confusing part was that SR-IOV was not actually enabled:

IOVCtl: Enable-
Number of VFs: 0

dmesg errors

During PCI resource allocation, the kernel appeared to account for the possible VF BARs anyway.

The dmesg output had errors like:

BAR 0 [mem size 0x800000000 64bit pref]: can't assign; no space
VF BAR 0 [mem size 0x6000000000 64bit pref]: can't assign; no space
VF BAR 0 [mem size 0x6000000000 64bit pref]: failed to assign

After that, forcing a driver probe did not help. The V620 remained unbound, resources stayed zero, and amdgpu failed during initialization.
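
Decoding the hex sizes from the dmesg lines above shows where the 384 GB figure comes from. This is a small sketch of the arithmetic only; it reads nothing from the system.

```python
GIB = 1 << 30

def bar_gib(hex_size):
    """Convert a BAR size printed by the kernel (hex bytes) to GiB."""
    return int(hex_size, 16) / GIB

pf_bar = bar_gib("0x800000000")    # BAR 0 of the physical function
vf_bar = bar_gib("0x6000000000")   # VF BAR 0 aperture
per_vf = vf_bar / 12               # sriov_totalvfs=12

print(f"PF BAR 0: {pf_bar:.0f} GiB, VF BAR 0: {vf_bar:.0f} GiB, per VF: {per_vf:.0f} GiB")
```

The VF BAR request works out to 12 VFs at 32 GiB each, which is consistent with the kernel reserving address space for every possible VF even though sriov_numvfs is 0.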

Things I tested

While narrowing it down, I tested:

Removing other GPUs
Removing the Apple I/O card
Testing one Cubix cable / one side of the expander
Confirming no ReBAR resize service was active
Confirming sriov_numvfs=0
Setting sriov_drivers_autoprobe=0
Trying late amdgpu probing after boot
Testing boot arguments such as:

pci=realloc
iommu=pt
amdgpu.ras_enable=0

The pattern stayed the same: the cards were present on PCIe, but the V620 BAR resources failed before amdgpu could bind. From the logs, the issue looked related to the very large advertised SR-IOV VF BAR space.

Question

Has anyone else run into this with AMD Radeon PRO V620, especially in bare-metal Linux / ROCm use rather than virtualization?

I’m especially interested in hearing from anyone who has used:

V620 on Ubuntu bare-metal
Multiple V620s in one host
V620 behind PCIe switches or expansion chassis
Cubix or external PCIe expansion systems
ROCm with V620
SR-IOV-capable AMD GPUs where SR-IOV is not actually being used

Did your system allocate the PF BARs normally, or did the VF BARs cause PCI resource allocation problems?

What did you do to overcome this problem?

u/Faisal_Biyari — 6 days ago

▲ 0 r/LocalAIServers

native Rust cognitive engine that routes language through a biologically faithful neural substrate

GoldWorm 🐛✨ — 302-Neuron Dual-Stream Cognitive Engine

What Is GoldWorm?

GoldWorm is a native Rust cognitive engine that routes language through a biologically faithful neural substrate — the 302-neuron connectome of Caenorhabditis elegans, the only organism whose entire nervous system has been experimentally mapped (White et al., 1986).

Unlike transformer-based LLMs that rely on billions of parameters and opaque attention mechanisms, GoldWorm operates on three transparent principles:

Biological Fidelity — Every synapse respects the C. elegans topology. No de novo synaptogenesis. No magic matrices.
Dual-Stream Processing — Action (sparse) and Learning (dense) are physically separated, preventing catastrophic forgetting during inference.
Zero-Trust Engineering — Every buffer is strictly bounded. Every path is panic-free. No unwrap() in production code.

Architecture Deep Dive

🧬 The 302-Neuron Connectome

GoldWorm's routing layer is not a generic neural network. It is a topologically accurate model of the C. elegans nervous system:

Neuron Index Range │ Role
───────────────────┼───────────────────────────────────
0   – 19           │ Pharyngeal sub-network (dense)
20  – 91           │ Sensory neurons (input)
92  – 168          │ Interneurons (integration)
99  – 102          │ Command hubs (AVAL/AVAR/AVBL/AVBR)
169 – 301          │ Motor neurons (output)

Connectivity Motifs:

Band synapses — ±1/±2/±3 neighbourhood ring connections
Pharyngeal wiring — Denser internal coupling for neurons 0–19
Sensory → Interneuron — Sparse feed-forward (20–91 → 92–168)
Command interneuron broadcast — Hubs 99–102 broadcast to full motor population 169–301
Interneuron → Motor — Sparse feed-forward projection

All synaptic weights are non-negative and clamped to [0, 1]. The structural blueprint is immutable — Hebbian plasticity only strengthens or weakens existing synapses, never creating new ones.

🌊 Dual-Stream Processing

The core innovation of GoldWorm is the physical separation of Action and Learning:

┌─────────────────────────────────────────────────────────┐
│  INPUT TOKEN  →  128-D Manifold Coordinate              │
│                         │                               │
│    ┌────────────────────┴────────────────────┐          │
│    ▼                                         ▼          │
│ ┌──────────────┐                     ┌──────────────┐   │
│ │  SPARSE      │                     │   DENSE      │   │
│ │  ACTION      │                     │   LEARNING   │   │
│ │  (Post-      │                     │   (Pre-      │   │
│ │   Entmax)    │                     │    Entmax)   │   │
│ │              │                     │              │   │
│ │ ~1-2 active  │                     │ >50% non-zero│   │
│ │ neurons      │                     │ gradient     │   │
│ │              │                     │ substrate    │   │
│ └──────────────┘                     └──────────────┘   │
│       │                                    │            │
│       │ Inference /                        │            │
│       │ Token Selection                    │            │
│       │                                    │            │
│       └────────────────────────────────────┘            │
│                         │                               │
│              Hebbian EchoReservoir                      │
│              (associative memory)                       │
└─────────────────────────────────────────────────────────┘

Why this matters:

Traditional neural networks use the same activation vector for both inference and gradient computation. When different words activate disjoint sets of neurons, the gradient collapses to zero — the network "forgets" what it just learned.
GoldWorm's Dual-Stream keeps the dense pre-entmax signal alive as a gradient substrate, while the sparse post-entmax signal drives token selection. The EchoReservoir learns associations between dense states, not sparse ones.

🧠 The EchoReservoir

A hippocampus-inspired ring buffer of recent pre-entmax states, coupled with a 302×302 Hebbian association matrix W_assoc.

When queried with the current dense state, it returns an echo_bias that nudges the activation toward recently co-active patterns — creating emergent associative memory without external training loops.

Key properties:

W_assoc is symmetric and clamped to [-1.0, 1.0]
History buffer never exceeds capacity (default: 64)
Decay factor controls forgetting rate (default: 0.75)
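
The post does not give the actual update rule, so the following is only a guess at what a structure with these properties could look like: a fixed-capacity ring buffer plus a decayed Hebbian outer-product update, with W_assoc kept symmetric and clamped. It is written in Python for brevity rather than Rust, and the class name, the `lr` parameter, and the update formula are all assumptions.

```python
from collections import deque

class EchoReservoirSketch:
    def __init__(self, n=302, capacity=64, decay=0.75, lr=0.1):
        self.n = n
        self.history = deque(maxlen=capacity)  # ring buffer: oldest state drops out
        self.decay = decay
        self.lr = lr  # hypothetical learning rate, not from the post
        self.w = [[0.0] * n for _ in range(n)]

    def observe(self, state):
        """Store a dense state and apply a decayed Hebbian outer-product update."""
        self.history.append(list(state))
        for i in range(self.n):
            for j in range(i, self.n):
                v = self.decay * self.w[i][j] + self.lr * state[i] * state[j]
                v = max(-1.0, min(1.0, v))       # clamp to [-1, 1]
                self.w[i][j] = self.w[j][i] = v  # keep W_assoc symmetric

    def echo_bias(self, state):
        """Bias vector W_assoc @ state, pulling toward recently co-active units."""
        return [sum(wij * s for wij, s in zip(row, state)) for row in self.w]

# Tiny demo with 3 units and capacity 2 instead of 302 and 64.
res = EchoReservoirSketch(n=3, capacity=2)
for s in ([1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]):
    res.observe(s)
bias = res.echo_bias([1.0, 0.0, 0.0])
print(len(res.history), bias)
```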

⚡ Tsallis α-Entmax Activation

GoldWorm does not use softmax. It uses α-entmax, a generalization that interpolates between softmax and sparsemax:

α Value	Behaviour
α = 1	Softmax — dense, all non-zero
α = 2	Sparsemax — exact zeros via simplex projection
α = 3	Sparser than sparsemax — WTA-like
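
To illustrate the α = 2 row, here is the standard sort-based sparsemax algorithm (Euclidean projection onto the probability simplex). It is a Python sketch of the well-known method, not GoldWorm's Rust implementation, and it does not cover general α.

```python
def sparsemax(z):
    """Project z onto the probability simplex; small entries become exact zeros."""
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(zs, start=1):
        cumsum += zk
        # Entry k belongs to the support while 1 + k*z_k exceeds the running sum.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k
        else:
            break
    return [max(x - tau, 0.0) for x in z]

peaked = sparsemax([2.0, 1.0, 0.1])
mixed = sparsemax([1.0, 0.8, 0.1])
print(peaked, mixed)
```

Unlike softmax, the output contains exact zeros, and a sufficiently dominant input takes all of the probability mass.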

The Quilez Bridge smooth-k parameter k anneals between determinism (sparse, k→0) and creativity (dense, k→∞):

α(k) = 1 + 2·exp(-k)

k = 0     → α = 3  (very sparse, WTA-like)
k = ln(2) → α = 2  (exact sparsemax)
k = ∞     → α = 1  (softmax, all active)
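
The three reference points above can be reproduced directly from the formula. A minimal Python sketch (the function name is just for illustration):

```python
import math

def alpha_from_k(k):
    """alpha(k) = 1 + 2*exp(-k), mapping k in [0, inf) onto alpha in (1, 3]."""
    return 1.0 + 2.0 * math.exp(-k)

for label, k in (("0", 0.0), ("ln(2)", math.log(2)), ("inf", math.inf)):
    print(f"k = {label:5s} -> alpha = {alpha_from_k(k):.4f}")
```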

📐 128-D Manifold Geometry

Every token is embedded as a 128-dimensional coordinate on a non-linear manifold, not a flat vector space.

Modified Gram-Schmidt orthogonalization preserves true multi-dimensional variance
Grassmannian fusion computes midpoints between token trajectories on the manifold
Golden-ratio partitioning splits the 128 dimensions into:
- GOLDEN_MAJOR = 79 (coarse, feedforward)
- GOLDEN_RESIDUAL = 49 (fine-grained, feedback)
- GOLDEN_OVERLAP = 5 (cross-binding bridge)

No scalar cloning across dimensions. No arithmetic shortcuts. Spatial variance is preserved at every step.

Features

🖥️ 1. Interactive Observation Dashboard

Watch the hippocampus form associations in real time.

cargo run --release --bin observe

The dashboard displays:

Activation topography — 302-D state as a 19×16 heatmap
Synaptic criticality — σ, creativity, determinism ratios
Jaccard drift — How rapidly the dense learning signal changes
Resonance trace — Recent associative chain
Hebbian strength histogram — Distribution of association weights
Live CLI — /alpha, /kappa, /auto to modulate cognition parameters

┌────────────────────────────────────────────────────────────────┐
│ GoldWorm Observation Dashboard — Step 0                        │
├────────────────────────────────────────────────────────────────┤
│ Top-10 Active Neurons: [99, 101, 169, 170, 171, 172, ...]      │
│ Synaptic Criticality: σ=1.0000 creative=0.0000 det=0.5000      │
│ Hebbian Strength: mean=0.00 median=0.00 max=0.00               │
│ Jaccard Drift: 0.0000 (stable)                                 │
│ Echo Reservoir: 0/64 states                                    │
│ Temperature: 0.50                                              │
│ Resonance Trace: (empty)                                       │
│ Synapse Topography:                                            │
│ ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ ░░░░████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ ░░░░░░░░████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ ...                                                            │
└────────────────────────────────────────────────────────────────┘
>>

💬 2. Associative Chat

A conversational REPL that learns associations in real time through EchoReservoir Hebbian updates.

cargo run --release --bin associative_chat

How it works:

Each user input is tokenized and routed through the 302-neuron connectome
The dense pre-entmax signal is captured into the EchoReservoir
The reservoir's Hebbian association matrix W_assoc updates automatically
Every subsequent response is biased by the accumulated associative memory
The response is decoded via 302-D Boltzmann energy minimization
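The W_assoc update step can be pictured as a plain Hebbian outer-product rule. This is a sketch under assumptions, not the EchoReservoir implementation. The learning rate `lr` and the function name are hypothetical, and the clamp uses the association weight range [-1.0, 1.0] from the Technical Specifications.

```rust
/// One Hebbian step on a dense association matrix:
/// w[i][j] += lr * post[i] * pre[j], then clamped to [-1, 1].
fn hebbian_update(w: &mut [Vec<f32>], pre: &[f32], post: &[f32], lr: f32) {
    for (row, &p_i) in w.iter_mut().zip(post) {
        for (w_ij, &p_j) in row.iter_mut().zip(pre) {
            // Co-active units strengthen their connection; the clamp
            // keeps every weight inside the documented range.
            *w_ij = (*w_ij + lr * p_i * p_j).clamp(-1.0, 1.0);
        }
    }
}
```

Each call touches every entry of the matrix once, so for a 302 × 302 matrix that is about 91,000 multiply-adds per update.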

Zero-trust decoding properties:

Anti-repetition penalty (banned words reduce similarity by 0.3)
Temperature clamped to [0.01, 5.0]
Max 15 words per response (bounded generation)
All scores clamped to [-1, 1] before Boltzmann draw
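The bounds above can be combined into one small function that turns similarity scores into Boltzmann probabilities. This sketch assumes the 0.3 penalty is applied before the [-1, 1] clamp (the README does not state the order), and `boltzmann_probs` is a hypothetical name. The actual draw would feed these probabilities to the seeded RNG.

```rust
/// Convert similarity scores into Boltzmann sampling probabilities.
/// `banned[i]` marks candidates that receive the anti-repetition penalty.
fn boltzmann_probs(scores: &[f32], banned: &[bool], temperature: f32) -> Vec<f32> {
    // Temperature is clamped to the documented range [0.01, 5.0].
    let t = temperature.clamp(0.01, 5.0);
    let logits: Vec<f32> = scores
        .iter()
        .zip(banned)
        .map(|(&s, &is_banned)| {
            // Assumed order: penalty first, then the [-1, 1] clamp.
            let penalized = if is_banned { s - 0.3 } else { s };
            penalized.clamp(-1.0, 1.0) / t
        })
        .collect();
    // Subtract the largest logit before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

Because scores are bounded to [-1, 1] and the temperature never drops below 0.01, the logits stay within ±100, so the exponentials cannot overflow.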

Zero-Trust Engineering

GoldWorm is designed for environments where every byte matters:

Guarantee	Implementation
OOM-safe	All matrices are pre-allocated with fixed bounds. No dynamic growth during inference.
No hidden training	The public release contains no training pipeline. `observe` and `associative_chat` are the only binaries.
Panic-free	Every fallible path returns `Result<T, CoreError>`. No `unwrap()` or `expect()` in production code.
Bounded buffers	EchoReservoir capacity: 64. Response max: 15 words. Input projection: 302×128. Synapses: 302×302.
Deterministic	All randomness uses seeded `fastrand` with fixed seed (42) for reproducible behaviour.
No external bloat	9 dependencies. No `tokio`, `axum`, `reqwest`, or `chrono`.
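To illustrate how the "Bounded buffers" and "Panic-free" rows can fit together, here is a minimal sketch of a fixed-capacity state buffer whose fallible path returns a `Result`. The `BoundedReservoir` type, the `DimensionMismatch` variant, and the oldest-first eviction policy are all assumptions. Only the capacity of 64, the 302-D state size, and the `CoreError` name come from the README.

```rust
use std::collections::VecDeque;

const CAPACITY: usize = 64; // EchoReservoir capacity from the table above
const NEURONS: usize = 302; // state dimension

#[derive(Debug, PartialEq)]
enum CoreError {
    // Hypothetical variant; the real CoreError variants are not shown in the README.
    DimensionMismatch { expected: usize, got: usize },
}

struct BoundedReservoir {
    states: VecDeque<Vec<f32>>,
}

impl BoundedReservoir {
    fn new() -> Self {
        // Allocate once up front; the buffer never grows past CAPACITY.
        Self { states: VecDeque::with_capacity(CAPACITY) }
    }

    /// Store a state, evicting the oldest one when the buffer is full.
    fn push(&mut self, state: Vec<f32>) -> Result<(), CoreError> {
        if state.len() != NEURONS {
            return Err(CoreError::DimensionMismatch { expected: NEURONS, got: state.len() });
        }
        if self.states.len() == CAPACITY {
            self.states.pop_front();
        }
        self.states.push_back(state);
        Ok(())
    }

    fn len(&self) -> usize {
        self.states.len()
    }
}
```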

Module Map

Module	Responsibility
`geometry`	128-D token coordinates, MGS orthogonalization, Grassmannian fusion, `atan2` geodesics
`bridge`	Token/logit projection via RPITIT batch traits
`worm_brain`	302-neuron connectome routing, α-entmax, signal propagation
`hippocampus`	Dual-Stream EchoReservoir + Hebbian association learning
`observation`	ANSI dashboard rendering, Jaccard drift monitoring
`storage`	Safetensors checkpoint save/load
`criticality`	Quilez smooth-k annealing for creativity/determinism
`training`	Hebbian plasticity engine with Maxwell damping
`tda`	Topological Data Analysis for activation landscape monitoring
`memory`	Synaptic echo buffer + trajectory vault
`neuron`	Dendritic tree structures (placeholder for quad-routing)

Technical Specifications

Parameter                          Value
────────────────────────────────────────────────────────────
Neuron count                       302  (C. elegans)
Manifold dimension                 128
Input projection                   302 × 128
Synaptic adjacency matrix          302 × 302 (sparse struct)
EchoReservoir capacity             64 states
EchoReservoir associations         302 × 302 dense
Max response tokens                15
Vocabulary                         10,000+ words (static)
Synapse weight range               [0.0, 1.0]
Association weight range           [-1.0, 1.0]
Temperature range                  [0.01, 5.0]
Rust edition                       2024
Minimum Rust version               1.85

Quick Start

Prerequisites

Rust 1.85+ (rustup update)
A trained_worm_v1.safetensors checkpoint (or the engine will boot from a fresh baseline)
static_vocabulary.txt (10,000+ word list, one per line)

Observation Dashboard

cargo run --release --bin observe

Commands:

/help — Show all commands
/alpha <f32> — Set echo blend strength (0.0–1.0)
/kappa <f32> — Set gate threshold (0.0–1.0)
/auto — Toggle auto-refresh mode (250ms)
/quit — Exit
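A command set like this maps naturally onto an enum plus a small parser. The sketch below is illustrative only: the `Command` type and function names are hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption about how the CLI behaves.

```rust
#[derive(Debug, PartialEq)]
enum Command {
    Help,
    Alpha(f32),
    Kappa(f32),
    Auto,
    Quit,
}

/// Parse a value that must lie in the documented 0.0–1.0 range.
fn parse_unit(arg: Option<&str>) -> Result<f32, String> {
    let raw = arg.ok_or_else(|| "missing value".to_string())?;
    let v: f32 = raw.parse().map_err(|_| format!("not a number: {raw}"))?;
    // NaN and infinities fail this range check as well.
    if (0.0..=1.0).contains(&v) {
        Ok(v)
    } else {
        Err(format!("out of range 0.0-1.0: {v}"))
    }
}

/// Parse one input line into a command, never panicking on bad input.
fn parse_command(line: &str) -> Result<Command, String> {
    let mut parts = line.split_whitespace();
    match parts.next() {
        Some("/help") => Ok(Command::Help),
        Some("/alpha") => parse_unit(parts.next()).map(Command::Alpha),
        Some("/kappa") => parse_unit(parts.next()).map(Command::Kappa),
        Some("/auto") => Ok(Command::Auto),
        Some("/quit") => Ok(Command::Quit),
        Some(other) => Err(format!("unknown command: {other}")),
        None => Err("empty input".to_string()),
    }
}
```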

Associative Chat

cargo run --release --bin associative_chat

Type naturally. The EchoReservoir learns associations between your inputs and its responses in real time. No external training loop is required.

Optional: CUDA Acceleration

cargo run --release --features cuda --bin observe

Requires candle-core with CUDA support and an NVIDIA GPU.

The Science Behind GoldWorm

Why C. elegans?

Caenorhabditis elegans was the first organism with a completely mapped connectome. Every neuron (302), every synapse (~7,000), and every gap junction has been catalogued by electron microscopy (White et al., 1986). Its small, fully documented wiring makes it a good substrate for a transparent, inspectable AI: no black-box weights, no billion-parameter mysteries.

Why Dual-Stream?

The brain separates what you do (sparse action) from what you learn (dense prediction). If a network tries to learn from its own sparse outputs, it collapses into a self-reinforcing loop. GoldWorm's Dual-Stream design ensures that associative learning happens on the full, dense signal, while action selection happens on the sparse, efficient signal.

Why Hebbian?

"Neurons that fire together, wire together." Hebbian plasticity is among the simplest and most biologically grounded learning rules. It requires no backpropagation, no gradient descent, no external optimizer. It is local, online, and linear in the number of weights updated, a good fit for a zero-trust engine that must run on a single CPU core.

License

MIT — See LICENSE for details.

https://github.com/loslos321-lab/GoldWorm.git

u/CraigWidow — 6 days ago

▲ 24 r/LocalAIServers

What Custom AI Workstation Matches a Fully Loaded MacBook Pro M5 Max?

I’m curious what workstation would provide similar or better performance for AI workloads compared to a 16” MacBook Pro M5 Max with 128GB unified memory and a 2TB SSD.

The metrics I’m most interested in are:
- LLM inference speed
- Model loading time
- Fine tuning and training performance
- Running large models locally
- Overall AI development experience

I’d appreciate recommendations for GPU(s), CPU, RAM, and storage, plus an estimated cost.


u/learn_all — 7 days ago

▲ 11 r/LocalAIServers

Running a fully offline RAG on a Corsair AI Workstation 300 (Strix Halo, 128GB unified) — from scratch, no cloud

Saved up and grabbed a Corsair AI Workstation 300 a couple months back. Strix Halo, 128GB unified memory. I'd been eyeing a proper local box forever and finally talked myself into it — because paying OpenAI rent every month to read my own documents back to me started feeling a little stupid 😅

First real project: an offline RAG over my book collection. Built from scratch, no frameworks, all through Ollama. Nothing leaves the machine — that was the one rule I refused to break. My data, my box, my problem. The way it should be 🔒

The unified memory is the part that genuinely sold me. I keep a 122B-class model sitting in memory right next to the embedding and reranker models, all at once, no VRAM juggling. If you've ever played the "unload this to load that, oh wait now reload the first one" game, you know exactly the pain I'm escaping 😂 Watching the whole pipeline run local without it still feels a little unreal.

Now, was it all smooth? Absolutely not 💀 The AMD/ROCm side fought me like it owed me money. Lost a few evenings to stuff that "should just work" (we've all been there, don't lie). But every time it actually answers a question straight out of one of my books, I forgive it instantly 🥲

Curious what the rest of you are running on Strix Halo / unified-memory rigs — and what you'd point this thing at next. Pretty sure I've barely scratched what it can do, and I'm having way too much fun finding out.


u/Hungry-Horror-7577 — 7 days ago

Field	`.20` validation lane	`.30` validation lane
System vendor/model	GIGABYTE `G292-Z20-00`	GIGABYTE `G292-Z20-00`
System firmware	`R23`, firmware date `2021-09-06`	`R23`, firmware date `2021-09-06`
CPU	1x AMD EPYC 7F32 8-Core Processor	1x AMD EPYC 7F32 8-Core Processor
CPU topology	8 cores / 16 threads, SMT on, 1 socket	8 cores / 16 threads, SMT on, 1 socket
CPU clocks reported	min `2500 MHz`, max `3700 MHz`, boost enabled	min `2500 MHz`, max `3700 MHz`, boost enabled
L3 cache	`128 MiB`	`128 MiB`
System memory	`125 GiB` visible	`125 GiB` visible
OS	Ubuntu `24.04.2 LTS`	Ubuntu `24.04.2 LTS`
Kernel	`6.8.0-52-generic`	`6.8.0-52-generic`
ROCm-SMI driver version	`6.8.5`	`6.8.5`
Root disk	`447.1G` Crucial `CT480BX500SSD1` SATA SSD	`447.1G` Crucial `CT480BX500SSD1` SATA SSD
Local model/runtime NVMe	`1.7T` KIOXIA `KCD6XLUL1T92`; validation-local mount path omitted	`1.7T` KIOXIA `KCD6XLUL1T92`; validation-local mount path omitted
GPU count	8x AMD GFX906 / Vega 20	8x AMD GFX906 / Vega 20
GPU PCI device	`1002:66a1`, rev `02`	`1002:66a1`, rev `02`
GPU SKU/subsystem	SKU `D1631700`, subsystem `0x0834`	SKU `D1631700`, subsystem `0x0834`
GPU VBIOS	`113-D1631700-111` on all 8 GPUs	`113-D1631700-111` on all 8 GPUs
GPU VRAM visible	`34342961152` bytes per GPU, all 8 GPUs	`34342961152` bytes per GPU, all 8 GPUs
GPU BAR0 visible	`34359738368` bytes per GPU, all 8 GPUs	`34359738368` bytes per GPU, all 8 GPUs
GPU BAR2 visible	`2097152` bytes per GPU, all 8 GPUs	`2097152` bytes per GPU, all 8 GPUs
GPU PCI bus IDs	`06:00.0`, `09:00.0`, `45:00.0`, `48:00.0`, `89:00.0`, `8c:00.0`, `c5:00.0`, `c8:00.0`	`06:00.0`, `09:00.0`, `45:00.0`, `48:00.0`, `89:00.0`, `8c:00.0`, `c5:00.0`, `c8:00.0`
NUMA reporting	GPU NUMA node reports `-1`; local CPU list `0-15`	GPU NUMA node reports `-1`; local CPU list `0-15`
BMC/display adapter	ASPEED VGA controller present	ASPEED VGA controller present
Fabric/network observed	Mellanox InfiniBand present; additional Mellanox Ethernet present	Mellanox InfiniBand present; additional Mellanox Ethernet present
