r/LocalAIServers

▲ 13 r/LocalAIServers+5 crossposts

Mac Pro 2019 Local AI Guide: Ubuntu 24.04, ROCm 7.2.3, PyTorch 2.10, and Infinity Fabric Link

I am very excited about the future of local AI. With the spread of AI agents, the amount of VRAM now achievable locally, the quality of small and medium LLMs, and the community growing around all of this, the future is looking very good.

I am writing this to document my successes with the following:

  • Mac Pro 2019
  • Ubuntu 24.04.4 LTS (Ubuntu Server specifically, in my case)
  • Dual AMD Radeon PRO W6900X with Infinity Fabric Link Bridge
  • Dual AMD Radeon PRO W6800X Duo with Infinity Fabric Link Bridge
  • ROCm 7.2.3
  • PyTorch 2.10
  • Triton 3.6
  • vLLM (Write up pending)
  • Hermes Agent (Research Pending)

I wrote a couple of old guides. Check them out for reference, as needed:

I'm going to focus on setting up Ubuntu and all the packages needed for the infrastructure of local AI.

Important: This is an experimental community guide. Some parts involve patched kernels, unsupported GPU configurations, and boot-level PCIe changes. This worked for my Mac Pro 2019 systems, but you should expect troubleshooting, and you should be comfortable recovering from a failed boot. I am not responsible for any outcome of using this guide, whether it be positive, negative, or anything in between.

1. Choices & Decisions

  • Mac Pro 2019: It's what I had available to me.
  • W6900X: It's what I had available to me.
  • W6800X Duo: It's what I had available to me.
  • Ubuntu LTS: The ROCm-supported OS family I am most comfortable with. Alternative: RHEL
  • Ubuntu 24.04 LTS: The latest Ubuntu LTS version supported by ROCm at the time of writing. Alternative: Ubuntu 22.04 LTS
  • Ubuntu Server: To avoid desktop overhead and keep the system headless. Alternative: Ubuntu Desktop LTS
  • Data Room: I placed the Macs in a Data Room, so I don't hear the loud fans. Alternative: Place it at your desk, or anywhere else.
  • DRM/AMDGPU: I opted to use the GPU driver in the kernel, to patch it to support the Infinity Fabric Link Bridge. Alternative: Install DKMS and AMDGPU.
  • Kernel: Patched Ubuntu 6.17 HWE kernel, based on Ubuntu’s linux-hwe-6.17 source package, to support the Infinity Fabric Link Bridge. Alternative: Standard Ubuntu kernel.
  • ROCm: AMD’s CUDA alternative for AMD GPUs. Alternative: Vulkan
  • ROCm 7.2.3: Latest ROCm that supports my GPUs at the time of writing. Alternative: Outdated ROCm.
  • vLLM: Concurrent utilization of loaded LLMs. Alternative: Ollama & Llama.cpp
  • Hermes Agent: More tool-savvy and self-learning. Alternative: OpenClaw
  • GitHub: All my files and commands have been uploaded to GitHub, to make this guide shorter than 40,000 characters. Alternative: Multiple Guides...

Please let me know if the GitHub links do not work.

These are the choices I made, and I am still refining them. They work for me. Keep in mind that this is all held together with the digital equivalent of duct tape. If you change anything, it may or may not work. If you do, I would genuinely appreciate hearing what you tried, what worked, what failed, and why you changed it.

2. Setting up Ubuntu after Installation

Step 00: Infinity Fabric Link (Jumper & Bridge)

Please remove the Infinity Fabric Link Jumper(s) or Bridge from the GPU. Ubuntu 24 kernels do not currently support it, as of 6.17.

Specifically, with kernel 6.8, none of the GPUs will work. When upgrading to 6.17, only one GPU will work.

If you have an Infinity Fabric Link Jumper or Bridge, follow the patch section later in the guide to make it work with your GPUs.

Step 01: Update, Upgrade, and Tweak the System

What we will do:

  • Change ubuntu.sources from http to https
  • Attach to Ubuntu Pro (This is optional, and requires interaction)
  • Update & Full-Upgrade
  • Upgrade to the latest HWE kernel
  • Remove cloud-init
  • Make all Ethernet ports accept DHCPv4 automatically
  • Modify Grub to include "loglevel=7 log_buf_len=16M iommu=pt" kernel flags
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2001%3A%20Update%2C%20Upgrade%2C%20and%20Tweak%20the%20System" | bash

Step 02: Install T2 Linux Repository

Since we are using a Mac Pro 2019, which is a Mac with a T2 chip, some additional packages are required to be able to properly communicate with the hardware.

What we will do:

  • Set up the T2 Ubuntu 24 (Noble) Repository
  • Install 3 Packages: applesmc-t2 apple-bce t2fanrd
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2002%3A%20Install%20T2%20Linux%20Repository" | bash

Step 03: Enable T2 Fan Daemon

After installing the T2 packages, the command below is used to activate the fan service.

What we will do:

  • Enable the t2fanrd systemd service

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2003%3A%20Enable%20T2%20Fan%20Daemon" | bash

Step 03-Optional: Set Fans to Maximum

I do not trust Apple Cooling. I would rather the fans wear out and replace them for a few dollars, versus the GPUs (especially the Duo models) being damaged due to overheating.

What we will do:

  • Set all 4 fans to maximum speed

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2003-Optional%3A%20Set%20Fans%20to%20Maximum" | bash

Step 04: Download and Install ROCm 7.2.3

This section will install ROCm 7.2.3, but it will NOT install dkms or amdgpu drivers. I opted to use the kernel driver, drm/amdgpu, so I can later patch it to support the Infinity Fabric Link Bridge.

What we will do:

  • Make a new directory to save all downloaded files
  • Download ROCm installer
  • Install ROCm Dependencies
  • Install ROCm
  • Give all users access to ROCm
  • Add ROCm to path
  • Show you a bunch of output displaying your GPUs, which are working with ROCm or the driver, etc.
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2004%3A%20Download%20and%20Install%20ROCm%207.2.3" | bash

Step 05: Install Python Tools

We will be using Python and pip to install several packages for local AI. The following commands are to set up the correct versions, as well as some quality of life choices.

What we will do:

  • Install these packages: 2to3 python-is-python3 python3-pip python3-venv python3-dev python3-setuptools
  • Install or upgrade these packages, system wide: pip wheel setuptools
  • Install numpy 1.26.4 specifically, system wide

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2005%3A%20Install%20Python%20Tools" | bash

Step 06: Install PyTorch & Other ROCm Related Wheels

Not everything here is needed for everyone. I included what I could, what worked, and what had some value to some local AI use case.

What we will do:

  • Install PyTorch Wheels
  • Add AMD ROCm APT Repository
  • Set AMD ROCm Apt Repository at priority 700 (Higher than Ubuntu)
  • Fix some ROCm Symlinks conflicting with MIGraphX
  • Install MIGraphX & Half packages
  • Install ONNX Runtime package
  • Install TensorFlow ROCm package
  • Install Apex Wheel
  • Clean up packages
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2006%3A%20Install%20PyTorch%20%26%20Other%20ROCm%20Related%20Wheels" | bash

Step 07: Verifying Everything

We just completed installing everything in the standard way. We just need to verify that everything is now set up correctly.

What we will do:

  • Give you several boxes showing the status of everything we just set up

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2007%3A%20Verifying%20Everything" | bash

3. Infinity Fabric Link Jumper / Bridge

AMD released several GPUs specifically for the Mac Pro 2019 that support their Infinity Fabric.

These GPUs and the Infinity Fabric Links are discussed in these posts:

The first set of GPUs that support it were the AMD Radeon PRO Vega II & Vega II Duo. The PC equivalent is an AMD Radeon PRO VII, which also supports an Infinity Fabric Link.

The second set of GPUs are the AMD Radeon PRO W6800X, W6800X Duo, and W6900X. These GPUs are in the Sienna Cichlid family of GPUs. Also referred to as RDNA2.

At the announcement of the Sienna Cichlid family, these GPUs were marketed as supporting xGMI. The Infinity Fabric Link is the physical bridge / jumper. xGMI is the software path that allows the GPUs to communicate over that link. However, on release, only the Apple MPX GPUs actually supported the Infinity Fabric Links, while the standard versions did not.

This might explain why support for xGMI on Sienna Cichlid was added between 2019 and 2020 to the Linux kernel drm/amdgpu, but later removed in 2022.

Many of us here in the subreddit tried to figure out the problem with the Infinity Fabric Link, and tried to find a solution to it. One such redditor actually cracked it; creating a patch to the current kernel drm/amdgpu driver, which through my testing seems to have completely solved the Infinity Fabric Link regression that happened in 2022.

You'll need to keep in mind that this is just the first step. While we are moving forward, there is still the question of ROCm support, HIP support, and everything else.

Step 01: Download, Build, & Install the Patched Kernel Files

Let's start. We will do the following:

  • Make a directory to download kernel source
  • Install packages required to patch the kernel
  • Activate the source to download kernel source
  • Patch drm/amdgpu
  • Build a full patched kernel
  • Install the patched kernel

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/3.%20Infinity%20Fabric%20Link%20Jumper-Bridge/Step%2001%3A%20Download%2C%20Build%2C%20%26%20Install%20the%20Patched%20Kernel%20Files" | bash

With this, you are now the proud user of a patched kernel that supports the Infinity Fabric Links on the Sienna Cichlid MPX GPUs.

At this point, shut the system down, reinstall the Infinity Fabric Link Jumper or Bridge, then boot back into the patched kernel.

Step 02: Verify Patched Kernel & GPU Initialization

We should probably run a verification one last time. Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/3.%20Infinity%20Fabric%20Link%20Jumper-Bridge/Step%2002%3A%20Verify%20Patched%20Kernel%20%26%20GPU%20Initialization" | bash

While more testing is still needed, this is quite the achievement for the community. Thank you again, anonymous redditor.

4. AMD Duo MPX GPUs and Setting BAR Correctly

I have been using my Mac Pro 2019 with Dual AMD Radeon PRO W6800X Duo for local AI inference for some time now, and I have not had any BAR-related problems. However, since I moved from using Proxmox to having Ubuntu 24 on bare-metal, I have started noticing some BAR warnings and errors.

It seems that this problem may come from the way the Mac Pro firmware allocates PCIe resources before Linux takes over, specifically when using Duo MPX GPUs.

One redditor, whose account is now deleted, shared a GitHub link to what I can only describe as someone's documentation of how he fixed the BAR issue on Vega II Duo GPUs. I have dubbed this the nbritton's method.

Our goal now is to use nbritton's method, adapted for the W6800X Duo. I tried to make it also work as a copy and paste solution for the Vega II Duo as well, but I have not tested it.

Warning: This changes GPU driver load order and PCIe BAR allocation behavior. If something goes wrong, you may need to boot from a recovery kernel, remove the service, or undo the GRUB changes. Also, note that SGLang's AMD GPU documentation recommends pci=realloc=off iommu=pt, which conflicts with nbritton's method because nbritton's method depends on PCIe BAR reallocation behavior. In other words, pci=realloc must not be disabled for this method.

Let's start.

We will do the following:

  • Blacklist amdgpu
  • Add pci=realloc to grub
  • Configure resize-gpu-bars.service
  • Set up nbritton's method files

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/4.%20AMD%20Duo%20MPX%20GPUs%20and%20Setting%20BAR%20Correctly" | bash

5. Finalize the Infrastructure

After completing the linked sections above, we should have:

  • Install Ubuntu (You did this on your own or using a previous guide)
  • Prepare Ubuntu's environment
  • Set up T2 related environment
  • Installed ROCm
  • Installed PyTorch and several other local AI optimizing software
  • Patched the kernel (linux-hwe-6.17, source 6.17.0-29.29~24.04.1) to support xGMI and the Infinity Fabric Link Bridge and Jumper.
  • Set up nbritton's method for Duo MPX GPUs BAR correction

Once you're done, please reboot to make sure everything sticks. Then repeat step 07: Verify Everything, above to verify everything is correct and as it should be.

6. Local AI

Now that the infrastructure is ready, it's time to move to our frameworks of choice.

While I definitely plan to expand, I have focused mainly on text generation. When I first started, consideration was Ollama, Llama.cpp, and vLLM. I see new options now, such as SGLang as well.

I am excited to share that vLLM supports this setup and works well. I hope to release a separate guide for it soon.

For the purpose of this guide, I will continue with Ollama, for the simplicity of it, and a Hello World type scenario.

Step 01: Install and Configure Ollama

We will do the following:

  • Set up Ollama
  • Fix ollama.service vs. ollama serve separate model libraries

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/6.%20Local%20AI/Step%2001%3A%20Install%20and%20Configure%20Ollama" | bash

Step 02: Verify Ollama Setup

We will do the following:

  • Verify Ollama services and data folders permissions

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/6.%20Local%20AI/Step%2002%3A%20Verify%20Ollama%20Setup" | bash

Step 03: Download and Run Models

We will do the following:

  • Download and run our first model

Copy the following command into your command line interface of choice:

ollama run qwen3.5:0.8b --verbose

You can find more models on Ollama's website. Below are some other models I am considering:

ollama pull qwen3.6:27b
ollama pull gemma4:31b-it-q4_K_M
ollama pull granite4.1:30b
ollama pull medgemma:27b
ollama pull mistral-medium-3.5:128b
ollama pull gpt-oss:120b
ollama pull qwen3.5:122b
ollama pull nemotron-3-super:120b

7. Done

With this, we are done with this guide.

It has been a long journey setting up this infrastructure, and preparing for the actual goal.

My testing was done on Mac Pro 2019 systems with dual W6900X MPX modules and dual W6800X Duo MPX modules. I have not tested this with Vega II or Vega II Duo MPX GPU modules.

Next, I plan to focus on vLLM for a while. Optimization, quantization, and automation of operations.

After that, I hope to dive into Hermes Agent by Nous, with the hope of building multiple agents around a few local models run on vLLM, communicating and working together.

Expanding to images or vision, as well as to voice, is also down the pipeline.

The possibilities are endless. I hope to hear what everyone else experiences with this guide and with local AI in general: what worked, what failed, what workloads you are running, what use cases you care about, what problems you hit, and what solutions you found.

Looking forward to seeing how everyone takes advantage of this guide, and local AI.

8. Credit

Credit where credit is due. A lot of the information here was gathered from the community in bits and pieces.

I do want to take the opportunity to thank the anonymous redditor for his/her contribution (creating the whole kernel patch). THANK YOU!

  • Nikolas Britton for the nbritton method, fixing the BAR issue on the AMD Duo MPX GPUs.

  • u/AdityaGarg8 for always being supportive, no questions asked.

  • My AI of choice, for the support through all of this.

  • r/MacPro2019LocalAI redditors, for keeping in touch, and motivating me to continue going. You guys are the real MVPs.


Disclaimer: I wrote this post myself. I also used AI as a tool to help clean up the wording and formatting.

Resources:

reddit.com
u/Faisal_Biyari — 1 day ago

I need some advice about my future computer to rum AI models locally

I am a psychologist.

I want to run AI locally for confidentiality reasons.

What I want to do is take the audio files from my sessions (with the patient's consent), transform it into a *.srt files via Whisper / faster-whisper and run that file to make my notes, get insights about sessions from the AI, write some reports, analyze the interventions of my supervisees, etc.

i would like to know what the community would think of this setup:

1 x Lian Li LANCOOL 217 Noir Tempered Glass ATX Mid-Tower (LAN217X)

1 x ASUS TUF Gaming 1200W 80 Plus Gold ATX 3.0 PCIe 5.0 Alimentation Modulaire Complète (TUF-GAMING-1200G)

1 x ASUS ProArt X870E-CREATOR WIFI AM5 ATX AMD X870E 4xDIMM DDR5 4xM.2 USB4 10Gb+2.5Gb LAN

Wi-Fi 7+BT Motherboard

1 x AMD Ryzen 7 9700X 3.8/5.5Ghz 8C/16T Socket AM5 65W ZEN 5 CPU Processor (100-100001404WOF) 449.99 $

1 x Arctic Liquid Freezer III Pro 240 Noir 240mm AIO Liquid CPU Cooler (ACFRE00178A) 139.99 $

1 x Kingston Fury Beast DDR5 Noir 5600MHz 64GB Kit (2x32GB) CL36 AMD EXPO RAM (KF556C36BBEK2-64)

1 x MSI GeForce RTX 5070 Ti 16G SHADOW 3X OC GDDR7 PCIe 5.0 1xHDMI/3xDP Video Card (G507T-16S3C)

1 x Lexar NM790 1TB NVMe PCIe Gen4 x4 M.2 80mm SSD (LNM790X001T-RNNNG)

If I had more money to spend, should I take a second video card or buy more RAM?

Would you have any replacement suggestions? Anything you would make different?

reddit.com
u/seb734 — 1 day ago
▲ 45 r/LocalAIServers+4 crossposts

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

Hey r/DeepSeek,

Who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget machine that successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally!

Surprisingly, we managed to hit around 255 prefill tokens/s with a very tight memory budget.

https://preview.redd.it/cfefgc71732h1.png?width=1772&format=png&auto=webp&s=5c673acca7a2a73cfbd0d2059e25102462c56dfc

Here is a quick breakdown of how we achieved this "legacy donkey pulling a massive MoE chariot" feat via hardware-software co-optimization:

⚡️ The Technical Breakthroughs

  1. Custom Turing CUDA Kernels: The 2080 Ti Tensor Cores are still capable, but PCIe Gen3 and VRAM bandwidth are huge bottlenecks. We rewrote custom CUDA kernels tailored specifically for the Turing architecture to accelerate W8A8 (INT8) matrix multiplication, heavily alleviating the bandwidth choke.
  2. Heterogeneous Inference: Optimized static memory splitting and dynamic offloading between the 4x 11/22GB VRAM and 1TB system RAM. 100% of the hardware capacity is utilized.
  3. Computation-Communication Overlap: Implemented a pipelined execution strategy to hide the massive multi-GPU communication overhead caused by MoE routing.

https://preview.redd.it/5ltwol3z632h1.png?width=2414&format=png&auto=webp&s=6c4c4dcf62737f7f5dcb9a5b8d4aa3f422f7edae

🖥️ Budget Hardware Specs

  • CPU: Intel Xeon E5-2696 v4 (The classic budget king for multi-core)
  • GPU: 4x RTX 2080 Ti (11/22GB each)
  • RAM: 1TB DDR4 ECC

The entire implementation, deployment script, and preliminary tech report are 100% open-sourced. I'd love to hear your thoughts, benchmarks, or feedback from fellow system/compiler hackers here!

🔗 GitHub Repository:https://github.com/lvyufeng/deepseek-v4-2080ti

(Note: I submitted the detailed report to arXiv a few days ago, but it’s currently caught in the manual moderation queue—likely because a rookie author throwing a 2080 Ti at DeepSeek-V4 triggered their review boundaries lol. Will update with the arXiv link once it's cleared!)

https://reddit.com/link/1thlbwe/video/lxhccfh2732h1/player

reddit.com
u/Known_Ice9380 — 3 days ago
▲ 3 r/LocalAIServers+2 crossposts

[Benchmarking] Running 3 LLMs concurrently inside a strict 10MB VRAM budget at 0.12ms/token (Empirical Results)

There is a common consensus that to run multiple LLMs concurrently at high throughput, you need a high-end setup with massive VRAM allocations. I wanted to test the limits of what is possible on standard, everyday consumer hardware.

I compiled and ran a benchmark of NexaQuant v2.0, an inference engine optimized for 1.58-bit Ternary QuantizationVRAM Virtualization (M3), and SIMD AVX2/FMA/GPU assembly-level kernels.

Here are the empirical results, latency numbers, and memory metrics recorded on standard consumer hardware.

📊 1. Memory Overhead & Swapping Latency

We mapped three models simultaneously (Alpha: 4MB, Beta: 8MB, Gamma: 12MB) under a strict, artificially enforced 10 MB VRAM budget.

  • Zero-Copy Memory Overhead: 0.0% double-buffering. By backing the GGUF models directly in Host RAM via mmap, physical memory mapping overhead was literally zero.
  • Dynamic Layer Eviction (LRU): When a model activation exceeded the VRAM budget, the scheduler freed old layers and loaded the target weights.
  • Page-In / Eviction Latency: $< 0.1$ milliseconds. Because 1.58-bit ternary layers are extremely compact, weight swapping between CPU host memory and GPU memory cache slots is virtually instantaneous, causing zero user-perceivable bottleneck.

⚡ 2. Latency & Core Performance (CPU AVX2/FMA SIMD)

When running in classic interactive chat mode with a real TinyLlama GGUF model:

Metric Measured Value Note
Token Latency 0.12 ms / token Extremely low latency on consumer CPU cores
Throughput 8.2 GB / s FMA/AVX2 cache optimization active
Layer Processing > 500,000 layers / sec Highly optimized zero-branching assembly logic
Core Affinity Efficiency 100% Physical Core Pinning Avoids hyperthreading context-switching overhead

🖥️ 3. Multitasking Efficiency & Background Footprint

To test real-world resilience, we ran the benchmark under an active multitasking workload:

  • Host OS running background processes (AI agent executor, compilation tools, system services).
  • Google Chrome open with active, content-heavy tabs.

Results under load:

  • Zero CPU Throttling: Thanks to hardware-specific pinning, the engine maintained stable latencies with less than 1.5% jitter even when system threads fluctuated.
  • Colder Execution: By replacing standard matrix multiplications with optimized ADD/SUB operations (due to ternary $-1, 0, 1$ states), the CPU remained colder, preventing thermal throttling during extended inference loops.

🧪 Automated Math Verification (Integrity Test)

Before running the benchmarks, our automated test suite (tests.cpp) verified the mathematical precision of the AVX2 SIMD kernel against a double-precision sequential reference run:

  • Expected Output (Sequential): -0.500001
  • Computed Output (AVX2/FMA SIMD): -0.5
  • Numerical Precision Delta: $1.37 \times 10^{-6}$
  • Test Run Duration: 0.004 seconds for the entire suite.

🛠️ Try it on your own hardware

The code compiles out-of-the-box on standard Windows (MinGW/GCC) and Linux/WSL environments with zero external compile-time library dependencies.

GitHub Link: https://github.com/Nexa1nc/NexaQuant

Developed by Nexa1nc with the philosophy of extreme, hardware-level optimization.

reddit.com
u/WeAreNex4_ — 3 days ago
▲ 2 r/LocalAIServers+1 crossposts

Help building a homelab

Hi, I want to build a server to do multiple things, mainly running AI models (maybe training) and pipelines.
I have the following build, but it's my first time building one and I'd be happy to receive any kind of feedback, wether they are mistakes I made or possible improvements.

CPU: Ryzen 5 7600X
Cooler: Thermalright Peerless Assassin 120 SE
Motherboard: MSI PRO X870E-P WIFI
RAM: 32GB DDR5-6400
GPU: RTX 3060 12GB
Storage: WD SN850X 1TB NVMe

The GPU is old, but wanted to get something with enough vRAM to run medium-sized models.

Full list on https://pcpartpicker.com/list/QCzyK7

reddit.com
u/Plenty-Construction9 — 2 days ago
▲ 2 r/LocalAIServers+1 crossposts

Is the RX6800 worth it for Local inference over my 3070 + 3060 build ?

Hello, I would like to get some insight if anybody has used an RX6800 local ai build, currently I have an rtx 3070 and RTX 3060 12gb, I was thinking about the following possibilites after the only 3090 deal in my local market is gone:

1- adding another 3060 12gb and ditch the 3070 ?
2- selling both cards and getting 2xRX6800 for the 32GB VRAM total
3- getting something else like a 3080 ti or an Nvidia 12gb card

The idea for my build that I am now too limited in terms of context size for Qwen3.6 27b as I can't exceed ~32k tokens context size. I am also thinking about image and video gen possibilities in the future, but that is not my main goal for the local build, it is mainly text gen.

With my 3070+3060 build, I am getting 16-20 tps and a 300-500tps during pp, it could be enough for chat and some agentic work, but slow for code gen.

there is actually no RTX 3090 in my local market, and the budget is limited, that is why I thought about the RX6800 since it has 512.0 GB/s bandwidth and 16GB VRAM.

I would appreciate your advice.

Thank you

reddit.com
u/allqbanks — 3 days ago

Would you rather have a 2025 Mercedes-Benz, cash downpayment on a $500,000 home, or this? All going in a Corsair 9000D Airflow next week.

u/DeedleDumbDee — 6 days ago
▲ 50 r/LocalAIServers+5 crossposts

New Asus Flow Z13 KJP Edition Laptop Purchased - Guidance Needed for Dev Env Setup

Good People.

I purchased a new Asus Flow Z13 KJP Edition Strix Halo laptop (cum tablet). It’s got 128 GM unified memory RAM machine, and I specifically bought it for running local LLMs. I have been into Apple ecosystem and have been a Mac OS user since 18 years. However, as the Mac Studio prices are beyond my affordability, have gone ahead with this. I have been into software web development for quite sometime and I am comfortable with trying out new things.

I would like to get some guidance around the setting up this machine for development, specifically Web, Mobile Apps and running Local LLM. I have bought a Windows 11 Pro license and upgraded to it, however I would like to also dual boot it to Ubuntu or a specific flavour of Linux that is optimised for performance and general broad support for drivers and development packages easy availability. The idea is not to spend a lot of time tinkering or learning the Linux side of things but to optimise the machine for local LLM running.

Coming from Mac OS to Windows and Linux, I also want to get your inputs if there are any particular things that would help me along this journey that I am starting.

Please do share your inputs and thoughts on the comments, not just specific to what I am looking for but anything and everything that will be of help to me, I would really appreciate it.

Thank you.

u/bmanojk — 6 days ago
▲ 126 r/LocalAIServers+17 crossposts

Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline

Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out. ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT.

Pipeline (8 stages, all sequential on the same GPU):

  1. Director Agent - Qwen3.5-35B-A3B (vLLM + AITER MoE) plans 6 shots from one sentence, returns structured JSON with character bibles, shot prompts, music brief, per-shot voice-over script, narration language
  2. Character masters - FLUX.2 [klein] paints one canonical portrait per character. No LoRA training step - reference editing pins identity across shots by construction
  3. Per-shot keyframes - FLUX.2 again with reference image. Sub-second per keyframe after warmup
  4. Animation - Wan2.2-I2V-A14B, 81 frames @ 16 fps native. FLF2V for cut:false continuation arcs (last frame of shot N anchors first frame of shot N+1)
  5. Vision critic - same Qwen3.5-35B reloaded with 10 structured failure labels (character drift, extras invade frame, camera ignored, walking backwards, object morphing, hand/finger artifact, wardrobe drift, neon glow leak, stylized AI look, random intimacy). Bad clips re-render with targeted retry strategies (different seed, FLF2V anchor, prompt simplification)
  6. Music - ACE-Step v1 generates a 30s instrumental from Director's brief
  7. Narration - Kokoro-82M, 9 languages. Director picks language to match setting (Tokyo→Japanese, Paris→French, Mumbai→Hindi)
  8. Mix - ffmpeg with per-shot vo aligned via adelay

Wan 2.2 specifics (the bit this sub will care about):

  • 1280×720, not 640×640 default. Costs more but matches what producers want
  • 121 frames at 24 fps was my first attempt - gave temporal rippling. Switched to 81 @ 16 fps native (the distribution Wan was trained on) and it cleaned up
  • flow_shift = 5 for hero shots, 8 for b-roll (upstream wan_i2v_A14B.py defaults)
  • Negative prompt: verbatim Chinese trained negative from shared_config.py. umT5 was multilingual-pretrained against those exact tokens. English translation is observably weaker
  • Camera language: ONE camera verb per shot, sentence-case, placed first ("Tracking shot following from behind"). Multiple verbs in one prompt cancel each other out
  • Avoid the word "cinematic" - triggers Wan's stylization branch, gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain")

Performance work:

  • ParaAttention FBCache (lossless 2× on Wan2.2)
  • torch.compile on transformer_2 (selective, the dual-expert MoE makes full compile flaky) - another 1.2×
  • AITER MoE acceleration on Qwen director (vLLM)
  • End-to-end: 25.9 min → 10.4 min per 720p clip on MI300X

Why a single MI300X: 192 GB HBM3 lets a 35B MoE, 4B diffusion, 14B I2V MoE, 3.5B music, and a TTS share the same card sequentially. Same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face (documentation, like this space 🙏) https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

Live demo on HF Space is temporarily offline while infra restores - should be back within hours. In the meantime the showcase reels in the repo are real pipeline outputs, no human re-edited shots.

Happy to dig into AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in comments.

u/Inevitable-Log5414 — 8 days ago

Need advice for a $10,000 AI workstation build (video, image, voice, LLMs, training, everything)

Need advice for a $10,000 AI workstation build (video, image, voice, LLMs, training, everything)

I’m planning to go very deep into the AI space and I want to build a serious workstation with around a $10,000 budget.

Main use cases:

- Local LLMs
- AI image generation
- AI video generation
- Voice cloning / speech models
- Fine-tuning and training
- Running multiple AI tools simultaneously
- Heavy VRAM workloads
- Stable Diffusion / Flux / ComfyUI
- Open-source models
- Maybe some game dev / rendering too

I want something that will still be powerful and relevant for the next few years instead of becoming obsolete immediately.

What hardware configuration would you recommend today for this budget?

Questions I’m specifically confused about:

  1. CPU:
    Should I go Intel or AMD for AI workloads?
    Is Intel actually better for compatibility/stability or is AMD better now?

  2. GPU:
    I know NVIDIA is basically mandatory for CUDA, but which setup makes the most sense?

- Single RTX 5090?
- Dual 4090s?
- Multiple GPUs?
- Used enterprise GPUs?
- Wait for newer cards?

  1. Motherboard:
    Does Intel CPU + NVIDIA GPU + Intel motherboard work “best together” in terms of compatibility/stability?

Or does motherboard brand/platform not really matter much as long as PCIe lanes, RAM support, and power delivery are good?

  1. RAM:
    How much RAM is realistically needed now?
    128GB?
    256GB?

  2. Storage:
    What’s the smartest storage setup for AI workloads?
    Separate NVMe drives for models/cache/projects?

  3. Cooling + PSU:
    How crazy do cooling and PSU requirements get once you start doing heavy AI workloads 24/7?

  4. Linux vs Windows:
    Do most serious AI people just use Linux at this point?
    Is Windows still okay for heavy AI work?

I’d really appreciate recommendations from people actually doing AI locally instead of generic gaming-PC advice.

If you were building the best possible AI workstation around $10k today, what exact parts would you choose and why?

reddit.com
u/Mission_Objective163 — 8 days ago

If I'd ever win lottery, no one would know. But there will be signs!!

Who else is thirsty for beefy server GPU to test AI models locally?

u/Medical_Ask_6169 — 9 days ago
▲ 2 r/LocalAIServers+1 crossposts

Would indie devs be interested in affordable GPU compute? (Validating demand before I build anything)

Hey folks — I’m exploring an idea and wanted to validate demand before I spend any money.

I’m considering setting up a small, privacy‑friendly GPU node for indie devs, tinkerers, and people running local LLMs. Before I invest in hardware, I want to see if this is something the community would actually use.

Hardware I’m looking at:

- 8× Tesla P100 (16GB SXM2)

- Great for fine‑tuning, inference, agent hosting, and experimentation

- Enterprise chassis with proper airflow and cooling

Network:

- 1 Gbps FTTH (symmetrical)

- Low latency, stable

- Can upgrade to a dedicated line if demand grows

This is NOT a sales pitch.

I’m not selling anything right now. I’m just trying to understand whether indie devs would find this useful before I commit to the build.

If this existed, would you be interested in renting access?

If so, I’d love to hear:

- What workloads you’d run

- How often you’d use it

- What pricing feels fair

- Whether you prefer hourly or monthly

- Any deal‑breakers or must‑haves

I’m aiming for something affordable, predictable, and privacy‑first — something between “local GPU” and “CoreWeave pricing.”

Again, not launching anything yet. Just validating demand before I build it.

Appreciate any feedback.

reddit.com
u/TymasX — 6 days ago

Seeking Recommendations: $1400 AI Research Workstation (Training from Scratch, NLP/CV)

Hi Everyone,

I'm working with a tight budget of $1300–1400 to put together a dedicated workstation for training AI models from scratch, focused on research tasks in NLP and Computer Vision. My current plan is to start with a used Tesla V100 32GB, but I'm open to suggestions if there's a better value option for experimental/research workloads within this price range.

Primary use case:

- Training small-to-mid-sized models from scratch (not just fine-tuning)

- Research-focused experiments in NLP and CV

- Occasional inference, but training throughput and VRAM capacity are the priority

- Budget-conscious setup (academic/research context, not enterprise)

Current thinking:

- GPU: Tesla V100 32GB (leaning towards used/refurbished)

- CPU: Undecided — need something that won't bottleneck PCIe throughput or data preprocessing

- Motherboard/RAM: Open to recommendations; planning 64–128GB RAM to handle large datasets

- Storage: NVMe for datasets/checkpoints (already covered)

Is the V100 32GB still a sensible starting point for research training in 2026, or would you recommend saving for a used RTX 3090/4090 or professional card like A100/A40?

What CPU/platform would pair well without over-investing? (e.g., Ryzen 9 7950X vs. Threadripper vs. used Xeon)

Any motherboard/chassis considerations for GPU cooling and PCIe lane allocation when running a single high-end accelerator?

For research workflows: is 32GB VRAM enough to experiment meaningfully with transformer-based NLP or vision models from scratch, or should I prioritize VRAM over raw compute?

I'm not chasing SOTA training speeds. Stability, reproducibility, and the ability to iterate on architecture experiments matter more. Also happy to consider dual-GPU setups down the line if the platform supports it.

Thanks in advance for any insights!

reddit.com
u/vonexel — 8 days ago

I built a zero-VRAM speculative decoding engine that runs 1.2x faster on consumer GPUs — no second model needed

Hey everyone,

I've been working on a speculative decoding engine called Structspec that makes local LLMs generate code faster without needing a second model in VRAM.

The idea is simple: instead of loading a draft model, it mines token patterns from a code corpus and combines them with syntax-aware rules (indentation,

brackets, keyword transitions). These propose draft tokens that get verified in a single pass against the real model.

Tested on Qwen2.5-Coder-7B with an RTX 4050:

- ~1.2x wall-clock speedup

- 100% draft acceptance on some prompts

- Zero extra VRAM used

The part I'm most excited about is something I called SymbolicMotifCache — it abstracts code patterns across variable names. So `current = current.next`

and `node = node.left` get recognized as the same underlying pattern. I think this could be useful beyond just code generation but I'm still figuring out

the limits.

I have a few ideas to push this further — better pattern generalization, support for more languages, and combining this with quantization-aware

techniques. Still learning a lot about the inference optimization space.

If this sounds interesting, a star on the repo would mean a lot — I'm a student trying to build up my portfolio and every bit of visibility helps.

Repo: https://github.com/neerajdad123-byte/zero-vram-spec

Would love to hear feedback or suggestions. Happy to answer any questions about how it works.

https://reddit.com/link/1tdspq2/video/tgyh0i8h7a1h1/player

reddit.com
u/PangolinLegitimate39 — 7 days ago
▲ 3 r/LocalAIServers+1 crossposts

How much storage do you need to hoard models locally?

Hi All, I'm wondering how many TBs of storage you would fill with locally saved LLMs if you thought they would become unavailable online for download. I'm thinking about both large and small models, like a snapshot of the best of everything there is available online right now. Could be for coding, for writing, or for automation/robotics. Assuming that you also have the hardware to run models of any size, what's in your bugout load out if the grid goes down?

reddit.com
u/somebodys-something — 10 days ago
▲ 6 r/LocalAIServers+2 crossposts

I’m building Kimari Local AI: an open-source toolkit for running LLMs locally on older NVIDIA GPUs

I’m building Kimari Local AI, an open-source toolkit focused on running local LLMs on older consumer NVIDIA GPUs like the GTX 1060 6GB and GTX 1080 8GB.

The goal is not to claim magic performance or pretend an old GPU can compete with modern hardware.

The goal is more practical:

  • make local AI easier to run on hardware people already own
  • provide sane GPU profiles for limited VRAM
  • support GGUF models through llama.cpp + CUDA
  • expose a local OpenAI-compatible API
  • add CLI tools for setup, diagnostics, benchmarking and model compatibility
  • keep everything local: no cloud, no subscriptions, no telemetry

Current status: v0.1.57-alpha.

What works today:

  • CLI commands like doctor, start, status, bench, fit, optimize and pull
  • llama.cpp runtime support with CUDA acceleration
  • local OpenAI-compatible endpoint
  • KimariFit scoring concept: useful intelligence per GiB of VRAM
  • GPU profiles for old cards
  • Open WebUI / Continue / local agent integration plans
  • Hugging Face presence with a demo/checker Space and compatible GGUF model collection

Important clarification:

Kimari is currently the framework, not the final model.

Kimari-4B is planned and under development, but no public weights, adapters or official GGUF files are released yet. For now, Kimari is designed to run compatible existing GGUF models locally.

I’d appreciate technical feedback, especially from people running local models on older hardware.

u/SnooMarzipans9093 — 9 days ago
▲ 11 r/LocalAIServers+4 crossposts

ZERO-VRAM-SPEC Which speeds up 1.3X in code genarationg without taking any extra vram

https://github.com/neerajdad123-byte/zero-vram-spec
I replaced draft model entirely with a python rule based AST predictor which seems working well in predicting grammer forced tokens and also indentations

While doing this project i learnt many things about implementation of all types of spec decoding and also
how tokens work and everything about MTP(multi token prediction) and many things

Looking up for an intenship
passion is to build things
Leave a star for me it would be very much helpful to me

u/PangolinLegitimate39 — 8 days ago

Mi50 16GB or V100 16GB?

Hey everyone! I'm checking out GPU market for a local LLM. I'm interested in the mi50 16GB and the v100 16GB (the 32GB versions of both GPUs are unjustifiably expensive).

Here’s what I’ve noticed while researching the topic:

V100 - the "safe" option that just works. But there's a catch: it's SXM2, so you need to buy a PCIe adapter + cooling. Ideally, you could mount cooling from a 5090-4090 (or something simpler), and then you can probably forget about overheating.

The only downside is that everything will cost more, but it'll work fine if you set it up right.

mi50 - in terms of specs, it's better than the v100, but I see some serious (in my view) problems:

- Different BIOS versions that need to be installed depending on task. Like using the Radeon VII BIOS to make it work in consumer motherboards, but sellers usually sell them already flashed, so that shouldn't be an issue.

- "Insufficient multithreading" - https://www.reddit.com/r/LocalAIServers/comments/1koltfb/comment/mt1ihpe/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button - the commenter is likely talking about vLLM.

- Old ROCm - requires some tricks with .env (which isn't a problem), but if you need anything beyond LLM inference (for example, if you want to fine-tune a model), then big problems start to arise. With the v100, these issues are much less frequent (CUDA, after all).

On the plus side, the mi50 is cheaper than a bare v100 SXM2 (and the mi50 comes with a heatsink and PCIe by default).

Also, a downside for both is the lack of flash-attention-2 support, which means newer models might just not work (though it's unclear if they won't work in vLLM or llama.cpp).

So the question remains: knowing these nuances, which is the better choice? Keeping in mind that I'll likely buy several GPUs.

reddit.com
u/CommonResearch3314 — 12 days ago

I switched fully to local AI for a week — something changed

I stopped using cloud AI tools entirely for the past week.

Everything now runs locally.

What surprised me wasn’t performance — it was how my workflow started changing in unexpected ways.

Feels like we’re closer to personal AI stacks becoming normal than I thought.

Has anyone else fully committed to local setups

reddit.com
u/Classic-Space-5705 — 11 days ago

On-premises enterprise AI coding deployment is harder than vendors say and easier than IT teams fear

Done on-premises enterprise AI coding deployments at three different organizations. The gap between vendor documentation and operational reality is consistent enough to write up.

What vendors undersell is that the initial model selection and sizing is more consequential than they imply. The model that produces acceptable inference latency for 50 developers on your hardware may produce unacceptable latency for 200. Getting sizing right before committing to hardware is genuinely difficult and vendor estimates are optimistic. Context engine configuration is also more work than "connect it to your repos" on complex enterprise codebases.

What IT teams overestimate is the ongoing operational overhead. Once the deployment is stable it's much lower than most internal teams expect. It's infrastructure maintenance. The tools designed for enterprise AI coding deployments have admin interfaces that don't require deep AI expertise to operate. The things that go wrong are things IT teams already know how to handle.

The organizations that struggle with on-premises AI coding are the ones that either chose hardware before understanding real sizing requirements or tried to do it without someone who's done a deployment before owning the initial configuration.

reddit.com
u/Major-Language8609 — 10 days ago