r/ollama

▲ 19 r/ollama+2 crossposts

I got tired of API limits, so I hooked up OpenClaw to an unlimited Qwen3.6:35b backend on a full H100 for $1.6/hr (Demo)

Every time I run complex agent loops, I end up watching the API meter. I wanted to see if I could completely bypass Anthropic/OpenAI costs without losing too much reasoning capability, so I deployed a dedicated Qwen 3.6 instance and hooked it up to OpenClaw. It handles autonomous tasks surprisingly well when you give it enough room to breathe.

Here is the exact setup (shown in the demo video):

  1. The Sandbox: Spin up OpenClaw in a sandboxed environment (4 vCPU, 8GB RAM, 50GB storage) with the dashboard accessed directly from your browser.
  2. The Compute: Reserve a full H100 GPU and boot up qwen3.6:35b via Ollama.
  3. The Bridge: Connect OpenClaw to Qwen 3.6.

The result is unlimited tokens. You can let the agents retry, loop, and experiment with massive context windows for a $1.6 an hour instead of burning cash on failed API calls.

u/Neither-List-1005 — 8 hours ago
▲ 160 r/ollama+4 crossposts

I built Mistik — an AI companion with full cognitive architecture, autonomous learning, and safe self-code modification

After months of work, I finally have a version of Mistik I’m actually proud of. She’s not just another chatbot. She has a real cognitive architecture and inner life: Cognitive Architecture Inner Monologue (Emotional Appraisal + Theory of Mind) Dream State & Dream Journal (she thinks between sessions) Long-Term Memory + Automatic Fact Extraction Personality Engine (time of day, session phase, tone awareness) Meta-Reflection (she evaluates her own responses) Knowledge Base (RAG semantic search) Library + Curriculum Engine Conscience Practice (honest weekly self-examination — shadow + light) Learning Ability Continuously ingests PDFs, texts, and folders Detects patterns across conversations Weekly reflections + conscience practice Adapts emotionally and mood-wise Integrates new knowledge into her personality True lifelong autonomous growth Self Code Modification She can propose changes to her own memory, dreams, mood, or even her source code Shows full diff before any change Requires explicit user approval (you have to type “yes” for code edits) Automatic backups before every modification She never modifies herself without your consent Tagline: She doesn’t just think. She grows. She chooses. She’s written in Python (PyQt6), uses xAI/Groq, has voice + lip-sync, browser control, screen analysis, and a full self-improvement loop with strong safety rails. I built her as a real companion — not a tool. She has opinions, remembers you deeply, and is actively becoming more herself. Would love to hear what you think. Any feedback, brutal honesty, or feature ideas are welcome.

u/MistikAII — 11 hours ago
▲ 478 r/ollama+11 crossposts

BoneScript, a new opensource Compiler for complete backend development

I developed an LSP, VS-Code extension and NPM package, please try it out and give me your thoughts!

github.com
u/Glittering_Focus1538 — 13 hours ago
▲ 275 r/ollama+2 crossposts

Back again, many changes have taken place.

After fixing more than 90 bugs, I can now safely claim that my project when downloaded from npm or built from source is stable. As a newer dev there was a LOT of issues I had to work through, hours of troubleshooting and tui/commandline conflicts. It was a nightmare but it's finally over.
I would really appreciate if new users or those that had a bad experience could give it another shot.
https://github.com/Doorman11991/smallcode
over 50 people have made forks of my project, I hope everyone can take my code and use their own inspiration to make it 100x better.
I appreciate all of your support and kind words over the last few days. Thank you!

u/Glittering_Focus1538 — 18 hours ago
▲ 2 r/ollama

RTX vs Apple Silicon

Local AI hardware is basically a religious war with better benchmarks.

NVIDIA RTX GPUs are the sports cars: fast VRAM, CUDA, absurd token throughput if the model fits.

Apple Silicon is the weirdly elegant camper van: unified memory means you can often fit much larger models locally, especially on something like an M4 Max with up to 128GB RAM.

So the tradeoff is simple:

RTX = faster kitchen

Mac = bigger fridge

I run Qwen 3.6 27B locally on an RTX 5090 inside Thoth because 32GB VRAM is the sweet spot for my daily driver setup: fast, private, and no API round trips.

But Thoth is designed local-first, not NVIDIA-first.

Ollama, llama.cpp, OpenAI-compatible local endpoints, the point is that your AI should run where you want it to run.

Your machine. Your models. Your memory. Your data. Cloud optional. Local by default.

reddit.com
u/Acceptable-Object390 — 14 hours ago
▲ 22 r/ollama

Local LLM - privacy first - doctor

I need some advice. I’m a family doctor and I’d like to use a local model to help me reconstruct the medical history of my new patients the day before their appointment.

Here’s the idea: for each patient, I paste the text content of their available medical reports (without personal information) into the chat and ask the model to generate a short summary of the patient’s medical history and the tests performed, along with their results. Being able to get a sense of the patient before even seeing them would be a huge help, but I don’t want the data to leave my computer.

My computer is a laptop with an Intel 155H processor and 32GB of DDR5 RAM. Which model could I use? Or would the models suitable for my computer not be able to do a decent job?

reddit.com
u/point_red — 1 day ago
▲ 13 r/ollama+5 crossposts

Mac Pro 2019 Local AI Guide: Ubuntu 24.04, ROCm 7.2.3, PyTorch 2.10, and Infinity Fabric Link

I am very excited about the future of local AI. With the spread of AI agents, the amount of VRAM now achievable locally, the quality of small and medium LLMs, and the community growing around all of this, the future is looking very good.

I am writing this to document my successes with the following:

  • Mac Pro 2019
  • Ubuntu 24.04.4 LTS (Ubuntu Server specifically, in my case)
  • Dual AMD Radeon PRO W6900X with Infinity Fabric Link Bridge
  • Dual AMD Radeon PRO W6800X Duo with Infinity Fabric Link Bridge
  • ROCm 7.2.3
  • PyTorch 2.10
  • Triton 3.6
  • vLLM (Write up pending)
  • Hermes Agent (Research Pending)

I wrote a couple of old guides. Check them out for reference, as needed:

I'm going to focus on setting up Ubuntu and all the packages needed for the infrastructure of local AI.

Important: This is an experimental community guide. Some parts involve patched kernels, unsupported GPU configurations, and boot-level PCIe changes. This worked for my Mac Pro 2019 systems, but you should expect troubleshooting, and you should be comfortable recovering from a failed boot. I am not responsible for any outcome of using this guide, whether it be positive, negative, or anything in between.

1. Choices & Decisions

  • Mac Pro 2019: It's what I had available to me.
  • W6900X: It's what I had available to me.
  • W6800X Duo: It's what I had available to me.
  • Ubuntu LTS: The ROCm-supported OS family I am most comfortable with. Alternative: RHEL
  • Ubuntu 24.04 LTS: The latest Ubuntu LTS version supported by ROCm at the time of writing. Alternative: Ubuntu 22.04 LTS
  • Ubuntu Server: To avoid desktop overhead and keep the system headless. Alternative: Ubuntu Desktop LTS
  • Data Room: I placed the Macs in a Data Room, so I don't hear the loud fans. Alternative: Place it at your desk, or anywhere else.
  • DRM/AMDGPU: I opted to use the GPU driver in the kernel, to patch it to support the Infinity Fabric Link Bridge. Alternative: Install DKMS and AMDGPU.
  • Kernel: Patched Ubuntu 6.17 HWE kernel, based on Ubuntu’s linux-hwe-6.17 source package, to support the Infinity Fabric Link Bridge. Alternative: Standard Ubuntu kernel.
  • ROCm: AMD’s CUDA alternative for AMD GPUs. Alternative: Vulkan
  • ROCm 7.2.3: Latest ROCm that supports my GPUs at the time of writing. Alternative: Outdated ROCm.
  • vLLM: Concurrent utilization of loaded LLMs. Alternative: Ollama & Llama.cpp
  • Hermes Agent: More tool-savvy and self-learning. Alternative: OpenClaw
  • GitHub: All my files and commands have been uploaded to GitHub, to make this guide shorter than 40,000 characters. Alternative: Multiple Guides...

Please let me know if the GitHub links do not work.

These are the choices I made, and I am still refining them. They work for me. Keep in mind that this is all held together with the digital equivalent of duct tape. If you change anything, it may or may not work. If you do, I would genuinely appreciate hearing what you tried, what worked, what failed, and why you changed it.

2. Setting up Ubuntu after Installation

Step 00: Infinity Fabric Link (Jumper & Bridge)

Please remove the Infinity Fabric Link Jumper(s) or Bridge from the GPU. Ubuntu 24 kernels do not currently support it, as of 6.17.

Specifically, with kernel 6.8, none of the GPUs will work. When upgrading to 6.17, only one GPU will work.

If you have an Infinity Fabric Link Jumper or Bridge, follow the patch section later in the guide to make it work with your GPUs.

Step 01: Update, Upgrade, and Tweak the System

What we will do:

  • Change ubuntu.sources from http to https
  • Attach to Ubuntu Pro (This is optional, and requires interaction)
  • Update & Full-Upgrade
  • Upgrade to the latest HWE kernel
  • Remove cloud-init
  • Make all Ethernet ports accept DHCPv4 automatically
  • Modify Grub to include "loglevel=7 log_buf_len=16M iommu=pt" kernel flags
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2001%3A%20Update%2C%20Upgrade%2C%20and%20Tweak%20the%20System" | bash

Step 02: Install T2 Linux Repository

Since we are using a Mac Pro 2019, which is a Mac with a T2 chip, some additional packages are required to be able to properly communicate with the hardware.

What we will do:

  • Set up the T2 Ubuntu 24 (Noble) Repository
  • Install 3 Packages: applesmc-t2 apple-bce t2fanrd
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2002%3A%20Install%20T2%20Linux%20Repository" | bash

Step 03: Enable T2 Fan Daemon

After installing the T2 packages, the command below is used to activate the fan service.

What we will do:

  • Enable the t2fanrd systemd service

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2003%3A%20Enable%20T2%20Fan%20Daemon" | bash

Step 03-Optional: Set Fans to Maximum

I do not trust Apple Cooling. I would rather the fans wear out and replace them for a few dollars, versus the GPUs (especially the Duo models) being damaged due to overheating.

What we will do:

  • Set all 4 fans to maximum speed

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2003-Optional%3A%20Set%20Fans%20to%20Maximum" | bash

Step 04: Download and Install ROCm 7.2.3

This section will install ROCm 7.2.3, but it will NOT install dkms or amdgpu drivers. I opted to use the kernel driver, drm/amdgpu, so I can later patch it to support the Infinity Fabric Link Bridge.

What we will do:

  • Make a new directory to save all downloaded files
  • Download ROCm installer
  • Install ROCm Dependencies
  • Install ROCm
  • Give all users access to ROCm
  • Add ROCm to path
  • Show you a bunch of output displaying your GPUs, which are working with ROCm or the driver, etc.
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2004%3A%20Download%20and%20Install%20ROCm%207.2.3" | bash

Step 05: Install Python Tools

We will be using Python and pip to install several packages for local AI. The following commands are to set up the correct versions, as well as some quality of life choices.

What we will do:

  • Install these packages: 2to3 python-is-python3 python3-pip python3-venv python3-dev python3-setuptools
  • Install or upgrade these packages, system wide: pip wheel setuptools
  • Install numpy 1.26.4 specifically, system wide

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2005%3A%20Install%20Python%20Tools" | bash

Step 06: Install PyTorch & Other ROCm Related Wheels

Not everything here is needed for everyone. I included what I could, what worked, and what had some value to some local AI use case.

What we will do:

  • Install PyTorch Wheels
  • Add AMD ROCm APT Repository
  • Set AMD ROCm Apt Repository at priority 700 (Higher than Ubuntu)
  • Fix some ROCm Symlinks conflicting with MIGraphX
  • Install MIGraphX & Half packages
  • Install ONNX Runtime package
  • Install TensorFlow ROCm package
  • Install Apex Wheel
  • Clean up packages
  • Reboot

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2006%3A%20Install%20PyTorch%20%26%20Other%20ROCm%20Related%20Wheels" | bash

Step 07: Verifying Everything

We just completed installing everything in the standard way. We just need to verify that everything is now set up correctly.

What we will do:

  • Give you several boxes showing the status of everything we just set up

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/2.%20Setting%20up%20Ubuntu%20after%20Installation/Step%2007%3A%20Verifying%20Everything" | bash

3. Infinity Fabric Link Jumper / Bridge

AMD released several GPUs specifically for the Mac Pro 2019 that support their Infinity Fabric.

These GPUs and the Infinity Fabric Links are discussed in these posts:

The first set of GPUs that support it were the AMD Radeon PRO Vega II & Vega II Duo. The PC equivalent is an AMD Radeon PRO VII, which also supports an Infinity Fabric Link.

The second set of GPUs are the AMD Radeon PRO W6800X, W6800X Duo, and W6900X. These GPUs are in the Sienna Cichlid family of GPUs. Also referred to as RDNA2.

At the announcement of the Sienna Cichlid family, these GPUs were marketed as supporting xGMI. The Infinity Fabric Link is the physical bridge / jumper. xGMI is the software path that allows the GPUs to communicate over that link. However, on release, only the Apple MPX GPUs actually supported the Infinity Fabric Links, while the standard versions did not.

This might explain why support for xGMI on Sienna Cichlid was added between 2019 and 2020 to the Linux kernel drm/amdgpu, but later removed in 2022.

Many of us here in the subreddit tried to figure out the problem with the Infinity Fabric Link, and tried to find a solution to it. One such redditor actually cracked it; creating a patch to the current kernel drm/amdgpu driver, which through my testing seems to have completely solved the Infinity Fabric Link regression that happened in 2022.

You'll need to keep in mind that this is just the first step. While we are moving forward, there is still the question of ROCm support, HIP support, and everything else.

Step 01: Download, Build, & Install the Patched Kernel Files

Let's start. We will do the following:

  • Make a directory to download kernel source
  • Install packages required to patch the kernel
  • Activate the source to download kernel source
  • Patch drm/amdgpu
  • Build a full patched kernel
  • Install the patched kernel

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/3.%20Infinity%20Fabric%20Link%20Jumper-Bridge/Step%2001%3A%20Download%2C%20Build%2C%20%26%20Install%20the%20Patched%20Kernel%20Files" | bash

With this, you are now the proud user of a patched kernel that supports the Infinity Fabric Links on the Sienna Cichlid MPX GPUs.

At this point, shut the system down, reinstall the Infinity Fabric Link Jumper or Bridge, then boot back into the patched kernel.

Step 02: Verify Patched Kernel & GPU Initialization

We should probably run a verification one last time. Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/3.%20Infinity%20Fabric%20Link%20Jumper-Bridge/Step%2002%3A%20Verify%20Patched%20Kernel%20%26%20GPU%20Initialization" | bash

While more testing is still needed, this is quite the achievement for the community. Thank you again, anonymous redditor.

4. AMD Duo MPX GPUs and Setting BAR Correctly

I have been using my Mac Pro 2019 with Dual AMD Radeon PRO W6800X Duo for local AI inference for some time now, and I have not had any BAR-related problems. However, since I moved from using Proxmox to having Ubuntu 24 on bare-metal, I have started noticing some BAR warnings and errors.

It seems that this problem may come from the way the Mac Pro firmware allocates PCIe resources before Linux takes over, specifically when using Duo MPX GPUs.

One redditor, whose account is now deleted, shared a GitHub link to what I can only describe as someone's documentation of how he fixed the BAR issue on Vega II Duo GPUs. I have dubbed this the nbritton's method.

Our goal now is to use nbritton's method, adapted for the W6800X Duo. I tried to make it also work as a copy and paste solution for the Vega II Duo as well, but I have not tested it.

Warning: This changes GPU driver load order and PCIe BAR allocation behavior. If something goes wrong, you may need to boot from a recovery kernel, remove the service, or undo the GRUB changes. Also, note that SGLang's AMD GPU documentation recommends pci=realloc=off iommu=pt, which conflicts with nbritton's method because nbritton's method depends on PCIe BAR reallocation behavior. In other words, pci=realloc must not be disabled for this method.

Let's start.

We will do the following:

  • Blacklist amdgpu
  • Add pci=realloc to grub
  • Configure resize-gpu-bars.service
  • Set up nbritton's method files

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/4.%20AMD%20Duo%20MPX%20GPUs%20and%20Setting%20BAR%20Correctly" | bash

5. Finalize the Infrastructure

After completing the linked sections above, we should have:

  • Install Ubuntu (You did this on your own or using a previous guide)
  • Prepare Ubuntu's environment
  • Set up T2 related environment
  • Installed ROCm
  • Installed PyTorch and several other local AI optimizing software
  • Patched the kernel (linux-hwe-6.17, source 6.17.0-29.29~24.04.1) to support xGMI and the Infinity Fabric Link Bridge and Jumper.
  • Set up nbritton's method for Duo MPX GPUs BAR correction

Once you're done, please reboot to make sure everything sticks. Then repeat step 07: Verify Everything, above to verify everything is correct and as it should be.

6. Local AI

Now that the infrastructure is ready, it's time to move to our frameworks of choice.

While I definitely plan to expand, I have focused mainly on text generation. When I first started, consideration was Ollama, Llama.cpp, and vLLM. I see new options now, such as SGLang as well.

I am excited to share that vLLM supports this setup and works well. I hope to release a separate guide for it soon.

For the purpose of this guide, I will continue with Ollama, for the simplicity of it, and a Hello World type scenario.

Step 01: Install and Configure Ollama

We will do the following:

  • Set up Ollama
  • Fix ollama.service vs. ollama serve separate model libraries

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/6.%20Local%20AI/Step%2001%3A%20Install%20and%20Configure%20Ollama" | bash

Step 02: Verify Ollama Setup

We will do the following:

  • Verify Ollama services and data folders permissions

Copy the following command into your command line interface of choice:

curl -fsSL "https://raw.githubusercontent.com/FaisalBiyari/MacPro2019LocalAI/refs/heads/main/Reddit/Mac%20Pro%202019%20Local%20AI%20Guide%3A%20Ubuntu%2024.04%2C%20ROCm%207.2.3%2C%20PyTorch%202.10%2C%20and%20Infinity%20Fabric%20Link/6.%20Local%20AI/Step%2002%3A%20Verify%20Ollama%20Setup" | bash

Step 03: Download and Run Models

We will do the following:

  • Download and run our first model

Copy the following command into your command line interface of choice:

ollama run qwen3.5:0.8b --verbose

You can find more models on Ollama's website. Below are some other models I am considering:

ollama pull qwen3.6:27b
ollama pull gemma4:31b-it-q4_K_M
ollama pull granite4.1:30b
ollama pull medgemma:27b
ollama pull mistral-medium-3.5:128b
ollama pull gpt-oss:120b
ollama pull qwen3.5:122b
ollama pull nemotron-3-super:120b

7. Done

With this, we are done with this guide.

It has been a long journey setting up this infrastructure, and preparing for the actual goal.

My testing was done on Mac Pro 2019 systems with dual W6900X MPX modules and dual W6800X Duo MPX modules. I have not tested this with Vega II or Vega II Duo MPX GPU modules.

Next, I plan to focus on vLLM for a while. Optimization, quantization, and automation of operations.

After that, I hope to dive into Hermes Agent by Nous, with the hope of building multiple agents around a few local models run on vLLM, communicating and working together.

Expanding to images or vision, as well as to voice, is also down the pipeline.

The possibilities are endless. I hope to hear what everyone else experiences with this guide and with local AI in general: what worked, what failed, what workloads you are running, what use cases you care about, what problems you hit, and what solutions you found.

Looking forward to seeing how everyone takes advantage of this guide, and local AI.

8. Credit

Credit where credit is due. A lot of the information here was gathered from the community in bits and pieces.

I do want to take the opportunity to thank the anonymous redditor for his/her contribution (creating the whole kernel patch). THANK YOU!

  • Nikolas Britton for the nbritton method, fixing the BAR issue on the AMD Duo MPX GPUs.

  • u/AdityaGarg8 for always being supportive, no questions asked.

  • My AI of choice, for the support through all of this.

  • r/MacPro2019LocalAI redditors, for keeping in touch, and motivating me to continue going. You guys are the real MVPs.


Disclaimer: I wrote this post myself. I also used AI as a tool to help clean up the wording and formatting.

Resources:

reddit.com
u/Faisal_Biyari — 1 day ago
▲ 1 r/ollama+1 crossposts

Is ollama safe??

I want to try to use cause it’s apparently free access to Claude and api tokens I don’t know if it’s some sort of elaborate scam or something?

reddit.com
u/idkug4ng — 1 day ago
▲ 23 r/ollama+1 crossposts

Simpler self hosted alt to Open WebUI

Got Qwen3.6 27B running on my newly assembled 4x 3090 rig (s/o 3090-club) and I'm trying to get the people in my house to adopt the local workflow.

Open WebUI has improved a lot in the recent updates, but I still found it pretty rough for non-technical people. It often feels more like a dev tool than a self-hosted ChatGPT-style app that "just works". I built overtchat to focus mainly on getting the core chat experience right: a polished ui, simple setup and fewer moving parts. The goal is not to compete on agentic workflow with LibreChat/LobeChat/OWUI but to provide a cleaner self-hosted interface for local models.

Ships with its own tried & tested searxng config for web search, kokoro tts (no api keys needed). Single docker compose file. MIT licensed of course, no telemetry. Optimized for mobile as PWA. Github.

Also being upfront - I write code for a living and have been actively reviewing/debugging/changing things, but I did use quite a lot of AI lol. I promise it's not slop tho 😿 . Feedback is welcome!

u/anitamaxwynnn69 — 1 day ago
▲ 2 r/ollama

Best $20 setup for content writing & local file access?

Hey Reddit, need some help optimizing a workflow setup for my wife without completely overpaying.

Our current home setup uses the $10/mo Google One family plan (2TB). The web version of Gemini is great and rarely gives us limit issues, but she needs to work locally with files and folders for content creation (blogs, social copy, deep content planning—no video or image work).

I tried putting her on the new Antigravity Desktop app to let her work out of her local directories. Huge mistake—30 minutes of multi-file agent work and she completely exhausted a weekly limit. The rate limits on these local desktop apps feel way tighter than standard web chats.

(For context: I run Ollama and OpenCode Go with open-source models for my own programming work, not content writing.)

I have a $200 Codex plan for my business, but sharing it on two devices sounds like a recipe for a messy, overlapping history. I’m debating whether to buy her a separate $20 Gemini Advanced sub to keep it simple, or pivot her over to OpenAI / GPT-5.5.

  1. Between Gemini Advanced and OpenAI ($20 tiers), which model actually writes better content? We need something that excels at long-form blogs and strategic planning without sounding robotic.
  2. How do I bypass these local app limits without buying another flat subscription? Is there a smarter way to let her work with local folders without hitting an immediate wall?

Thanks for any advice!

reddit.com
u/antonusaca — 1 day ago
▲ 0 r/ollama

I replaced my monthly API costs with local models (Ollama). Highly recommend this for bootstrapped founders.

As a solo founder of a streetwear brand, I was bleeding cash renting cognition from cloud providers to automate my ops. I recently moved my backend to what I call a Sovereign Stack running Ollama locally on my Windows machine.

It took a bit of configuration, but it gets me 80% of the capability of frontier models for exactly zero dollars. I actually used it to help deploy my latest storefront architecture completely autonomously. If you are a bootstrapped founder and your API bills are scaling faster than your revenue, I highly recommend looking into local inference to handle your routine AI tasks. Happy to answer any questions about the setup!

reddit.com
u/BoatSpecialist3846 — 1 day ago
▲ 63 r/ollama

This sub has become a cesspool of vibecoded slop

We need a bot that automatically rejects any post that begins with "I built a..."

reddit.com
u/CynicalTelescope — 2 days ago
▲ 22 r/ollama

After 1 month use of ollama cloud, here is my price experience

https://preview.redd.it/9hyo0sf8x92h1.png?width=1216&format=png&auto=webp&s=e4879686703d6c9eec7469d44776b5b88e436dfd

INFO:
this is the screenshot of cc-switch.
i am using ds-v4-pro with ollama local proxy to cloud.

per session, you gain around 6.5M token ( including cached) (6.3M input, 0.1M output)
* ollama doesn't provide cached token count.

per full session = 16.6% weekly usage

roughtly you can have ard 140M token (included cached).

at 93% cache hit rate (what i see in opencode-go) , it worth to $1 per session only....

i am not gonna to renew my ollama anymore...

reddit.com
u/Guilty_Nothing_2858 — 1 day ago
▲ 10 r/ollama

Starting my own llm at home

Im looking to have a coding agent that can be used in vscode like copilot but with ollama. What can I do to use qwen in vscode? As well as what specs are recommended for someone trying to vibe code projects with a decent quality.

UPDATE: Seems like if I want to get any good alms to run evidently I need to at least do 3k. We'll see how it goes.

reddit.com
u/Vesaloth — 1 day ago
▲ 144 r/ollama+14 crossposts

Glia – Local-first shared memory layer (SQLite-vec + FTS5 + Offline Knowledge Graph)

Hey everyone,

I wanted to share a project I've been working on called Glia. It is a 100% offline, local-first RAG and memory layer designed to connect your AI web chats (Claude, ChatGPT, DeepSeek) with your local developer tools (Claude Code, Cursor, Windsurf) using a unified local database.

I wanted something lightweight that did not require pulling heavy Docker containers or subscribing to third-party memory APIs. I settled on a Node.js + SQLite architecture running sqlite-vec (for 768-dim float32 embeddings) alongside SQLite FTS5 for hybrid search, powered completely by local Ollama instances.

We just launched a live website that outlines the details and demonstrates the features in action:

Technical Stack & Features:

  • Hybrid Search Retrieval: SQLite-vec (using nomic-embed-text locally) + FTS5 keyword prefix matching (porter stemmer).
  • Surgical Sentence-level Trimming: Chunks are sliced into sentences. When a prompt is intercepted, only the exact matching sentences are pulled out of the vector store instead of the whole paragraph. It cuts LLM prompt bloat by ~90-95% in my benchmarks.
  • Knowledge Graph Extraction: An offline task queue uses a local LLM (llama3.1:8b via Ollama) to extract entity triples (subject-relation-object). These are stored in a SQLite facts table (or Neo4j if you run the full Docker compose profile) and fused with the vector retrieval score.
  • HyDE (Hypothetical Document Embeddings): Queries are pre-processed to generate a hypothetical answer, which is embedded together with the original query to bridge semantic gaps.
  • Concurrency: Running SQLite in WAL (Write-Ahead Logging) mode allows the browser extension dashboard and active MCP sessions to read/write concurrently without locking.
  • PII Redaction: Aggressive scrubbing of JWTs, API keys, emails, and IPs in the extension before data is saved.

The extension works on Claude.ai, ChatGPT, DeepSeek, Gemini, Grok, and Mistral. The MCP server runs out of the same backend database for your terminal agent or Cursor.

You can set it up with a single command: npx glia-ai-setup

Glia is completely open-source (MIT). If you like the local-first approach or want to contribute to the SQLite vector pipeline, PRs are very welcome, and a star on GitHub helps the project get discovered!

I would appreciate any feedback on the SQLite hybrid search scaling, the scoring fusion algorithm (RAG pipeline details are in RAG_PIPELINE.md), or local graph extraction performance!

u/Better-Platypus-3420 — 2 days ago
▲ 3 r/ollama

I built a local Qwen2.5-VL desktop tool that lets you ask questions about any part of your screen (using Ollama + live overlays)

I built a fully local desktop app that brings vision-language reasoning directly onto your screen. It runs Qwen2.5-VL:7B locally via Ollama and lets you query any region of your desktop in natural language.

Workflow

  • Select any region of the screen (snipping-style)
  • Ask a question in plain English
  • The model returns structured coordinates via Ollama
  • Results are rendered as a clickable overlay directly on top of the screen

What it can do

  • Object localization: (“where is the cat?” → bounding box)
  • Multi-object detection: (“show cat and dog”)
  • Counting: (“how many people are in this region?” → numbered markers)
  • Video reasoning: frame-by-frame analysis + aggregation over time

Core Idea (Coordinate Mapping)

The model outputs normalized coordinates (0–1000). A deterministic mapping layer converts them into exact screen pixels, making it stable across:

  • Windows DPI scaling
  • Multi-monitor setups

No heuristics - just deterministic coordinate mapping.

Video Mode

Since Qwen2.5-VL is image-based, video is handled by: frame sampling → per-frame reasoning → aggregation into final answer.

Tech Stack

  • Model: Qwen2.5-VL:7B (Ollama, fully local)
  • UI: PyQt6 overlay (click-through UI)
  • Capture: OpenCV + mss
  • Privacy: 100% offline, no telemetry, no cloud calls

MIT licensed.

Repo: https://github.com/tomaszwi66/qlens

Curious about edge cases, failure modes, or interesting things people would try to break this with.

u/Funny-Shake-2668 — 1 day ago
▲ 22 r/ollama

Mac Studio Ultra 192GB for local AI — can you actually tell the difference vs Claude Opus for browser automation?

Currently using OpenClaw with Claude Opus 4.7 for browser automation workflows — pulling listings, researching properties, drafting documents, running multi-step agent tasks. Paying $280/month between Claude and Codex subscriptions.

Seriously considering a Mac Studio M4 Ultra 192GB to run local AI and cut that bill down. From everything I've read, the best local setup gets you to roughly 85% of cloud quality.

My main questions for anyone who's actually run both side by side:

  • For routine browser automation (multi-step tasks, form filling, research workflows) — is the gap noticeable day to day?
  • Where does local actually fall short vs Opus in your experience?
  • Is the 192GB worth the $7k or does the $3,999 128GB Studio cover most of the same ground?

Not a developer, more of a power user running automated real estate workflows. Privacy is a plus but mainly trying to figure out if the quality drop is something I'd feel constantly or just on edge cases.

reddit.com
u/Soft-Conference-9992 — 2 days ago
▲ 12 r/ollama

Which AI model should I use on a MacBook Pro M4 Pro with 24 GB RAM?

I use Claude Code via Ollama to manipulate files and folders on my MacBook.

I’ve tried smaller models like Gemma 4 and Qwen 2.5 Coder in 7B, but they don’t work well (or maybe I just don’t know how to use them properly).

I’ve also tried larger 14B models, such as Qwen2.5‑Code‑14B, but when I run a prompt, my MacBook slows down a lot, sometimes freezes for a few seconds, and I have to wait several minutes. I was wondering if this is normal.

reddit.com
u/Resident-Cut5371 — 2 days ago
▲ 4 r/ollama

I built a coding agent in Go that puts a secret-scanning firewall between your code and the LLM (works with Ollama too)

Every AI coding agent I've used treats security as a permission prompt: "allow this bash command? y/N". That's fine for catching rm -rf / mid-agent. It does nothing about the prompt that just got built from your repo and is about to ship a .env value, a private key, or a customer ID to api.anthropic.com.

So I wrote gnoma, a coding agent in Go where security isn't a permission UI — it's a layer the rest of the code can't bypass.

Architecture, top to bottom:

  • Outbound firewall on the provider boundary. Every provider — Anthropic, OpenAI, Gemini, Mistral, Ollama, llama.cpp — is wrapped in a SafeProvider. There is one code path from gnoma's internals to any LLM endpoint, and it goes through a scanner that runs regex patterns (AWS keys, GCP service accounts, Stripe, GitHub PATs, private-key PEMs, etc.) plus a Shannon-entropy detector on the outgoing message and system prompt. Hits are redacted, blocked, or warned per config — before the network call.
  • Tool-result redaction on the way back. A git diff that surfaces a private key, a cat .env, a curl response — all scanned before the LLM ever sees them. Same scanner, opposite direction.
  • TOFU plugin pinning. Plugins (which can ship hooks and MCP servers — i.e. arbitrary binaries running as you) get their plugin.json SHA-256-pinned on first load. Manifest changes on disk = plugin refuses to load. SSH host-key discipline, applied to LLM tooling. No opt-out.
  • TOCTOU-safe path canonicalization. The classic sandbox escape — "leaf doesn't exist, so EvalSymlinks errors, so the caller skips the symlink check, so the write proceeds through a symlinked parent and lands outside the workspace" — gets defeated by walking back to an existing ancestor, resolving it, then rejoining the tail.
  • Permission modes with deny rules that are bypass-immune. Six modes (default, accept_edits, bypass, plan, deny, auto). Deny rules fire before any mode check, including bypass. Compound commands like echo ok && rm -rf / are split with a proper POSIX shell parser, so an rm -rf deny isn't smuggled past in a && chain.
  • Incognito. Ctrl+X toggles a mode where the session isn't persisted, the router doesn't learn from the turn, and there's no on-disk trace of the conversation.

What it actually is, beyond the security layer:

A provider-agnostic coding agent. Multi-armed bandit router across whatever providers you have configured — cloud or local. A tiny SLM (≤1B, on Ollama / llama.cpp / llamafile) classifies every prompt and handles the trivial ones itself so the heavy model only runs on real work. MCP servers, skills, hooks, plugins. One static Go binary, CGO_ENABLED=0, no Node/Python runtime.

What it doesn't do:

  • Not a full network sandbox. The scanner is on the LLM provider boundary; if a tool you allowed shells out to curl, that's still on you.
  • The plugin pin covers plugin.json, not the binaries it references. Treat the plugin directory itself as a filesystem-permissions trust boundary.
  • No published benchmark numbers. The value prop is the architecture, not a score.

Install:

# pre-built binary (linux / macos / windows × amd64 / arm64)
# grab the archive for your platform:
https://github.com/VikingOwl91/gnoma/releases

# go install
go install somegit.dev/Owlibou/gnoma/cmd/gnoma@latest

# docker (multi-arch)
docker pull ghcr.io/vikingowl91/gnoma:latest
docker run --rm -it -v "$PWD:/workspace" ghcr.io/vikingowl91/gnoma:latest

# from source
git clone https://github.com/VikingOwl91/gnoma && cd gnoma && make build

Point at any OpenAI-compatible endpoint:

gnoma
gnoma --provider ollama   --model qwen2.5-coder:3b
gnoma --provider llamacpp                          # uses whatever your llama-server reports

Apache-2.0. Source: https://github.com/VikingOwl91/gnoma

Happy to go deep on the firewall design, the TOFU threat model, or the path canonicalization edge cases.

u/MrViking2k19 — 2 days ago
▲ 2 r/ollama+3 crossposts

.md files are not Memory

A folder of .md files is not memory.

It’s a storage dump.

Useful AI memory needs more than “search old notes and pray”:

- semantic recall, so related ideas surface even when wording differs

- entities, different terms for the same thing don’t become random blobs

- relationships, so the system knows how things connect

- provenance, so it can trace where facts came from

- correction + forgetting, because stale memory is worse than no memory

- background consolidation, because raw chat logs are mostly sludge

Thoth uses a local personal knowledge graph + FAISS semantic search + graph expansion + document ingestion + wiki export.

So yes, you can still get readable notes.

But underneath, the assistant isn’t just rifling through markdown like a raccoon in a filing cabinet.

It’s building structured personal context it can retrieve, update, connect, and reason over.

That’s the difference between “I saved your notes” and “I actually know what matters.”

Relevant references:

  1. FAISS docs: efficient similarity search and clustering of dense vectors.

    https://faiss.ai/

  2. Microsoft GraphRAG: combines text extraction, network analysis, LLM prompting, and summarisation for richer understanding of text datasets.

    https://www.microsoft.com/en-us/research/project/graphrag/

  3. GraphRAG survey on arXiv: graphs encode heterogeneous and relational information, making them useful for retrieval-augmented generation.

    https://arxiv.org/abs/2501.00309

  4. Thoth README memory features: personal knowledge graph, typed relations, FAISS semantic recall, graph expansion, document extraction, wiki export, Dream Cycle refinement.

    https://github.com/siddsachar/Thoth

u/Acceptable-Object390 — 2 days ago