r/llamacpp

I got pi running fully local on a 4B model — with web search and no API keys
▲ 209 r/llamacpp+2 crossposts

For a while now I've been running pi entirely on my laptop -- unsloth's Gemma E4B on llama.cpp, no cloud, no API keys, nothing leaving the machine. Thinking level, image parsing, KV cache retention -- all working.

What surprised me is how genuinely useful it gets the moment it can search the web.

The tiny extension I published, pi-smart-web-search, adds a web_search tool that needs no API key. It fetches DDG's HTML output, runs it through wreq-js -> linkedom -> Defuddle (inspired by pi-smart-fetch), and then parses the links out of the result.
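To make the "parse the output's links" step concrete, here is a minimal sketch in Python using only the standard library. It is not the extension's code (which, per the post, uses wreq-js, linkedom and Defuddle). The `result__a` class name and the `uddg` redirect parameter are assumptions about what DDG's HTML output looks like, and the sample HTML is made up, so check both against a real response before relying on them.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


def unwrap(href):
    """Return the target URL if href is a redirect carrying it in a 'uddg' parameter."""
    query = parse_qs(urlparse(href).query)
    return query["uddg"][0] if "uddg" in query else href


class ResultLinkParser(HTMLParser):
    """Collect (title, url) pairs from anchors that carry a given CSS class."""

    def __init__(self, link_class="result__a"):  # class name is an assumption
        super().__init__()
        self.link_class = link_class
        self.results = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.link_class in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.results.append((title, unwrap(self._href)))
            self._href = None


def parse_results(html):
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.results


# Made-up sample standing in for a fetched results page.
SAMPLE = """
<a class="result__a" href="//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fdocs&amp;rut=abc">Example Docs</a>
<a class="other" href="https://example.org/ad">Ad</a>
<a class="result__a" href="https://example.net/">Example Net</a>
"""
results = parse_results(SAMPLE)
```

The fetch itself is left out on purpose, so the sketch runs offline and only shows the link-extraction step.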

I wrote the whole setup up end-to-end as a gist -- llama.cpp, the model, a chat-template fix, pi, and the search/fetch tools — so you can reproduce the fully-local flow yourself. (links below)

I've run Gemma E4B on an M4 MacBook Air (16GB) and on an M2 MacBook Pro (32GB), but haven't tested Linux yet (I don't have a machine with a dedicated GPU), so if you run it there I'd love a report.

I'm not overselling it, just genuinely after feedback on the approach: the DDG scrape, small-model agents, anything you'd do differently.


I call the project 'Humble Pi' :shrug: -- and I look forward to the next 4B model.

Links


CORRECTION: When I made this post, I forgot to include the instructions for the custom jinja template for the E4B model in the gist. The template that ships with E4B drops prior thinking blocks from history, which forces the KV cache to recompute the entire conversation on every turn.

If you've already followed the gist, please switch to the updated llamagemma4b alias and follow the new "Install the E4B reasoning template (one time)" section to download the template — then re-run source ~/.zshrc and restart the server.

TLDR: Using the custom jinja template should improve the speed substantially.
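A toy illustration of why dropping thinking blocks hurts cache reuse, assuming the server can only reuse the part of the cache that matches the start of the new prompt. Words stand in for tokens, and `render` is a hypothetical stand-in for the chat template, not the real Jinja logic; how much llama.cpp actually recomputes depends on the model and the server internals.

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached and the new sequence."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def render(history, keep_thinking):
    """Flatten a chat history into a toy token list (one token per word)."""
    tokens = []
    for role, thinking, text in history:
        tokens.append(f"<{role}>")
        if thinking and keep_thinking:
            tokens += ["<think>"] + thinking.split() + ["</think>"]
        tokens += text.split()
    return tokens


turn_1 = [("user", "", "What is 2+2 ?"),
          ("assistant", "simple addition", "It is 4")]
turn_2 = turn_1 + [("user", "", "And 3+3 ?")]

# What sits in the cache after turn 1: the thinking was generated, so it is there.
cached = render(turn_1, keep_thinking=True)

# Next request, rendered by a template that drops vs. keeps prior thinking.
reused_dropped = common_prefix_len(cached, render(turn_2, keep_thinking=False))
reused_kept = common_prefix_len(cached, render(turn_2, keep_thinking=True))
```

With thinking dropped, the new prompt stops matching the cache at the first `<think>` token, so everything after that point has to be processed again. With thinking kept, the whole cached sequence is a prefix of the new prompt.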

u/joematthewsdev — 8 days ago
▲ 1 r/llamacpp+1 crossposts

I need help to run local Hermes Agent on my rig. llama-cpp self compiled

Hey folks.

For weeks I've been trying to get a "good setup" running for a local Hermes agent.

This is my hardware:
- Ryzen 9 5950X, 48GB DDR4-3600, some NVMe disks, blablabla
- 2x RTX 3080 12GB
- 2x RTX 3090 24GB
- 1x NZXT 1500W PSU
- 1x Corsair 750W PSU

So a quite capable system with 72GB VRAM, or so I thought.

Software:
- Fedora 44 Workstation
- llama-cpp compiled from source
- Hermes AI agent
- local LLM (I do not want any cloud llm, I need my data in my house all the time)

Startup parameters right now (working kinda):

~/Tools/llama.cpp/build/bin/llama-server \
    -hf llmfan46/gemma-4-31B-it-uncensored-heretic-GGUF:Q8_0 \
    --mmproj-offload \
    --host 0.0.0.0 --port 8080 \
    --reasoning-budget 4000 \
    --reasoning-budget-message "... thinking budget exceeded, let's answer now." \
    --jinja \
    --chat-template-kwargs "{\"preserve_thinking\": true}" \
    -c 256000 \
    -np 1 \
    -ngl 99 \
    -t 16 \
    -b 2048 -ub 1024 \
    -fa on \
    -fit on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --no-warmup \
    --slot-prompt-similarity 0.1 \
    --cache-prompt --no-context-shift \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.1 \
    -ts 2.2,2.2,0.9,0.9 -sm tensor

That runs at about 1000 t/s pp and 30 t/s tg, which is fine with me for a Q8 model with Q8 K and V caches.
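For readers checking the `-ts 2.2,2.2,0.9,0.9` values: llama.cpp treats the tensor split as a list of proportions, so each GPU's share is its value divided by the sum. A small sketch comparing those shares with a split that simply follows VRAM size. The device order (3090s first, then 3080s) is an assumption, since the post does not say how the GPUs are enumerated.

```python
def split_fractions(weights):
    """Normalise a list of proportions so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# Values from the -ts flag in the command above.
ts_fractions = split_fractions([2.2, 2.2, 0.9, 0.9])

# VRAM per card in GB, in the ASSUMED device order: 2x RTX 3090, 2x RTX 3080 12GB.
vram_gb = [24, 24, 12, 12]
vram_fractions = split_fractions(vram_gb)
total_vram_gb = sum(vram_gb)

for ts, vram in zip(ts_fractions, vram_fractions):
    print(f"-ts share {ts:.1%}  vs  VRAM share {vram:.1%}")
```

Under that ordering, the chosen split puts a bit more of the model on the 24GB cards than their VRAM share alone would suggest.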

My problems are these:
- llama-cpp reprocesses the KV cache quite often (every 5 messages or so). This takes 1-2 minutes at ~1000 t/s pp.
- I need to hold hands with gemma4 all the time. It gets a complex task and always reports back after a few minutes instead of just carrying on by itself. I set a /goal and "told it" not to ask back and to work on its own, but it keeps stopping to tell me how great it did and where we are right now.
What I want is for it to run for an hour or so without waiting for me all the time...
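As a rough sanity check on the first problem, the numbers above imply how many tokens get reprocessed each time. This is back-of-the-envelope arithmetic from the figures in the post and assumes the prompt-processing rate stays constant for the whole run.

```python
def tokens_reprocessed(pp_tokens_per_s, seconds):
    """Tokens processed at a constant prompt-processing rate."""
    return int(pp_tokens_per_s * seconds)


PP_RATE = 1000       # ~1000 t/s pp, from the post
CONTEXT = 256000     # -c 256000, from the command above

low = tokens_reprocessed(PP_RATE, 60)     # 1 minute
high = tokens_reprocessed(PP_RATE, 120)   # 2 minutes
share_of_context = high / CONTEXT

print(f"{low}-{high} tokens, up to {share_of_context:.0%} of the context window")
```

A 1-2 minute stall at that rate would mean the server is re-reading tens of thousands of tokens, which looks more like a full-prompt reprocess than a small cache miss.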

I tried the A3B MoE versions; they are fast, but completely unusable for agentic work.
I tried Qwen 27B A LOT, but the KV reprocessing is even worse: a few runs take hours because of constant KV cache reprocessing.

I also tried Qwen 27B on vLLM, on the two 3090s only, with the club-3090 project. That was "fine" but not really good either. Every day, after about 24h, it slowed to a crawl, and I had to restart the vLLM server to get it out of the 0.1 t/s mud hell.

Are my llama-cpp parameters wrong, and is that what causes my "agent asking on every turn" problem? Or what else can I do?

u/OddUnderstanding2309 — 9 days ago
▲ 459 r/llamacpp+1 crossposts

This is amazing. Token speed doubled + KV cache now needs little VRAM - Qwen 27B

Edit: "Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache)."

On the same hardware, generation speed doubled and VRAM usage dropped significantly (from 21GB to 17.5GB) while maintaining full context accuracy.
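To put the VRAM figures in proportion, a quick calculation using only the two numbers quoted above (21GB before, 17.5GB after):

```python
def reduction(before, after):
    """Return (absolute saving, relative saving) between two measurements."""
    saved = before - after
    return saved, saved / before


saved_gb, saved_fraction = reduction(21.0, 17.5)
print(f"saved {saved_gb:.1f} GB, i.e. {saved_fraction:.1%} of the original footprint")
```

So the total VRAM footprint shrinks by roughly a sixth. Most of the remaining usage is presumably model weights, though the post does not break that down.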

YouTube video by fahd --> https://youtu.be/8rTVCRWvRDo?si=MYiVrQQltbSsMAOP

GitHub link --> https://github.com/Luce-Org/lucebox-hub/tree/main/optimizations/kvflash

Quality loss?? --> "Quality verdict (harness ground truth, base-vs-base control included): full results in RESULTS.md. Outputs are not guaranteed byte-identical to the full cache on long generations (the masked kernel path rounds differently — a different deterministic lineage), but correctness is identical: 36/36 vs 36/36 across HumanEval, GSM, MATH, and agent suites."

u/9r4n4y — 14 days ago
▲ 11 r/llamacpp+1 crossposts

I made a lightweight C++ wrapper for llama.cpp

I recently released MemoriaForge, a lightweight C++ wrapper around llama.cpp that aims to make local LLM integration simple and straightforward.

It provides an easy-to-use API for loading GGUF models, managing conversations, injecting context, and generating responses without dealing directly with llama.cpp internals.

Small example:

#include <MemoriaForge.h>

int main() {
    // Create a session from a GGUF model file
    MemoriaForge::LLMSession llm("models/model.gguf");

    // Send one message to the model
    llm.chat("Hello!");
    return 0;
}

The project has just reached version 1.0.0 and I'd love to hear your feedback, suggestions, ideas, or criticism.

Here is the repo: https://github.com/canuconde/MemoriaForge

u/Funny-Assignment-804 — 8 days ago
▲ 3 r/llamacpp+2 crossposts

vllm vs llama.cpp vs ollama vs sglang

What's your take?

As a single developer, do you manage to get workflows that spawn subagents to actually benefit from the parallel-optimized engines?

from:

https://github.com/murataslan1/local-ai-coding-guide/blob/main/guides/runner-comparison.md

Are you a single developer on desktop?
├─ Yes → Do you want simplicity? → Ollama
│        Want fine control? → llama.cpp
└─ No → Running a team server?
        ├─ High throughput needed → vLLM
        └─ Structured JSON outputs → SGLang
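The tree above can be written as a small function, which makes the gaps easier to see. This is just a transcription of the quoted guide, not a recommendation; the function and parameter names are made up for illustration.

```python
def choose_runner(single_dev_desktop, want_simplicity=False, team_need=None):
    """Walk the quoted decision tree and return the suggested runner.

    team_need is "high_throughput" or "structured_json"; it is only
    consulted on the team-server branch.
    """
    if single_dev_desktop:
        # "Want simplicity?" -> Ollama, otherwise "fine control" -> llama.cpp
        return "Ollama" if want_simplicity else "llama.cpp"
    if team_need == "high_throughput":
        return "vLLM"
    if team_need == "structured_json":
        return "SGLang"
    return None  # the tree has no answer for this combination


picks = [
    choose_runner(True, want_simplicity=True),
    choose_runner(True),
    choose_runner(False, team_need="high_throughput"),
    choose_runner(False, team_need="structured_json"),
]
```

Note that the case asked about here, a single developer who wants parallel throughput for subagents, has no branch of its own: the tree sends every desktop user to Ollama or llama.cpp.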

u/rs38 — 12 days ago