u/Interesting_Arm_7250

Image 1 — Pi Agent makes very nice combination with limited hardware. Running qwen3.6 35B A3B IQ4 at ~22t/s with 160k context on 6 vram 64 RAM.
Image 2 — Pi Agent makes very nice combination with limited hardware. Running qwen3.6 35B A3B IQ4 at ~22t/s with 160k context on 6 vram 64 RAM.
▲ 78 r/Qwen_AI+1 crossposts

Pi Agent makes very nice combination with limited hardware. Running qwen3.6 35B A3B IQ4 at ~22t/s with 160k context on 6 vram 64 RAM.

Some days ago I shared some findings regarding running qwen 3.6 in this repo https://github.com/igpdev/rtx4050-local-llm-qwen3.6-35B in case would help someone.

(Post copied from original llamallm as here is no option to reshare from other community)

After some tweaks playing around with llamacpp flags, found this config that allows quite nice and usable workflow with qwen 3.6 35B with 160k context using Bartowski IQ4_NL version

The key here is Pi Agent with its simplicity and small context, I did a small exercise with a prd document asking to build a simple habit tracker using nuxt framework and sqlite, and playwright for e2e testing.

It clearly does the job faster than wen using Opencode, (Yes, opencode is still usefull too, but with the limited speed regarding the setup, Pi feels very fluid). it made the right call tools to setup everything including the playwright e2e testing framework.

Pi agent is for local setups with small vram and some usefull RAM what Linux to old laptops. It can provide you with a very decent agentic workflow knowing how to define clear tasks. To make it simple, I just made the pi system prompt to be as silent as possible, given that I also prefer a ralph loop process that do not need verbosity but just to fullfill the goal.

Of course I have to admit is not oriented for users not understanding what they are doing, can be dangerous given its yolo default mode. I feel is oriented to users that love the neovim/emacs customization philosophy.

In case someone is interested or has suggestions here is the flags: ____

TURBO_LAYER_ADAPTIVE=1 llama-server \

-m ~/models/Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf \

--host 0.0.0.0 \

--port 8084 \

-ngl 999 \

-c 160000 \

-n 8192 \

-b 2048 \

-ub 2048 \

--cont-batching \

--threads 12 \

--threads-batch 16 \

--prio 2 \

--poll 50 \

--cache-type-k q8_0 \

--cache-type-v q8_0 \

--flash-attn on \

--cache-prompt \

--cache-reuse 512 \

--ctx-checkpoints 10 \

--n-cpu-moe 999 \

--temp 0.6 \

--min-p 0.05 \

--top-k 40 \

--top-p 0.95 \

--repeat-penalty 1.05 \

--jinja \

--reasoning auto \

--reasoning-budget 8192 \

--no-mmap

____

And same disclaimer. I am not an expert, I just keep experimenting pushing to the limit that low spec machine. One really starts to learn a lot when going local.

u/Interesting_Arm_7250 — 3 days ago

Pi Agent makes very nice combination with limited hardware. Running qwen3.6 35B A3B IQ4 at ~22t/s with 160k context on 6 vram 64 RAM.

Some days ago I shared some findings regarding running qwen 3.6 in this repo https://github.com/igpdev/rtx4050-local-llm-qwen3.6-35B in case would help someone.

After some tweaks playing around with llamacpp flags, found this config that allows quite nice and usable workflow with qwen 3.6 35B with 160k context using Bartowski IQ4_NL version

The key here is Pi Agent with its simplicity and small context, I did a small exercise with a prd document asking to build a simple habit tracker using nuxt framework and sqlite, and playwright for e2e testing.

It clearly does the job faster than wen using Opencode, (Yes, opencode is still usefull too, but with the limited speed regarding the setup, Pi feels very fluid). it made the right call tools to setup everything including the playwright e2e testing framework.

Pi agent is for local setups with small vram and some usefull RAM what Linux to old laptops. It can provide you with a very decent agentic workflow knowing how to define clear tasks. To make it simple, I just made the pi system prompt to be as silent as possible, given that I also prefer a ralph loop process that do not need verbosity but just to fullfill the goal.

Of course I have to admit is not oriented for users not understanding what they are doing, can be dangerous given its yolo default mode. I feel is oriented to users that love the neovim/emacs customization philosophy.

In case someone is interested or has suggestions here is the flags:
____

TURBO_LAYER_ADAPTIVE=1 llama-server \

-m ~/models/Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf \

--host 0.0.0.0 \

--port 8084 \

-ngl 999 \

-c 160000 \

-n 8192 \

-b 2048 \

-ub 2048 \

--cont-batching \

--threads 12 \

--threads-batch 16 \

--prio 2 \

--poll 50 \

--cache-type-k q8_0 \

--cache-type-v q8_0 \

--flash-attn on \

--cache-prompt \

--cache-reuse 512 \

--ctx-checkpoints 10 \

--n-cpu-moe 999 \

--temp 0.6 \

--min-p 0.05 \

--top-k 40 \

--top-p 0.95 \

--repeat-penalty 1.05 \

--jinja \

--reasoning auto \

--reasoning-budget 8192 \

--no-mmap

____

And same disclaimer. I am not an expert, I just keep experimenting pushing to the limit that low spec machine. One really starts to learn a lot when going local.

u/Interesting_Arm_7250 — 4 days ago
▲ 98 r/Qwen_AI+2 crossposts

Running Qwen 3.6 35B on an RTX 4050 6VRAM and 64RAM (15 to 25 t/s). Sharing my config in case some with similar specs wants to try.

I have been playing around with some configs on this laptop I have for agentic coding. I am a dev mostly using spec driven development and I still like to control what I do but I use a lot agentic workflows, not paying LLM suscriptions given that usually cloud free tiers are enough to pivot between them, so I wanted to try some llamacpp configs to see how much I was able to push this RTX4050.

Thanks to some nice suggestions in LocalLlama and projects like little-coder, I gather some flags and notes that I tried and finally some of those worked on this small machine.

The results: quite nice in fact, is not the fastest thing one would expect but I have been using qwen coder cli with this local model, and agentic workflow works if well delimited and clearing context often.

When I ask something complex like debug some error, or a big task, my approach is to just ask qwen to write the plan in plan.md (not plan mode) then do a /clear context and ask to do plan.md, this way context is not a constant issue. Some average times expected are 4 to 5 minutes so is really good, considering I do not have to worry about the token$. In the screenshot I just share some one shot test by requesting to do a website for book registration. Nothing complex, but yet complex to measure accuracy and planning.

So, just dropping the repo with all info notes and drafts I got in these days while doing trial and errors (optimus llama overheated once and laptop went down). So there it is all the info I think would be usefull, but if you have any more questions let me know.

Note: I am not in any way an expert on this, so take it with a grain of salt. My main stack and field is web development but I like to tweak my linux machine, so I do not guarantee idempotency results following this. But this is why I just saved this as a usefull file and wanted to share in case it helps others (if you see some inconsistencies or mistakes, blame the IA :v). Of course suggestions or improvements you spot are welcome also.

https://github.com/igpdev/rtx4050-local-llm-qwen3.6-35B

u/Interesting_Arm_7250 — 13 days ago
▲ 88 r/emacs

This is my emacs.d with organization by plain lisp files by a single module/ folder with 3 simple areas: core, custom and packages.

I had quite some configs with my main stack that is angular and netcore, but now working purely on vuejs and python, so I keep it simple by just adding my vuejs and python configs. (To be honest, I rarely worry today about lsp configs or autocompletion, with new emerging patterns regarding agentic coding).

(First image: GNU Emacs Manual v 24.5 :v)

So, here is the repo for anyone who wants to take a look.

https://github.com/igpdev/dyst-emacs.d

u/Interesting_Arm_7250 — 21 days ago