r/Qwen_AI

▲ 23 r/Qwen_AI

What do you use Qwen for?

I think I'm the one who using Qwen as an AI companion 😅 everyone else discussing about local setups, benchmarks and whatnot.

reddit.com
u/Tall_Instance3457 — 17 hours ago
▲ 11 r/Qwen_AI

Does anyone have any tips on how to subscribe to Qwen 3.7?

Hello everyone, how are you? I hope you are all well.

We already know that free trials end quickly and are only meant to prove the model's capabilities.

Where can we sign up for Qwen for development use?

The AI coding plan is practically impossible to subscribe to (and I'm referring to the $50 plan).

Any option (that isn't extremely slow and has a reasonable cost) ?

reddit.com
u/Intelligent-Taste-36 — 18 hours ago
▲ 28 r/Qwen_AI

Local LLM - privacy first - doctor

I need some advice. I’m a family doctor and I’d like to use a local model to help me reconstruct the medical history of my new patients the day before their appointment.

Here’s the idea: for each patient, I paste the text content of their available medical reports (without personal information) into the chat and ask the model to generate a short summary of the patient’s medical history and the tests performed, along with their results. Being able to get a sense of the patient before even seeing them would be a huge help, but I don’t want the data to leave my computer.

My computer is a laptop with an Intel 155H processor and 32GB of DDR5 RAM. Which model could I use? Or would the models suitable for my computer not be able to do a decent job?

reddit.com
u/point_red — 1 day ago
▲ 17 r/Qwen_AI+1 crossposts

The LLM market never stops! The Qwen 3.7 Max preview is coming soon!

I found information on LinkedIn today about the Qwen 3.7 Max preview. I haven't tested the 3.6 preview, but the 3.6 Plus version, while I was programming, is simply excellent!

I got along very well with Qwen 3.6 Plus.

Version 3.7 Plus is also being released.

Is anyone else satisfied with Alibaba's templates?

Qwen 3.7 Preview

Qwen 50$ coding plan vs Claude x20 Max plan Usage

If anyone has used both I would appreciate it if they would tell me whats the difference in usage between them since I cant find anywhere a direct comparison.

reddit.com
u/Aizenvolt11 — 1 day ago
▲ 31 r/Qwen_AI

autodidact – a self-evolving local-first AI agent with Qwen 3.5 8B

I'm pretty passionate about local LLMs and self-learning AI. I've always wondered: why can't an AI agent work like a human? Have a local brain; when asked, think first; if unsure, ask someone smarter (a cloud model, or search); then learn from the answer so next time you don't need to ask.

That's why I have been trying to build autodidact, an open-source AI agent that learns from its cloud queries - the local model handles what it knows, escalates to a cloud model when uncertain, then distills the response into permanent local memory. Next similar query gets answered locally, for free. And the local brain is default to Qwen 3.5 8B.

In a 30-query session on my dev workload: 67% local-or-memory, $0.70 saved vs an all-cloud baseline. The more you use it, the cheaper and faster it gets.

This is just v1.x, which supports documents and codes ingestion through "autodidact learn <path to documents>", and let you chat with both local and cloud models, with a confidence evaluation and routing mechanism to decide the request should be handled local or cloud, and learning mechanism for the local model to learn from every cloud escalation. I planned a lot for v2, which includes tool usage, skills and tools learning etc.

Please try and let me know if the idea makes sense:

Repo: https://github.com/BuffaloTechRider/Autodidact

Install: pip install autodidact

Quickstart: autodidact init && autodidact learn <code or document path> && autodidact chat

https://reddit.com/link/1ti6cf7/video/0o745mxj572h1/player

Happy to answer questions.

reddit.com
u/pavel6490 — 2 days ago
▲ 69 r/Qwen_AI

Qwen 3.7 Plus Preview thinks I'm a time traveler because it doesn't know it's 2026.

So I was testing out the newly released Qwen 3.7 Plus Preview on the chat interface (you can see the model name selected in the top left). I uploaded a MetaTrader 5 screenshot and asked if the chart looked real or fake.

Instead of looking at the candle data, the model's new thinking process went down a hilarious rabbit hole. It caught the date on my taskbar (19-05-2026) and confidently concluded that I went into my Windows settings, unchecked "Set time automatically," and manually set my clock to the future just to mess with it.

It literally gaslit itself because its internal cutoff makes it think we aren't in 2026 yet. Peak LLM behavior. 🤦‍♂️😂

Qwen 3.7 Plus has incredible reasoning skills... unless you ask it what year it is.

u/Mediocre_Roll3073 — 2 days ago
▲ 35 r/Qwen_AI+1 crossposts

The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b

One way I like to test new models, is by one-shoting (with a good prompt) a single webpage clone of the classic arcade game pacman. I usually do 3 attempts and keep the best one. So far all of them, including anthropic, chatgpt and google models, have failed, most of them miserably. The best one until now was GLM 5.1

That was until I tried it with Qwen 3.6 27b F16. Out of 3 attempts, 2 were the best by far, with the top result only having minor errors! However, as soon as I dropped to 8bit quantisation, I could not replicate those good results even after trying 5+ times. This goes to show what I have saying for a long time, based on my experience: there is a world of difference between a 16bit and a 8bit quant, despite most people claiming it is lossless, or nearly lossless.

The results were so good, and since it just happened that I was testing the llama.cpp MTP speculative decoding PR (not yet merged at that time) with my own quants, and developing my own fixed jinja chat template for Qwen 3.5/3.6, I thought why not try to push Qwen 3.6 27b F16 through a proper agentic coding workflow. I think the results were brilliant, and they speak for themselves. You can try the full single page game here:

https://guigand.com/pacman

Lessons learned and observations:

* A good chat template is critical. The official chat template was unusable due to it being only targeted at vLLM, and therefore full of errors in other tools. I started with community templates, which were improvements, but still had many quirks. This is why I started fixing the bugs one by one in the official templates, and slowly improving it. The beginning of the agentic sessions were painful due to many quirks and errors. But slowly it improved, and once I got the template well tuned, it felt like I had unlocked a new level of intelligence in the model.

* MTP speculative decoding does not accelerate all tasks identically. Basically it is most efficient at deterministic task like coding, and least at creative tasks like brainstorming. I wrote about it here: https://www.reddit.com/r/LocalLLaMA/comments/1t9gcar/mtp_benchmark_results_the_nature_of_the/ - For this pacman development, my generative tok/s varied between 8 tok/s and 18 tok/s depending on the task. For reference, without MTP, I get 6.6 tok/s with the same model and quant.

* Not all harnesses are equals both in terms of code quality but also in terms of impact on speed. Most of use already know that the coding harness has a huge impact on quality, with Claude Code being considered the gold standard; this is what I use for normal daily coding. In this case I started with Qwen CLI, mostly because of the chat template problems, on the principle that if there was one harness more likely to better handle Qwen LLM specifics, it would be their own harness. I was actually pleasantly surprised, and Qwen CLI delivered far beyond what I was expecting! In the later stages, I switched back to Claude Code, mostly to verify that the final chat template was working properly there too. I did not notice any improved process or code quality. What I noticed though, is that developing in Claude Code was a lot slower than in Qwen CLI! This is due to all the extra prompts built within Claude Code. With a local model that has such a slow tok/s, it can make the difference between being usable, and between being borderline hair pulling...

* Context management and caching is super efficient in this model. Do not interfere with it. It works great, let it do its thing. Do not use any skill, plugin, etc, that manipulates the cache or context. This will result in confusing the model and making it a lot dumber and error prone.

* Tool calls, context compaction, shell usage, subagents, parallel subagents, work flawlessly. Initially it did not though, and it took me a long time and lots of work to get it right through chat template fixes and improvements. I actually only used context compaction for testing, and it was fine, as usual in Claude Code.

* High context is usable without too much degradation. Maximum context size is 256k tokens I believe. Most of the time I planned the tasks to stay below 100k, but there were a few times I pushed it slightly over 150k. I did notice slightly reduced capabilities, but nothing major. The main reasons why I tried to keep it low is to get the best reasoning capabilities, as with all other models, but also speed started to decrease as the context usage grew.

* Apart from Gemini, this is the first model that impressed me with its audio knowledge. As a composer, musician, psychoacoustic scientist, and audio engineer, I pay a lot of attention to good audio. In this case, I tasked it to do some advanced audio manipulation and creation. All the audio in the game comes from Qwen having programmed the web audio synthesizer in a highly advanced and complex way. This is not midi, not simple wavetables, not samples. It takes into account psychoacoustic properties tuned to human hearing, with the use of harmonics, distorsion, layers, various effects. Truly impressive work. The only exception is the waka-waka sound, for which I had to make it use a sample (the same method was used in the original arcade game).

* I can live with slow token generation speed. I used to think that I needed a minimum of 70 to 80 tok/s for viable development. But this was usable, gave me time to do other things in parallel, and also to better reflect on the agentic tasks. I would probably not use it for large projects, with my current hardware, but for small to medium project, it is definitely acceptable.

If you read until here, let me know what you think, and I hope you enjoy the game.

Dev environment: macOS, apple silicon M2 max, 96GB RAM, llama.cpp server with OpenAI and Anthropic API endpoints.

>Edit: Qwen Code has a default timeout of 8 mins, and a default maximum response size of 8000 tokens. With a slower model., like this one, I was getting frequent timeouts initially. And with large planning/brainstorming/coding sessions, I was occasionally getting the response truncated, which required reprocessing. I solved it my making the following changes to my ~/.qwen/settings.json file:

  "modelProviders": {
    "openai": [
      {
        ...
        "generationConfig": {
          ...
          "timeout": 1800000,
          "maxRetries": -1,
          "samplingParams": {
            "max_tokens": 32768
          }
        }
      }
    ]
  },
u/ex-arman68 — 2 days ago
▲ 31 r/Qwen_AI

qwen3.6 27b int4 does user support tickets better and insanely faster than Q8

Running 3 x rtx 3090's was using Q8 version of 27b on llama.cpp, saw some posts about how fast the autoround int4 version was on 2 x 3090's so I tested it. It is insanely faster and seems to follow my ticket skill/workflow WAY better. the Q8 version seems to think about what its doing and try different things to complete the ticket even though I have a ticket workflow built for it. the int4 autoround version seems to just follow the ticket and do a great job. A 5 min ticket on int4 from a few tests will take the Q8 version 15-20 min sometimes. Does this seem correct for Q8 when it comes to work like doing user tickets?

reddit.com
u/skinnyzaz — 3 days ago
▲ 15 r/Qwen_AI

TurboQuant on 16 GB VRAM

I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute. Currently running 64K context with q8_0/q4_0 KV cache and ~915 MiB to spare.

Tried domvox/llama.cpp-turboquant-hip, but it OOMs at 512 tokens, the fixed overhead from codebooks and lookup tables alone blows past 16 GB. Now that I've freed ~600 MB by switching quants, I have ~1.6 GB headroom before KV allocation.

Anyone found a way to reduce TurboQuant's fixed VRAM cost, or gotten it working on a 16 GB card with a large model? Or is it just fundamentally designed for cards with more headroom?

reddit.com
u/Haunting-Stretch8069 — 3 days ago
▲ 118 r/Qwen_AI

Qwen 3.7

Despite being on thinking mode on default, qwen 3.7 is FAST. Also it CAN process images but it keeps insisting it cant

u/Content_Impress_847 — 3 days ago

I just find out that my PC can support model with 122B but at 2/3bit

As u can see on this picture my pc can support this model is anyone use this model on his local machine and what do you think about it?

I just love to see what my local machine can do with all this

I ask Chat GPT and even Gemini they both told me that this version is so smart but with 2Bit will be slower however for the writing and artistic stuff and prompting it will be like a God mod on my local machine the speed is going to be betwewn 3 to 5 tok/sec is anyone try this model in the lowest quality?

u/MaleficentCrab4672 — 2 days ago
▲ 69 r/Qwen_AI

Qwen 3.7 Plus and Max vs Opus 4.7

Qwen 3.7 plus preview (first one):

https://chat.qwen.ai/s/97d45748-645d-407e-8191-87bd996f8901?fev=0.2.51

Qwen 3.7 max preview (second one):

https://chat.qwen.ai/s/ebb4f625-9df8-4c8f-b03a-701f604a3c39?fev=0.2.51

Opus 4.7 Adaptive Thinking (last one)

Honestly Qwen 3.7 Max seemed to have made the best and most realistic one although it kinda screwed up the dpad placement.

Curious to know if anyone else has compared the new 3.7 series to other SOTA models.

u/Power_spy — 3 days ago
▲ 78 r/Qwen_AI+1 crossposts

Pi Agent makes very nice combination with limited hardware. Running qwen3.6 35B A3B IQ4 at ~22t/s with 160k context on 6 vram 64 RAM.

Some days ago I shared some findings regarding running qwen 3.6 in this repo https://github.com/igpdev/rtx4050-local-llm-qwen3.6-35B in case would help someone.

(Post copied from original llamallm as here is no option to reshare from other community)

After some tweaks playing around with llamacpp flags, found this config that allows quite nice and usable workflow with qwen 3.6 35B with 160k context using Bartowski IQ4_NL version

The key here is Pi Agent with its simplicity and small context, I did a small exercise with a prd document asking to build a simple habit tracker using nuxt framework and sqlite, and playwright for e2e testing.

It clearly does the job faster than wen using Opencode, (Yes, opencode is still usefull too, but with the limited speed regarding the setup, Pi feels very fluid). it made the right call tools to setup everything including the playwright e2e testing framework.

Pi agent is for local setups with small vram and some usefull RAM what Linux to old laptops. It can provide you with a very decent agentic workflow knowing how to define clear tasks. To make it simple, I just made the pi system prompt to be as silent as possible, given that I also prefer a ralph loop process that do not need verbosity but just to fullfill the goal.

Of course I have to admit is not oriented for users not understanding what they are doing, can be dangerous given its yolo default mode. I feel is oriented to users that love the neovim/emacs customization philosophy.

In case someone is interested or has suggestions here is the flags: ____

TURBO_LAYER_ADAPTIVE=1 llama-server \

-m ~/models/Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf \

--host 0.0.0.0 \

--port 8084 \

-ngl 999 \

-c 160000 \

-n 8192 \

-b 2048 \

-ub 2048 \

--cont-batching \

--threads 12 \

--threads-batch 16 \

--prio 2 \

--poll 50 \

--cache-type-k q8_0 \

--cache-type-v q8_0 \

--flash-attn on \

--cache-prompt \

--cache-reuse 512 \

--ctx-checkpoints 10 \

--n-cpu-moe 999 \

--temp 0.6 \

--min-p 0.05 \

--top-k 40 \

--top-p 0.95 \

--repeat-penalty 1.05 \

--jinja \

--reasoning auto \

--reasoning-budget 8192 \

--no-mmap

____

And same disclaimer. I am not an expert, I just keep experimenting pushing to the limit that low spec machine. One really starts to learn a lot when going local.

u/Interesting_Arm_7250 — 3 days ago

So i gave same video (that shows subtitles on screen) to qwen 3.6 27b and qwen 3.5 omni flash but 27b skipped a part while omni dont. I tried three time, got same results. Do anyone knows that why this is happening even if both support videos.

u/9r4n4y — 2 days ago
▲ 26 r/Qwen_AI+3 crossposts

fine-tuning 27B hybrid models on strix halo (ryzen ai max+ 395 / gfx1151, 128 gb unified memory) — full guide, patches, orchestrator

Sharing a guide I just published for fine-tuning 27B+ LLMs on AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S / gfx1151, 128 GB unified memory). MIT licensed.

Repo: https://github.com/h34v3nzc0dex/strix-halo-llm-finetune-guide

None of the individual pieces are novel — kernel patches, ROCm 7.13 nightly, FLA, bitsandbytes, LoRA, llama.cpp. The intersection (Strix Halo + gfx1151 + FLA + Qwen3.5 hybrid at 27B) isn't documented anywhere I could find, and getting it stable took a lot of dead ends I'd rather other people skip.

Stack tested: kernel 6.19.14, PyTorch 2.11.0+rocm7.13.0a20260506, ROCm 7.13 nightly, FLA 0.5.1 patched, bitsandbytes 0.50.0.dev0 built from source for gfx1151, llama.cpp b867+. Hardware: Corsair AI Workstation 300 (Sixunited AXB35-02 board, BIOS 3.07).

Things the guide actually covers that I had to figure out the hard way:

  • PyPI bitsandbytes ships zero ROCm binaries. From-source build with -DROCM_VERSION=83, plus a runtime symlink libbitsandbytes_rocm83.so → libbitsandbytes_rocm713.so so bnb's HIP detection on PyTorch 2.10/2.11 stops complaining.
  • FLA's Triton kernels crash on gfx1151 (RDNA 3.5) with num_warps &gt; 4 (Triton#5609) and a tl.cumsum + tl.sum codegen interaction (Triton#3017). Idempotent re-patch script included.
  • In-process Trainer eval at 27B / 8192 seq length is structurally broken on unified-memory APUs — either kernel TTM page allocation failure from fragmentation, or memory watchdog SIGKILL when free RAM drops under ~8 GB. Eval is moved out-of-process via a bash orchestrator aligned to save_steps, waiting for full GPU release between train and eval, with a JSONL trend log.
  • Mainline kernel .deb run-parts double-dir bug on Ubuntu 24.04+ leaves packages half-configured. Repack script included.
  • /srv perms regressing to 0750 mid-training breaks importlib.metadata path traversal and crashes TRL's create_model_card. Cron watchdog restoring 755.

Verified result: in-progress production fine-tune of Qwen3.5-27B (hybrid, 16 full-attention + 48 GatedDeltaNet layers), bf16 LoRA r=128/α=256, eval rolling at 0.13 loss / 96.5% token accuracy, ~11 min/step, ~4-day total runtime.

Feedback and issues welcome, especially from people on different AXB35-02 boards or non-Corsair Strix Halo systems — I'd like to know what's board-specific vs. generic.

https://preview.redd.it/8i3ebs27h00h1.jpg?width=649&format=pjpg&auto=webp&s=1a4fe453e9e46c97b71a14b993b9536288169ca1

reddit.com
u/Outrageous_Bug_669 — 3 days ago