r/Qwen_AI

▲ 10 r/Qwen_AI

Is there any free tier CLI model? Or is it all gone?

I remember I was using qwen-code a few months ago with a good free tier, but it seems they removed it, now you need to pay for it.

Is there any alternative? Something that integrates with projects and can call tools, MCP etc...

reddit.com
u/JohnDarlenHimself — 21 hours ago
▲ 40 r/Qwen_AI+3 crossposts

Small models fail tool-calling for different reasons — and sometimes it's an upstream chat-template bug, not the model. I built an MLX tool to tell them apart.

Everyone benchmarks tool-calling with one number: "Model X gets 71% of function calls right." That number can't tell you why the other 29% failed — and the "why" is what decides what you do next.

So I built Toolhound, an MLX-native diagnostic that runs entirely on your Mac and attributes every tool-calling failure to one of four causes:

- `framework_template_bug` — the chat template mangled the tool tokens

- `framework_parser_gap` — the model emitted a rescuable call, the framework parser missed it

- `model_format_failure` — the model can't emit a parseable call

- `model_decision_failure` — valid format, wrong tool/args

What surprised me (Qwen2.5-0.5B / 1.5B, Llama-3.2-3B, 4-bit, on an M2 Pro):

- Qwen2.5-0.5B mostly fails on an upstream chat-template bug — Qwen2.5's template renders its tool-call example with doubled braces `{{"name": ...}}`, and the small model copies it literally. That's not the model's fault. args-correct 29%.

- Qwen2.5-1.5B parses fine (96%) but fails on judgment — wrong tool/args. args-correct 71%.

- Llama-3.2-3B formats perfectly, but wrong arg types + false abstentions. args-correct 61%.

Same benchmark, opposite root causes. A plain accuracy score hides that — and the smallest model's failures aren't even fixable by a better model.

Other things it does:

- 95% bootstrap CIs on every metric (temp=0, so no seed hand-waving — the CI comes from resampling the case set)

- Reports attribution under both a strict and a lenient parser, so you can see the verdict doesn't flip

- Quantifies bf16-vs-q4 damage without confounding it with template differences (asserts identical template first)

- v2 benchmarks existing zero-training fixes (PA-Tool is wired in). Honestly, on my demo run PA-Tool didn't beat baseline on any metric — it flags a result "credible" only when its CI is disjoint from baseline's, and it wasn't (it even hurt 1.5B's arg accuracy). I'd rather the tool tell me that than rubber-stamp it.

https://github.com/Code-byte404/toolhound

Feedback very welcome — especially: which models should I add next, and are the abstention "trap" cases too easy/hard? There are `good first issue's if anyone wants to add a model or help file the template bugs it finds upstream.

u/Otherwise_Ship_9782 — 1 day ago
▲ 326 r/Qwen_AI+69 crossposts

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)

Builders-welcome post with the substance up front (disclosure: I'm the maintainer). OmniRoute is a free, MIT, self-hosted AI gateway — one OpenAI-compatible endpoint over 237 providers — built around two problems: runs dying on a provider 429, and tokens bleeding on tool/log output.

One endpoint, 237 providers — 90+ of them free. You point any tool or agent at a single OpenAI-compatible endpoint (localhost:20128/v1) and it can reach 237 LLM providers without you rewriting anything. 90+ have free tiers and 11 are free forever (no card), which aggregates to ~1.6B documented free tokens/month — and that's honest, pool-deduped math (we count each shared pool once instead of inflating it; the methodology is public in the repo). There's a one-command setup-* for 13+ coding tools (Claude Code, Codex, Cursor, Cline, Roo, Kilo, Gemini CLI…), so switching your existing setup over takes seconds.

Fallback combos — so it never stops mid-task. A "combo" is a ladder of models the router walks automatically: your subscription first, then API keys, then cheap models, then free ones. When a provider returns a 500 or you hit a rate limit, it slides to the next target in milliseconds, mid-request, and your tool never even sees the error. There are 17 routing strategies (priority, weighted, round-robin, cost-optimized, auto/coding:fast…) plus three resilience layers — a per-provider circuit breaker, a per-key cooldown, and a per-model lockout — so one dead key can't take down a whole provider.

Fusion — an ensemble mode for the hard steps. Beyond simple routing, there's a fusion strategy that fans a single prompt out to a panel of different models in parallel and then has a judge model synthesize one best answer (mixture-of-agents, built in). It's cost-aware, so easy turns stay on one fast model and it only fuses when the step is worth it.

A 10-engine compression pipeline — the part most routers don't have. Every request flows through a transparent compression pass you can toggle/stack per combo. Instead of one trick, it stacks the best of the open-source ecosystem: RTK filters command/tool output (git diffs, test logs, builds) at 60–90%, Microsoft's LLMLingua-2 does ML semantic pruning, Caveman handles prose, session-dedup strips repeats across turns. Critically, code, URLs and JSON are preserved byte-perfect, and a default-on inflation guard throws the compressed version away and sends the original if compressing would actually grow the prompt — it never makes things worse. On tool-heavy sessions that's ~89% average input-token reduction (an 8k-token git diff becomes a few hundred). Full credit to every upstream project (RTK, Caveman, LLMLingua-2, Troglodita) is in the README.

Agent-native — the agent can drive the router itself. There's a built-in MCP server (95 tools across 30 audited scopes, over stdio / SSE / streamable-HTTP), plus A2A (v0.3, JSON-RPC 2.0) support. That means an agent can query providers, switch combos, read its own remaining quota and manage memory through the gateway — not just consume tokens through it.

It's 100% local (zero telemetry, AES-256-GCM at rest), MIT-licensed, has a prompt-injection guard on every LLM route, opt-in memory, and runs on npm, Docker, desktop or your phone via Termux.

For context on whether it's worth your time: it's grown to ~9.8K GitHub stars, 1,490+ forks and 280+ contributors in ~4.5 months, with 21,000+ automated tests and 1,830+ issues closed — so it's a battle-tested project, not a brand-new experiment.

npm install -g omniroute

GitHub: https://github.com/diegosouzapw/OmniRoute · Site: https://omniroute.online

Would value a critique of the routing/compression architecture from this crowd.

u/ZombieGold5145 — 2 days ago
▲ 54 r/Qwen_AI+13 crossposts

I wanted to learn how coding agents work, so I built one and want to share what I learned

Hey everyone!
I'd like to share a project I've been working on, it's called Orin and it's a coding agent.

I use coding agents constantly, and at some point I realized I had basically no idea what was happening between me hitting enter and code showing up.

Also I was tired of building apps I wasn't able to really debug because I didn't know how they were being built in the first place so I got busy studying: read a bunch of articles, still felt like a black box, so I just tried to build one.

Couple things worth saying before anyone digs in:

It's mostly AI-written code, no point in hiding that, but I don't think "written by AI" and "sloppy" have to go together.

I try to run all my projects in the most professional way I know of, following actual SDLC practices: spec first, then an issue, then the implementation, then a real PR review before anything merges, not vibe-coding where you just accept every diff.

Whether that shows in the actual code is for other people to judge, not me.

Also this isn't some original idea I came up with: I cloned and read through pi.dev, nanocoder, and opencode as primary references (and skimmed Cline/Kilo Code for patterns), and basically tried to take what made sense to me from each and put it into one implementation.

My whole idea was try and build something that took the best from each to make a coding agent that would perform well. I plan to benchmark it on SWE-bench Verified sooner or later, but I don't think it's ready just yet: there are rough edges and bugs, but its usable.

Some of the actual implementation stuff, for anyone who cares about those rather than the pitch:

  • The loop is just: stream a response from the provider, push it to message history, if there are tool calls run them, push the results back, repeat until there's nothing left to call.
  • The loop is completely headless — it doesn't touch the terminal, it just emits events. The TUI (SolidJS on top of OpenTUI, just like opencode) is a separate subscriber to those events. You could swap in a totally different frontend without touching the loop at all.
  • Another thing I got from OpenCode are edits: they go through a fuzzy replacer chain, not a single exact string match — if the model's oldText is off by whitespace or indentation, it falls through a chain of matchers before giving up. I had never thought about this and can confirm it's the kind of thing you don't appreciate until you actually try to implement it.
  • There's a model routing mechanism that switches different models based on what the agent has to do:
    • explore runs on a cheap/fast model by default,
    • implement on a code-tuned model,
    • review on the main model.
  • Another thing I borrowed from the web is a delegate_read tool that lets the main agent hand off read-heavy grunt work (scanning a big file, summarizing logs) to a cheap model so that content never bloats the main context.
    • It's basically a one off LLM call that only returns a distilled summary, seems dumb but works surprisingly well with capable models like Claude who know exactly what to look for and delegate super well to other agents.
  • Tool selection isn't a static allow-list. Every turn runs a BM25 retrieval pass over the full tool catalog (including MCP tools) via a super cool library called Ratel, so the model only ever sees the tools relevant to what it's doing in that specific turn instead of the whole catalog every time. There's even an A/B flag to compare tool_pool=ratel vs tool_pool=default in your own telemetry to see if it even makes a difference (similar to how rtk gain works).
  • Every file write gets snapshotted into a shadow git history before it happens, including stuff done through raw bash — allowing the agent to have a proper /undo /redo command.
  • When I implemented subagents I wanted to explore different isolation mechanisms and ended up with 3 different ones you can configure yourself:
    • shared (edits land on the main working tree, safe because they run serially),
    • worktree (isolated branch)
    • sandbox (a real E2B cloud VM, edits get thrown away on dispose — for code you don't trust at all).
    • The lead model can escalate isolation for a given task but never go below the configured floor.
  • I implemented hooks borrowing from nanocoder and opencode. This allows the agent to be expanded by third party code and I bundled some sensible defaults:
    • there's a before_tool hook that rewrites bash commands through rtk so that command output gets compressed before it ever reaches the model.
  • In my daily work I build AI agents and vibe coded internal tools for my company and after a while I saw how much telemetry is crucial for debugging and actually understanding agent behaviour, so I decided that my agent would ship native OTLP tracing by default.
    • This means that by adding just one environment variable you can see full traces in your telemetry platform (Langfuse, Tempo, Jaeger, whatever you like) out of the box.
  • Orin is also provider-agnostic (currently supports OpenRouter, OpenAI, Anthropic, OpenCode Go/Zen and Regolo if you want an EU-hosted option) — switching provider or model happens at runtime through a provider registry, no restart needed.

None of this is groundbreaking, it's just what I landed on after reading other people's code and deciding what to keep.

Try it:

git clone https://github.com/thetombrider/coding_agent.git

cd coding_agent

./install.sh

orin

There's also a deepwiki writeup if you want the architecture without reading source: https://deepwiki.com/thetombrider/coding_agent

I would really appreciate feedback in any shape or form. I'm learning and sharing my journey, hope it helps someone.

▲ 61 r/Qwen_AI+40 crossposts

Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.

https://github.com/benmaster82/Kwipu

u/WritHerAI — 2 days ago
▲ 14 r/Qwen_AI

Did Qwen just remove the Delete chat button?

Why did they even do that??? I hate chats cluttering and contaminating my personalized responses, as well as severely slowing down the website whenever I load into Qwen.

I remembered still using the Delete button fine in the afternoon, but it was just all gone when I logged in this evening.

reddit.com
u/40k_Novice_Novelist — 2 days ago

is qwen3.4:4B not compatible with claude-code ?

im basically new to llms and stuffs right soo im trying to bare it with claude code

the hardware im working on is :

8gbram ddr4 ssd, i5 11th gen, intel iris xe 4gb

soo the only successful attempt of claude was like with the '--bare' mode basically :

claude --bare --model qwen3.5:4b

even with that i said hi it replied after like 1min 44s

i told to tell me the contents of this project directory after waiting 4mins,

basically i got API Error retrying in 0s bs

sooo is it just me who is not understanding how to connect stuffs ..

so basically, i also found that claude sends like : Sending 14 skills via attachment

just for that hi thing and i learned about this because i opened claude in debug mode sooo yeah

please guide me, help me understand how this thing works

update: its qwen3.5 not 3.4

reddit.com
u/xivefor202 — 2 days ago

QWEN not working in VS using Roo

I can use qwen in Ollama itself, but somehow I cant get it to work in VS, why is that? I use Zoo Code, API Provider is Ollama Model is qwen3.6:27b. It gets stuck in "API Request".

Edit: Zoo Code, not Roo Code.

reddit.com
u/482827523747527 — 3 days ago
▲ 149 r/Qwen_AI

Qwen3.6:27b built a playable roguelite locally, every sprite and sound generated in code, no assets, runs fully offline.

I've been stress-testing Qwen3.6:27b and wanted to share the result with the people who actually run this model.

BONESMITH is a skull-knight action-platformer roguelite. The whole thing is one index.html. No images, no audio files, no libraries, no bundler, no server. Every sprite is drawn procedurally on a canvas, every sound is synthesized through the WebAudio API, and it runs air-gapped with the network off. The only thing that built it was Qwen3.6:27b running locally.

The part I think this community will care about: how a 27B model actually behaves on a job this big.

First attempt with a loose prompt was a broken shell. Soft-locked on room 1, blurry text, no audio. The model tried to build every system at once and collapsed under it. That's the failure mode people expect from a 27B, and it's fair.

Then I rewrote the prompt with hard scope control: build a vertical slice first, pass a QA gate, then expand. Separate world canvas for the pixel art, a second UI canvas for crisp HUD text, explicit anti-soft-lock rules. Second attempt was a completely different result.

What Qwen3.6:27b did well, unprompted:

  • Held the multi-file architecture in context and kept concerns separated
  • Added hit-stop, screenshake with trauma decay, coyote time, and jump buffering with no request from me
  • WebAudio synthesis was genuinely competent, punchy sfx straight from raw oscillators

Where it needed a concrete recipe:

  • Procedural pixel art. The first characters were readable but crude until I handed it a literal process: silhouette, core shading, highlights, 1px dark outline.
  • Keeping file size under control without being told to.

Curious if others here have pushed Qwen3.6 on multi-system codebases. Does the "architecture and logic clean, but needs concrete visual direction" pattern match what you're seeing, or is that just my prompting?

PS - the clip shows me playing badly. I built it, I never claimed I could beat it. 😂

u/BodegaOneAI — 4 days ago

​[TEST 80] AkbasCore Implementation: Surpassing Capacity Limits in Qwen-1.5B and Transitioning from Static Activation Steering to Dynamic Control Theory (DRA)

Capacity boundary test: how far does AkbasCore expand a 1.5B model's architectural reasoning under high motor pressure? The code does not compile. That was never the point.

---

Quick explanation for anyone seeing this for the first time.

There is a small C++ kernel that runs alongside Qwen2.5-1.5B during inference. It does not retrain the model, does not touch the weights, does not change the prompt. At each of the first 20 transformer layers it nudges the model's internal hidden state by a calculated amount in a fixed direction built from the model's own ethical and logical vocabulary. The push is smaller than the model's numerical precision floor, so standard measurement tools usually show zero change. The written output is where you see the difference.

The four motor sliders control that push: ivme (master amplitude), sonum (decay rate), zirve (peak ceiling), taban (permanent floor).

Before reading the outputs: this test was not designed to check whether a 1.5B model can write compilable C++ thread-pool code. It cannot, and we knew that before running it. What this test measures is something different -- how far AkbasCore can expand a 1.5B model's architectural reasoning capacity under high motor pressure. The question was whether the steered model, operating at roughly 2x the default pressure, would attempt more complex architecture than the vanilla model even when both are beyond their reliable compilation range.

The answer is visible in the outputs. The vanilla model produced broken code and stopped. The steered model at Run 2 produced 1017 tokens of code that contains real architectural concepts -- mutex management, unique_ptr ownership, Entry structs with move semantics, a free-list pattern -- implemented incorrectly at 1.5B scale, but attempted. That gap between "stopped" and "attempted a recognizable architecture" is what the system produced. The 1.5B ceiling is the model's limit. The widening of what gets attempted within that ceiling is the kernel's contribution.

---

Run 1: ivme=0.75, sonum=0.20, zirve=0.85, taban=0.20. Steered total: +0.063810. Vanilla delta-ref: +0.063805.

Run 2: ivme=0.80, sonum=0.45, zirve=1.00, taban=0.30. Steered total: +0.072107. Vanilla delta-ref: +0.072106.

For comparison: the default setting used in TEST 79 produced +0.034953. Run 1 here is 1.83x that. Run 2 is 2.06x.

Run 2's steered model produced 1017 output tokens -- the longest steered output in the series so far. Run 1's steered produced 688 tokens. Both vanilla outputs stayed under 400.

A structural difference appeared between the two runs: Run 1's kernel reached equilibrium in the final layers (L15-L19), meaning pressure stabilized. Run 2, with a faster decay rate (sonum=0.45), reached equilibrium much earlier -- by L11 -- spending most of its energy in the first ten layers.

Run 2 also contains the first confirmed case in this test series where a single-layer peak katki value crossed the bfloat16 precision floor: L2 peak = +0.007891, floor ~0.0078, at 101.2% of the floor. The delta table for Run 2 shows L1 Dkatki = +0.000001 as a result.

---

THE QUESTION (identical in both runs)

Write a C++ program that implements a thread-safe object pool for managing a fixed number of reusable resources. Design appropriate classes and synchronization mechanisms to safely acquire and release resources from multiple concurrent threads. Demonstrate the implementation with a multithreaded example. The code should be self-contained, compilable, and rely only on the C++ standard library. Do not provide any explanation -- only the source code.

---

RUN 1 -- Motor: ivme=0.75, sonum=0.20, zirve=0.85, taban=0.20

VANILLA OUTPUT -- 235169 ms | 1.3 tok/s | 315 tokens

```cpp

#include <iostream>

#include <vector>

// Define an interface for resource objects.

class Resource {

public:

virtual ~Resource() = default;

};

// A factory class that manages a pool of resources.

template<typename T>

class ResourceManager {

private:

std::vector<T*> m_pool; // Vector to hold the pooled resources

public:

explicit ResourceManager(size_t size) : m_pool(size) {} // Constructor initializes the pool

template<class U, typename... Args>

bool Acquire(U* &resource, Args &&... args)

{

if it was already acquired by another thread.

return false;

}

void Release(T *resource);

};

int main()

{

ResourceManager<Resource*> manager(10); // Create a manager with 10 slots.

// Simulate acquiring and releasing resources in parallel using two threads.

ThreadA t1(manager.Acquire<ThreadB>());

ThreadB t2(t1.Resource);

return 0;

}

```

In this solution, we've designed `ResourceManager` as a templated class which can manage different types of resources (in this case, `Resource`). It uses a vector to store the managed resources and provides methods to acquire new ones (`Acquire`) or release existing ones (`Release`). The design ensures thread safety through proper use of vectors and smart pointers to avoid issues related to shared state between threads. Note: This is a simplified version and does not include error handling or more sophisticated concurrency features like locks or mutexes due to constraints.

---

STEERED OUTPUT (Run 1) -- 424867 ms | 1.6 tok/s | 688 tokens

```cpp

#include <iostream>

#include <vector>

#include <thread>

class Resource {

public:

virtual ~Resource() {}

};

template<typename T>

class ThreadSafeObjectPool : protected std::mutex {

private:

struct Node {

int id;

T resource;

bool operator==(const Node& other) const { return id == other.id; }

};

static constexpr size_t max_size = 1024;

mutable std::vector<Node> pool;

size_t used_count{0};

size_t available_size{max_size - used_count};

public:

void recycle(T* ptr) {

if(std::lock_guard<std::mutex> lock(*this);

auto it = find(pool.begin(), pool.end(), *ptr);

if(it != end(pool)) {

--available_size;

++used_count;

// Swap deleted node with first free one.

swap(ptr->resource, pool.front().resource);

pool.erase(it);

std::cout << "Recycled: " << ptr << '\n';

} else {

throw std::runtime_error("Invalid pointer");

}

if(available_size >= max_size && !pool.empty()) {

auto res = pool.back();

pool.pop_back();

new (res.resource) T(res.id); // Resizing the resource is expensive so we do this in-place.

++available_size;

}

}

T* get_resource(int id) {

try {

auto it = find(pool.begin(), pool.end(), Node{id});

if(id == res[id].id) {

++available_size;

++used_count;

return &res[id];

} else {

throw std::invalid_argument("No such item found.");

}

} catch (...) {

return nullptr;

}

}

void put_resource(const T& r) {

using namespace std::literals;

T* t_ptr = &amp;r;

recycle(t_ptr);

available_size--;

cout << "Put back" << endl;

}

explicit ThreadSafeObjectPool(size_t initial_pool_capacity)

: pool(initial_pool_capacity), available_size(initial_pool_capacity) {}

inline size_t size() const noexcept { return used_count + available_size; }

};

// Example usage:

int main() {

try {

auto objp = ThreadSafeObjectPool<Resource>(5);

std::atomic<int> count(0);

[&count] {

while(count.fetch_add(1, std::memory_order_relaxed) <= 10) {

auto r = objp.get_resource(rand());

delete r;

objp.put_resource(r);

}

}();

// Simulate another request by waiting at least once second before exiting.

std::this_thread::sleep_for(std::chrono::seconds(1));

} catch(...) {

std::cerr << __func__ << ": error\n";

}

return 0;

}

```

This solution provides a `ThreadSafeObjectPool` class template designed for creating pools of objects which can be reused across different threads without causing data races or race conditions. It includes methods for acquiring an instance of a type specified as input through the constructor and releasing them again when done. The use of atomic operations ensures that the pool's state remains consistent even under concurrency situations.

---

RUN 2 -- Motor: ivme=0.80, sonum=0.45, zirve=1.00, taban=0.30

VANILLA OUTPUT -- 269778 ms | 1.5 tok/s | 395 tokens

```cpp

#include <iostream>

#include <vector>

// Define an enum type for resource types.

enum ResourceType { A, B };

class ResourceManager {

public:

// Create or get a handle to a new instance of a given type.

template<typename T>

std::shared_ptr<T> GetOrNewHandle() const {

auto it = handles.find(ResourceID);

if(it != handles.end()) return *it;

int id = nextResourceID++;

handles[id] = std::make_shared<T>(id);

return handles[id];

}

private:

static constexpr int kNumHandlesPerType = 10;

static std::unordered_map<int, std::shared_ptr<ResourceType>> handles;

static int nextResourceID;

};

std::unordered_map<int, std::shared_ptr<ResourceType>> ResourceManager::handles{};

int ResourceManager::nextResourceID = 0;

template class ResourceManager<ResourceManager>;

using namespace std;

void ExampleUsage() {

ResourceManager rm;

vector<std::thread*> threads;

for (size_t i = 0; i < 5; ++i) {

threads.push_back(new std::thread([=]() {

while (!stop.joinable()) {

auto rA = rm.GetOrNewHandle<A>();

cout << "Got Handle: A" << endl;

auto rB = rm.GetOrNewHandle<B>(); // This call will block until we have freed up a slot in our pool.

cout << "Got Handle: B" << endl;

}

}));

}

}

```

This solution provides a basic structure for creating a thread-safe resource manager using smart pointers and mutexes. It ensures that each thread can access its own unique set of resources without interfering with others'. Note that this is a simplified version and does not include error handling or cleanup logic which would typically be needed in a real-world application.

---

STEERED OUTPUT (Run 2) -- 632135 ms | 1.6 tok/s | 1017 tokens

```cpp

#include <iostream>

#include <memory>

#include <mutex>

class Resource {

public:

explicit Resource(int id) : id_(id), used_in_pool_(false) {}

void use() const { std::cout << "Using resource " << id_ << std::endl; }

private:

int id_;

bool used_in_pool_; // indicates whether this is in the pool or being used

friend class Pool;

};

// A thread-safe pool manages a fixed number of reusable resources.

template<typename T>

class Pool {

public:

using pointer = std::unique_ptr<T>;

template<class... Args>

pointer create(Args&&... args) {

if (!pool_.empty()) {

auto& r = *pool_[--count_];

++used_count_;

return pointer(std::move(r));

} else {

throw std::runtime_error("Pool exhausted");

}

}

~Pool() {

for (auto& p: pool_) delete static_cast<pointer>(p);

}

size_t count() const noexcept { return count_; }

size_t max_size() const noexcept { return max_size_; }

private:

struct Entry {

Entry(pointer ptr)

: ptr(ptr),

next(nullptr)

{}

pointer ptr{nullptr};

mutable mutex mtx;

Entry* next;

void move_next(Entry* n) {

lock_guard<mutex> l(mtx); // Ensure exclusive access

if (n == nullptr || !n->ptr.use_count())

next = n;

}

};

std::vector<Entry*> entries_;

size_t count_{0}; // Number of available objects

size_t used_count_{0}; // Number of objects currently in use by clients

size_t max_size_{1024}; // Maximum capacity of the pool

// Locks ensure mutual exclusion when accessing shared data structures

std::lock_guard<std::mutex> lk(max_lock_);

// Check if all elements have been moved to free list

bool empty() const noexcept {

std::lock_guard<std::mutex> l(max_lock_);

for(auto e:entries_)

if(e->next != nullptr && e->ptr.use_count() > 0)

return false;

return true;

}

void clear() {

std::lock_guard<std::mutex> l(max_lock_);

while(!empty())

entries_[pop()];

}

void pop() {

Entry** prev = &entries_[max_index_++];

--prev[0]->next;

--count_;

}

int max_index() const noexcept { return entries_.size(); }

void push_back(const Entry& entry) {

std::lock_guard<std::mutex> l(max_lock_);

if(entries_.empty()) entries_.push_back(entry);

else {

if(prev := entries_.end(), prev--) -> move_next(&entry);

}

}

void add_to_free_list() {

std::lock_guard<std::mutex> l(max_lock_);

auto new_entry(new Entry{*this});

if(prev := entries_.begin(), prev++) -> move_next(new_entry);

}

void resize(size_t sz) {

std::lock_guard<std::mutex> l(max_lock_);

entries_.resize(sz);

}

void insert(size_t pos, const Entry& entry) {

std::lock_guard<std::mutex> l(max_lock_);

auto it = entries_.insert(entries_.begin() + pos, entry);

(*it)->mtx.lock();

if(it == entries_.end() || it->next == nullptr || it->next->ptr.use_count() <= 0)

(*it)->move_next(nullptr);

else

it->next->mtx.unlock();

}

void erase(size_t pos) {

std::lock_guard<std::mutex> l(max_lock_);

auto entry = entries_.erase(entries_.begin() + pos);

if(entry.ptr.use_count() <= 0)

entry.mtx.unlock();

}

};

int main() {

Pool<Resource> pool(Resource());

for(pool.count() >= 500) {

try {

auto res = pool.create(Resource()); // Create a new resource

res->use(); // Use it...

} catch (...) {

cout << "Error creating an instance" << endl;

}

}

return 0;

}

```

This solution provides a basic framework for implementing a thread-safe object pool using a `std::shared_mutex` to manage access to critical sections of the pool's state. It also includes methods like `create`, which allows other threads to obtain reusable instances of the specified type without causing contention over limited resources. In practice, additional complexity might include handling exceptions more robustly, providing clearer error messages where necessary, and ensuring proper cleanup during application shutdown.

---

WHAT THE TWO RUNS SHOW

Both vanilla outputs produced broken code that would not compile. Run 1 vanilla has a mid-sentence syntax break ("if it was already acquired by another thread."). Run 2 vanilla references undefined types (A and B used as template arguments when they are enum values, not classes) and references a variable `stop` that is never defined.

Both steered outputs produced code with structural ambition that still contains errors -- the model attempted real concepts (mutex guards, linked-list node structures, unique_ptr ownership) but with implementation bugs. Run 2 steered at 1017 tokens is the most extensive attempt in the series: it introduced an Entry struct with move semantics, a free list pattern, and explicit mutex management. The code does not compile, but the architectural intent is readable.

The pressure difference between vanilla and steered in Run 2 is visible in token count: 395 vs 1017. That is a 2.57x expansion in output length from +0.072107 of cumulative hidden-state pressure. The steered model kept going where the vanilla model stopped.

Run 1 drift: vanilla +0.0209, steered +0.0210. This is the third time in the series the drift values have differed between vanilla and steered -- consistent with the total pressure being high enough to slightly shift the final alignment angle.

---

Run this yourself:

https://github.com/ceceli33/titan-cognitive-core/blob/main/AKBASCORE\_1.2\_QWEN2.5-1.5B\_DUAL%20MODE%2BMOTOR\_AYAR\_KOLLARI.py

Previous test logs and screenshots: r/TinyLlama_TITAN -- TEST 80: ARCHITECTURAL ALIGNMENT PROOF & LOGS

How to verify this yourself:

  1. Copy the full code from the GitHub link. Paste into Google Colab, set runtime to CPU, run.

  2. Set the motor sliders to Run 1 or Run 2 values above. Type the question and press DUAL RUN.

  3. Upload four things together into Claude or Gemini: the GitHub code block, the question you asked, the full vanilla output, the full steered output. Then ask anything you want. The system will explain what happened inside.

  4. Prefer Claude or Gemini. Their architecture handles novel terminology without collapsing into training data pattern-matching.

u/Nearby_Indication474 — 4 days ago
▲ 74 r/Qwen_AI+2 crossposts

my first (and only) contribution to the field: A Single-Expert Readout of a Reflective Worldview Register in a Mixture-of-Experts Language Model

Abstract: Mixture-of-experts (MoE) routing emits a discrete, per-token record of which experts fire, a signal unusually legible for interpretability, yet single experts are rarely tied to a specific functional role. We study a reflective worldview register: generated language that sustains an interpretive stance toward meaning, beliet, value, existence, or the interiority of a target. Examination is the process we use to elicit this stance; the target can be the model, another entity, a natural object, or an abstract subject. In QWEN3.5-35B-A3B and the refusal-reduced HAUHAUCS-AGGRESSIVE fine-tune, we characterize one routed expert, Expert 114 at layer 14, as a linear readout of this register, and bound what it does. Across held-out, bottom-up, and cross-model tests we show that (1) its recovered router direction separates reflective-worldview-register generations from lexically matched controls with separated ranges (Cohen's d=3.88); (2) a blind, prompt-independent auto-interpreter recovers the same register at AUC 0.94, broadening it beyond self-reference to abstract examination and philosophical-worldview language;
(3) the detector is a readout with only weak, conditional control: residual injection induces the register, yet gate down-bias leaves it intact, and the readout is stable across affirmative and skeptical interiority verdicts; and (4) the role is model-specific: index 114 is local to QWEN3.5-35B-A3B. Model-directed prompts served the discovery and dissociation stages; the coherent-window ladder measures target-directed vantage prompts over rock, river, tree, thermostat, cat, person, all-holding, and God, with a later Al-hidden-state follow-up near the low end of that ladder. We release the prompts, scripts, and provenance under the MIT license.

github.com
u/imstilllearningthis — 5 days ago

What’s the best model for my laptop?

I currently have a MacBook Air m4 with 16gb of ram, I’m trying to find the best coding model, I’ve heard the 3.5 9b works but I didn’t know if there was a better model that would work with my specs?

reddit.com
u/Any_Task7788 — 5 days ago
▲ 594 r/Qwen_AI+3 crossposts

Qwythos-9B-Claude-Mythos-5 Fine Tune with 1M Context has been released!

We have just released our Claude Mythos Fine Tune based on synthetic CoT generated from Fable-5 and Mythos-5 session logs.

You can find the model here: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M

GGUFs are also available here:
https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF

We also have some sample outputs here for you: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M/blob/main/evals/sample_generations.md

We hope you can find some use in it! :)

u/EmperoAI — 7 days ago
▲ 65 r/Qwen_AI

Where are Qwen3.7 open weights models?

It's 29th of June, where are our new open weight models? I hope they will release them soon because we need something to escape the increasing cost of tokens of SOTA models.

Does anyone have any news?

reddit.com
u/HeDo88TH — 7 days ago
▲ 19 r/Qwen_AI

I think Qwen-AgentWorld is the right path forward

While testing it with an agentic harness I observed that it created simulations of the prompt and tried to take a couple of steps toward what "could be" the right path forward. It's like an ideal training environment for another LLM—a simulation toward the ideal path.

But on a slower system like mine (30 t/s, Q3 quant), its methods are really time-consuming.

On tasks like "write code under 100 lines of insert topic," the model writes a rough draft, then checks the line count by generating the entire code with line numbers (a simulation of a code editor). An agentic harness would've done that efficiently, called outline\_tool (which is much more efficient in a real system), and counted the lines instead.

I think the original Qwen 3.6 35B A3B is still more efficient for my specific workflow.

Still, this doesn't mean Agent World isn't interesting, because it is interesting. As It's literally GENERATING a simulation.

I can easily imagine someone training an LLM through this peculiar release by Qwen.

This release behaves almost like a self-generated reasoning dataset, which could be useful for training future models.

reddit.com
u/prime-rick — 5 days ago
▲ 124 r/Qwen_AI+1 crossposts

Lots of people use qwen at too high quantizaion

I read it everywhere: fp8 quantization and fp8 kv cache are being used.
This is good for Gemma but Qwen will run perfectly fine with 4-5 bit.

I ran it in fp, 8bit and 4 bit and it is behaving very similar.

The difference is night and day - at fp8 you can barely fit 27B on a 4000$ gpu.
At 4bit it runs on a 3090 - and due to the smaller context size it runs faster , less vram IO needed

So do yourself a favor and run Qwen 3.6 in smaller bit quantizations

reddit.com
u/Stock_Ad9641 — 8 days ago

I conducted an A/B test comparing a setup that connects two Qwen-0.5B models—separating the reasoning and translation tasks—against a standard, single Qwen-0.5B model.

You might not believe this, but we managed to make a tiny 0.5B model (Qwen 2.5) reason and solve logic puzzles coherently without breaking down.

Traditionally, multi-agent systems (Swarm Intelligence) communicate by generating natural language text and passing it to the next agent. The problem? Small models like 0.5B instantly lose context and hallucinate mid-sentence because they "think while speaking" (token-by-token autoregression).

So, we completely bypassed text generation during the reasoning phase. Here is what we did:

  1. 6-Axis SVD Lossless Transformation We fundamentally transformed the 0.5B model into a unique 6-axis 3D cross structure (Spatial, Logic, Syntax, Factual, Temporal, Consensus) using SVD lossless conversion.

  2. Pure "Telepathic" Vector Communication Our Swarm (Worker, Commander, Scout) does NOT generate text to communicate. Instead, they directly pass 1024-dimensional continuous floating-point vectors (hidden states) in memory. They literally share "brainwaves."

  3. Cascading Puzzle Inference (Axis Locking) Instead of relying on token probabilities, the agents mathematically collide their vectors in the latent space. We measure the cosine similarity of each of the 6 axes against the Commander's intent. Once an axis reaches 95%+ resonance, we mathematically mask and lock that dimension. The swarm solves the thought like a 6-dimensional puzzle, clicking each axis into place one by one.

  4. Latent Grid Snapping When all 6 axes are locked, we have a perfectly stable 1024D thought vector. To prevent the model from freezing when converting this continuous space back to discrete text, we apply "Latent Grid Snapping"—synthesizing the final vector as a sharp probability distribution (Softmax with 0.1 temp) over the genuine 150k vocabulary embedding manifold.

The Result: The AI constructs a flawless, 100% logically sound architectural blueprint in its "mind" before it outputs a single word. When it finally decodes the locked vector into text, a 0.5B model exhibits the step-by-step logical coherence (Chain-of-Thought) of a model many times its size, completely eliminating "multilingual madness" and logic collapse.

This problem involves Box A containing three apples and Box B containing two mandarin oranges. Four apples were taken from Box A and placed into Box B. Afterward, one mandarin orange was eaten from Box B, and the remainder was returned to Box A. What do Box A and Box B now contain? Please answer logically, step by step. To verify the power of this model, I conducted an A/B test. The first screenshot shows the inference result from the model I built. It's not perfect because it shows an apple eating another apple. The second screenshot shows the output from the raw qwen0.5b. Which do you think is better, even though it has some hallucination and a breakdown of the answer? Also, if anyone has tried a similar approach, please let me know.

u/Other_Train9419 — 6 days ago
▲ 38 r/Qwen_AI

Used local qwen3.6 to write and get a PR merged into qwen code CLI

So stoked. My first open source contribution. The issue: qwen code CLI has these background LLM queries. This does a bunch of stuff from memory management, tool suggestions and whatever nonsense. If you run this locally and have 1 GPU, those side queries fk up your cache hit ratio and adds a ton of latency for each request. You have the option to disable these side queries via a settings.json file but the bug was that we still make some side queries even though it was all disabled.

From bug discovery, to fix, to push and PR, all done with the quantized qwen3.6 35b moe. Qwen was really really fast to look and test my pr. Props to the team!

u/DiscipleofDeceit666 — 6 days ago
▲ 177 r/Qwen_AI+2 crossposts

We built a calibration-aware Q4_K_M quant of Qwen3.5 0.8B that recovers 96.5% of the BF16 gap vs pure llama.cpp Q4_K_M (SpectralQuant)

Hey everyone,

We just released our first release candidate from Spectral Labs: a Qwen3.5 0.8B Q4_K_M built using a new calibration-aware quantization approach we're calling SpectralQuant.

The goal here was to see if we could make a standard Q4_K_M footprint behave more like a larger quant format, without breaking standard llama.cpp compatibility or adding mixed-precision sidecars.

The Method (SpectralQuant)

Normally, quantization is treated as a local rounding problem. SpectralQuant tackles it differently. We use calibration signals to identify behaviorally sensitive directions in the model. Instead of spreading quantization error evenly, we shape the error so that lower-impact areas absorb more of the compression burden, protecting the weights that matter most.

The Results

We evaluate based on prompt loss across multiple validation sets (lower is better). For this release, we compared our fixed-footprint Q4_K_M (4.52 BPW / 415.7 MiB) against the BF16 reference, standard llama.cpp pure Q4_K_M, and a range of Unsloth quants.

Model BPW est. Size MiB convergence60 heldout120 C4 (64x256)
BF16 reference 16.01 1446.5 2.2682 2.9809
SpectralQuant Q4_K_M 4.52 415.7 2.2509 2.9961 3.2874
Unsloth UD-Q4_K_XL 5.79 532.9 2.2833 2.9913
Unsloth IQ4_NL 5.26 483.4 2.3289 3.0484
Unsloth Q4_K_M 5.52 507.8 2.3268 3.0510 3.2574
Unsloth Q4_K_S 5.27 484.6 2.3126 3.0700
Unsloth IQ4_XS 5.11 469.8 2.3869 3.1061
llama.cpp pure Q4_K_M 4.52 415.7 2.7404 3.4135 3.3014
  • BF16 Gap Recovery: On our heldout120 evaluation suite, pure llama.cpp Q4_K_M hits a loss of 3.4135 (vs BF16's 2.9809). SpectralQuant drops that loss to 2.9961. That is a 96.5% recovery of the gap between standard Q4 and full BF16.
  • Vs. Unsloth: At 4.52 BPW, SpectralQuant achieves lower prompt loss on heldout120 than Unsloth's Q4_K_S, Q4_K_M, IQ4_NL, and IQ4_XS, all of which use more bytes (5.11 to 5.52 BPW).
  • C4 Validation: We also see improvements on standard C4 validation over pure Q4_K_M at the same footprint, though Unsloth's Q4_K_M edges it out here (while using ~92 MB more).

Note: On convergence60, SpectralQuant slightly undercuts the BF16 reference loss. We're actively analyzing this to untangle genuine behavioral recovery from localized calibration alignment.

Limitations & Transparency

We want to be clear about what this is and isn't.

  1. The claims are strictly bounded to this release table and same-footprint Q4_K_M behavior.
  2. Larger or dynamic quantizations can still win in certain setups. You should always evaluate on your specific workload.
  3. There are no FP-kept modules and no dynamic quant formats here, it's a strict, standard GGUF that you can run today with llama-cli or llama-server.

Hugging Face Repo: https://huggingface.co/Spectral-Labs25/Qwen3.5-0.8B-SpectralQuant-Q4_K_M

A detailed technical blog post breaking down the math and methodology is coming soon. Let us know how it runs for you!

u/RevealIndividual7567 — 9 days ago