u/Civil-Direction-6981

▲ 36 r/Qwen_AI+1 crossposts

We Spent 28 Hours Trying to Beat Int4 on Qwen3.5-9B Using Spectral Decomposition. The Weights Are Near-Full-Rank.

How this was done: The entire research project was supervised and driven by Aura, an open-source multi-agent research framework. Aura decomposes research tasks into a dependency tree, spawns workers, and verifies results. For this project, the agent lineup was: DeepSeek V4 as the orchestrator (task planning, quality gate verification, replan decisions), Codex (OpenAI) as Pro workers (heavy compute: SVD on real weights, model assembly, final evaluation), and Qwen3.5-27B running locally as Flash workers (light compute: dependency installs, model downloads, scripting). All 53 tasks — including the 14 failures and subsequent rebuilds — were managed automatically by Aura's scheduling and verification loop. No human touched the pipeline during execution.

Bottom Line Up Front

What we attempted: A hybrid compression framework (SVD low-rank + Tensor-Train + 4-bit quantization) targeting 5x compression over FP16 — aiming to beat standard int4 by ~25%.

What we found: Qwen3.5-9B's weight matrices are near-full-rank. Every "smart" decomposition — SVD, Tensor-Train, kernel methods — collapsed to the same ~4x ceiling. The final 3.93x result is mathematically equivalent to plain int4 quantization. After 28.6 hours, 53 tasks, and 14 failures, we proved that standard int4 is optimal under realistic quality constraints for this model.

The headline number: 3.93x compression, ppl=10.60 (baseline 6.71, +58%). No better than GPTQ. No better than AWQ. No better than bitsandbytes. The 5x target is structurally impossible — blocked by the embedding layer's irreducible information content.

Is this a failure? As a compression result, yes. As a research finding, no. We now know with experimental certainty that post-hoc low-rank methods cannot beat int4 on this generation of LLM weights. That's worth publishing.

0. A Quick Note on Qwen3.5-9B

Before diving into the compression story, a few words about the model itself — because its architecture matters for understanding why it resists compression.

Qwen3.5-9B is not a vanilla dense transformer. It has a hybrid attention architecture: 8 layers of full self-attention followed by 24 layers of linear attention (DeltaNet-style). This is Alibaba's bet on efficient long-context inference — replace the O(n²) quadratic attention with O(n) linear attention in most layers.

Notable architectural choices:

Feature Qwen3.5-9B Llama 3.1 8B (for reference)
Vocabulary 248,320 128,000
Hidden dim 4,096 4,096
FFN intermediate 12,288 (3.0x) 14,336 (3.5x)
Attention heads (Q) 16 32
KV heads (GQA) 4 8
Full attn layers 8 32
Linear attn layers 24 0
Vision encoder Yes (ViT) No

Three things stand out for compression:

  1. The vocabulary is massive. 248K tokens at 4096 dimensions = 2.03 GB for the embedding alone. That's 11.4% of the model in two tensors. For comparison, Llama 3.1 8B's embedding is 128K × 4096 = 1.05 GB — half the size. This single design decision makes 5x compression nearly impossible before you touch a single FFN weight.
  2. The 3.0x FFN expansion ratio is on the aggressive side. With intermediate_size=12288 (vs hidden=4096), each FFN layer's gate/up projections are [12288, 4096] — wide, fat matrices. This is where 54.2% of the parameters live. If these matrices had low-rank structure, the payoff would be enormous. (Spoiler: they don't.)
  3. The hybrid attention design creates two distinct "species" of weights. The 8 full-attention layers and 24 linear-attention layers have different projection structures — different shapes, different roles, different sensitivity to perturbation. A uniform compression strategy is unlikely to be optimal across both.

The model's FP16 perplexity on wikitext-2 is 6.71 — solid for a 9B model, roughly on par with Llama 3 8B. The hybrid architecture seems to work: you get long-context efficiency without sacrificing short-context quality. But the architectural choices that make it a good model also make it a compression researcher's nightmare.

1. The Ambitious Plan: SAHC Framework

We set out to beat standard 4-bit quantization (~4x compression) using what we called Spectrum-Aware Hybrid Compression (SAHC). The idea: different transformer components have different spectral structure, so compress them differently.

Architecture Breakdown & Attack Plan

Component % of Params Method Theoretical Ratio
C1: FFN (gate/up/down) 54.2% SVD low-rank + 4-bit quant 25.3x
C2: Full Attention (8 layers) 3.8% EYM low-rank + head pruning + 4-bit 8.0x
C3: Linear Attention (24 layers) 11.3% Kernel SVD + value pruning + 4-bit 12.6x
C4: Emb + LM Head 22.8% Tensor-Train + SVD + 4-bit 20.5x
C5: Vision + MTP 7.9% Standard 4-bit ~4x

Projected total: 10–13x compression with controlled perplexity degradation.

The Theory

Eckart-Young-Mirsky Theorem. For any real matrix W, the optimal rank-r approximation under Frobenius norm is given by truncated SVD:

W_r = U_r Σ_r V_r^T

with error:

||W - W_r||_F / ||W||_F = sqrt( Σ_{i=r+1}^{d} σ_i² / Σ_{i=1}^{d} σ_i² )

The key variable is the singular value decay rate. For exponential decay (σ_i ∝ e^({-λi}),) rank r=512 captures >99.9% of the Frobenius norm even at d=4096. For flat decay, you capture essentially nothing. Everything depends on what the actual weights look like.

Tensor-Train (TT) Decomposition. For the 248,320 × 4,096 embedding matrix, we reshape into a high-order tensor and factorize across modes:

W[i, j] ≈ G₁[i, r₁] · G₂[r₁, r₂] · G₃[r₂, j]

With TT-ranks [1, 32, 32, 1] and an SVD core at rank 1,024, the theoretical compression is ~175x with error bound ε ≤ 0.017. The factorization exploits the fact that vocabulary tokens should be distinguishable in far fewer than 4,096 dimensions — or so the theory goes.

Shannon Rate-Distortion. Information-theoretic lower bounds gave three regimes:

  • Independent coding (int4): ~3.3 bits/weight → ~4.8x
  • Joint coding (cross-layer correlation): ~1.0–2.0 bits/weight → ~8–16x
  • Spectral coding (low-rank structure): ~0.1–0.3 bits/weight → 50–178x

The math was clean. The bounds were tight. Everything was provable on paper.

We were wrong.

2. The First Red Flag: SVD Error 21x Worse Than Theory

We compressed all 32 FFN layers using SVD rank-512 + 4-bit quantization:

Compression ratio: 24.00x (logical) / 7.90x (actual with .pt overhead)
Mean SVD error: 0.834
Mean total error: 0.982

To put those numbers in perspective:

  • SVD error 0.83: the rank-512 approximation throws away more signal than noise
  • Total error 0.98–1.02: in 7 of 32 layers, the error exceeded 1.0 — the compressed matrix is worse than replacing it with zeros. A zero matrix has error = 1.0 by definition. We achieved worse than nothing.
  • Theoretical bound: Eckart-Young-Mirsky predicted ε ≤ 0.039 at r=512

That's a 21x gap. The automated pipeline didn't flag it — it just reported "24.00x compression" and moved on to the next step. The compressed model's perplexity was infinity. Not "bad." Not "degraded." Literally inf — the token probabilities collapsed to zero for legitimate tokens.

The Root Cause

We ran a full SVD on FFN layer 15 to inspect the spectrum:

Singular values of gate_proj [12288, 4096]:

σ₁    = 4.65
σ₂    = 4.05
σ₃    = 3.65
σ₄    = 3.55
...
σ₅₁₂  = 2.03    ← 44% of σ₁ — barely any decay
σ₁₀₂₄ = 1.72    ← 37% of σ₁
σ₂₀₄₈ = 1.18    ← 25% of σ₁
σ₄₀₉₆ = 0.95    ← 20% of σ₁ — the smallest is still substantial

The spectrum is essentially flat. Condition number σ_max/σ_min ≈ 4.9. An exponentially-decaying matrix might have σ_max/σ_min ≈ 10⁶ — that's the regime where SVD compression works.

Minimum rank needed for SVD error < 0.10:

  • gate_proj: 3,794 / 4,096 → 92.6% of full rank
  • up_proj: 3,814 / 4,096 → 93.1% of full rank
  • down_proj: 3,755 / 4,096 → 91.7% of full rank

These matrices are near-full-rank. SVD truncation at any meaningful compression level is mathematically guaranteed to fail.

This is a structural property of the weights, not an implementation bug. The Qwen3.5 training process produced FFN matrices where every dimension carries non-negligible information. The SWiGLU activation with 3.0x expansion likely contributes — with that much capacity, the model uses every degree of freedom.

3. Attention Was Even Worse

The same near-full-rank property holds across all attention projections (Q, K, V, O) in all 32 layers. But attention had an additional failure mode.

Even full-rank SVD + int4 (i.e., keeping all 4,096 singular values, only quantizing the factors) failed our forward-pass smoke test. Cosine similarity between original and reconstructed layer outputs fell below 0.95 for attention layers. Why?

The reconstruction error propagates through the multi-head attention computation differently than through an FFN. In FFN, the error is additive per-layer. In attention, the Q·K^(T) softmax amplifies small perturbations through the exponential, and the weighted sum over V compounds them across heads. The same Frobenius error budget that passes in an FFN causes divergence in attention.

Conclusion: SVD-based compression is unusable for Qwen3.5-9B attention layers. Direct weight-only int4 was the sole surviving option.

4. The Embedding Wall

The embedding matrix is 248,320 × 4,096. At FP16, that's 2.03 GB. The LM head is another 2.03 GB (tied weights in this architecture, so they mirror each other). Together they're 22.8% of the model.

Simple arithmetic: to reach 5x total compression when FFN and Attention are capped at ~4x by int4, Embedding + LM Head must deliver ≥100x compression. That's the only path.

We ran a sweep of 7 TT rank configurations:

TT Config              Comp. Ratio    Cosine    Quality Gate (cos ≥ 0.95)
─────────────────────────────────────────────────────────────────────────
[1,   8,    8, 1]       1,424x       0.761     ❌ FAIL
[1,  16,   16, 1]         389x       0.815     ❌ FAIL
[1,  32,   32, 1]         113x       0.857     ❌ FAIL
[1,  64,   64, 1]          35x       0.895     ❌ FAIL
[1, 128,  128, 1]          12x       0.923     ❌ FAIL
[1, 256,  256, 1]           5x       0.941     ❌ FAIL  ← only 0.009 away
[1, 512, 4096, 1]         3.92x      0.990     ✅ PASS  ← full rank

Every configuration above 100x had cosine below 0.86 — catastrophic degradation. Even the [1,256,256,1] config at just 4.85x missed the gate by 0.009. Only full-rank (SVD core at 4,096) passed, giving 3.92x — indistinguishable from direct int4.

Why? Because 248,320 vocabulary tokens genuinely need nearly all 4,096 embedding dimensions to be distinguishable. The embedding is not learning a manifold of dimension ≪ 4096 — it's using the full space. This is likely a consequence of the multilingual, multi-script vocabulary (Chinese characters, Latin, code tokens, special symbols all coexist) plus the vision tokens from the ViT encoder. The semantic space is genuinely high-dimensional.

This is the structural blocker. The embedding doesn't compress because it can't compress — not under any decomposition — because its information content is genuinely ~4 bits per parameter. Qwen's decision to use a 248K vocabulary makes this worse than it would be for a comparable model with a 128K vocabulary, but the fundamental issue (no low-rank structure in learned embeddings) likely applies broadly.

5. Final Results

After abandoning SVD and TT decomposition, the SAHC framework collapsed to its baseline:

Component Final Method Compression
FFN (54.2%) Row-wise direct int4 3.99x
Attention (15.1%) Row-wise direct int4 3.99x
Emb / LM Head (22.8%) Full-rank TT + int4 3.92x
Vision + MTP + Norm (7.9%) Row-wise int4 3.13x–3.89x
Total 3.93x

Perplexity: 10.60 (baseline 6.71, +58%). Acceptable under our ppl < 100 threshold, but a clear degradation from int4 quantization noise.

Every component converged to the same ~4x ratio. The methods differ in name (TT+SVD+int4 vs direct int4), but the information retained and discarded is mathematically identical — 4 bits per parameter in all cases.

6. What This Tells Us About Qwen3.5-9B (and Modern LLMs Generally)

The Model Itself

Qwen3.5-9B is architecturally interesting. The hybrid full/linear attention split (8+24) is a genuine innovation — it's not just "another dense transformer." The model achieves competitive perplexity (6.71 on wikitext-2) with a design optimized for long-context efficiency. For practitioners, it's a solid choice in the 9B class, particularly for tasks that benefit from long context windows where the linear attention layers shine.

But from a compression perspective, the architecture has structural anti-compression properties:

  • The 248K vocabulary is a hard blocker. It's 1.94x larger than Llama 3's 128K. Every byte of compression you gain elsewhere is diluted by these two immovable 2GB tensors. If Alibaba released a Qwen variant with a 128K vocabulary (still large for a multilingual model), the 5x target might actually be reachable.
  • The 3.0x FFN expansion creates matrices where rank reduction is impossible. With more intermediate dimensions than necessary, the model fills them all. A smaller expansion (2.5x or 2.0x) would produce matrices with genuine low-rank structure.
  • The linear attention layers compress differently from full attention, but both end up at the same ceiling. The architecture is diverse, but the irreducible information content per weight is uniform (~4 bits).

The Bigger Picture

Our bet is that this near-full-rank property is not unique to Qwen. Modern LLMs are trained with:

  • AdamW optimizers that don't impose rank regularization
  • Large learning rate schedules that explore full parameter space
  • Massive data budgets that saturate model capacity
  • No structural bottleneck that would force low-rank representations

Every indication suggests that Llama, Mistral, DeepSeek, and other modern LLMs have similarly incompressible weights. If you're working on post-hoc LLM compression, run an SVD on one layer first. Ten minutes of compute will tell you whether your method has any chance.

The most productive compression research directions for modern LLMs are likely:

  1. Quantization-aware training (QAT) — bake low-bit representations into the training process
  2. Pruning during pre-training — let the model learn which parameters to drop
  3. Knowledge distillation — train a smaller model from scratch with the large model as teacher
  4. Architecture co-design — design the model for compressibility from day one (vocabulary size, FFN ratio, attention head count)

Post-hoc compression of already-trained weights with "smart math" may simply have hit its limit at int4.

7. Practical Lessons for Compression Research

  1. Run an SVD before you design the pipeline. A single-layer singular value spectrum takes 10 minutes and tells you whether your entire approach is viable. We learned this the hard way after 28 hours.
  2. Theoretical compression ratios are fantasy without spectral validation. Every component's theoretical bound was 5–200x above reality because the rank assumptions were false. EYM is correct; the premise ("weights are low-rank") is wrong.
  3. Automatic pipelines need quality gates. Our system initially reported "24.00x compression" for matrices with error worse than the zero baseline. Adding mandatory per-task error reporting, theoretical-vs-actual comparison, and forward-pass smoke tests (cosine ≥ 0.95) caught every subsequent failure before it propagated downstream.
  4. The 3.9–4.0x ceiling is not an algorithmic limitation — it's an information-theoretic one. Int4 quantization is near-optimal not because we lack clever algorithms, but because modern LLM weights genuinely contain ~4 bits of non-redundant information per 16-bit parameter. Additional compression requires removing information, not just rearranging it.
  5. Embedding layers are the bottleneck no one talks about. In Qwen3.5-9B, the embedding and LM head eat 22.8% of storage. A 248K vocabulary at 4096 dimensions is 1 billion parameters that refuse any decomposition. Vocabulary design is a compression decision, not just a tokenization decision.

8. Reproducibility

  • Model: Qwen3.5-9B (Qwen/Qwen3.5-9B on HuggingFace)
  • Environment: 5090 32G Docker workers, Python 3.10, PyTorch 2.5.1
  • Evaluation: wikitext-2-raw-v1, 500-token perplexity
  • Quality gates enforced: SVD truncation error, quantization error, total reconstruction error (<1.0 baseline), theoretical bound comparison, forward cosine similarity (≥0.95), tensor shape match (775/775)
  • Task count: 53 tasks (39 completed, 14 failed), 28.6 hours, 281 wake cycles
  • Data available: Full SVD spectra for all FFN and attention layers, TT optimization sweep (7 configurations), per-layer compression reports, quality gate audit log

All artifacts and the compressed model checkpoint (standard safetensors) are archived and reproducible.

9. Open Questions

  • Llama 3/4, Mistral, DeepSeek — same flat spectrum? We'd bet yes, but someone needs to check. The methodology is straightforward: load one layer, run torch.linalg.svd(), plot the singular values. If anyone does this, please tag me.
  • Do vision-language models have different compressibility? The vision encoder in Qwen3.5-9B showed similar behavior, but a dedicated VLM study would be valuable.
  • Can we train models to be compressible? Rank regularization during training, vocabulary pruning during pre-training, or architectural constraints (bottleneck layers) could produce weights with genuine low-rank structure.
  • KV-cache compression may be the better target. We only touched weights. If activations and attention states show low-rank dynamics, that space might be more fertile than weight compression.

10. Next: Qwen3.5-27B — Bigger Model, Same Framework

We're not done. The next target is Qwen3.5-27B, using the same Aura pipeline and agent lineup. Here's the logic:

Why 27B matters

The 9B result left one question dangling: does the near-full-rank property scale with model size? Two competing hypotheses:

  • Hypothesis A (pessimistic): All modern LLMs, regardless of size, have near-full-rank weights. 27B gives the same result — int4 ceiling, nothing beyond.
  • Hypothesis B (optimistic): Larger models develop more structured representations. With 3x the parameters, the model has excess capacity — some dimensions truly become redundant. SVD might actually work.

We need to know which one is true. 27B is large enough to test this meaningfully without being prohibitively expensive.

What's different at 27B

Factor 9B 27B Implication
FFN layers 32 ~48–64 (TBD) More of the pie in FFN, less relative weight on Embedding
Embedding bottleneck (% of params) 22.8% ~8–10% (est.) The 5x wall may be gone
FFN intermediate dim 12,288 likely larger SVD spectrum could be different at higher internal dim
Full Attn / Linear Attn ratio 8/24 likely more full attn Attention becomes a larger component
Total unquantized size ~19.3 GB ~55 GB Compression payoff is much larger if we succeed

The key change from 9B: the embedding bottleneck should be proportionally smaller. At 9B, Embedding+LM Head = 22.8% of parameters, making 5x structurally impossible. At 27B, FFN should dominate the parameter count, and if FFN can be compressed beyond int4, the total ratio could break 5x even if the embedding stubbornly stays at ~4x.

Final Conclusion

We failed to beat int4. After 28.6 hours of compute, 53 orchestrated tasks, and an exhaustive search over SVD ranks and TT configurations, Qwen3.5-9B maxes out at 3.93x compression under quality constraints — functionally identical to any off-the-shelf 4-bit quantizer.

But the failure is informative. We now have direct experimental evidence that:

  1. Modern LLM weight matrices are near-full-rank — SVD-based compression is mathematically impossible at useful ratios.
  2. Large-vocabulary embedding layers are the hardest compression target in a transformer.
  3. ~4x (int4) is the practical ceiling for post-hoc, quality-preserving LLM weight compression.
  4. Beautiful theory doesn't survive contact with real weights. Eckart-Young-Mirsky, Tensor-Train, Shannon rate-distortion — all correct, all inapplicable when the singular values refuse to decay.

If you're thinking about compressing an LLM with anything fancier than int4: run the SVD spectrum first. It will either save you 28 hours, or reveal something genuinely interesting. Ours did the former.

Full data, code, and the compressed model are available on request. Questions and replication attempts welcome — especially SVD spectra from other model families.

FYI, aura agent:

https://github.com/erickong/aura-agent

reddit.com

Title suggestion

I built Aura Agent: a goal-driven supervisor for long-running coding tasks, with workers, watchdogs, and reflection loops

Hi everyone, I’m open-sourcing a project called Aura Agent:
https://github.com/erickong/aura-agent

Aura is a two-layer autonomous task orchestrator for long-running coding goals.

Instead of asking one chat session to do everything, Aura runs a persistent Layer 1 orchestrator that wakes up periodically, checks evidence, updates a task tree, and launches bounded Layer 2 worker processes through backends like claude_code or ds_code.

It also has a lightweight watchdog layer around the loop: it monitors worker processes, wakes the orchestrator early when something stops or when an external wake signal appears, and helps prevent long-running tasks from silently drifting forever.

我做了一个双层自主编程 Agent。不是单次聊天式 coding agent,而是一个会周期性醒来、检查文件证据、维护任务树、启动/杀掉 worker、记录决策历史的长期任务编排器。它还有 watchdog 机制,用来监督 worker、处理提前唤醒、发现进程停止等情况。

Why I built it

Claude Code and similar CLI agents are powerful, but for multi-hour / multi-day tasks I kept wanting a higher-level supervisor:

  • What has actually been completed?
  • Which worker is stuck?
  • Did a task produce real files, or just say “done”?
  • What decision changed the task state, and what evidence supported it?
  • Can the system keep iterating without losing context?
  • Is the short-term work still aligned with the long-term goal?

Aura is my answer to that.

Architecture

goal.md
  -&gt; Aura Orchestrator
       - persistent task tree
       - progress report
       - memory
       - evidence-based decisions
       - periodic review / reflection
  -&gt; Watchdog
       - monitors worker processes
       - wakes the orchestrator early
       - detects stopped or stuck workers
  -&gt; Layer 2 Workers
       - claude_code or ds_code
       - isolated workspace per task
       - result.md / logs / artifacts

Aura can run from any project directory, similar to Claude Code. The current directory becomes the project root, and runtime state goes into .aura/.

Global setup is separate:

aura setup

This writes global config to:

~/.aura/config.env
# or on Windows:
%USERPROFILE%\.aura\config.env

Then in any project:

aura start --task-file goal.md

Reflection loop

Aura also has a configurable reflection system. By default, it can run a deeper review about once per hour.

The reflection system asks questions like:

  • Is the current short-term goal still aligned with the long-term mission?
  • Are we optimizing the wrong thing?
  • Are workers producing real progress or just activity?
  • Should we continue, replan, decompose, or switch direction?
  • What lessons should be written into long-term memory?

This is important because long-running agents can easily become busy without being useful. I wanted Aura to not only “keep working”, but periodically step back and ask whether the work still makes sense.

中文补充:
这个反思系统大概每小时运行一次,可配置。它会检查当前短期执行方向是否还符合长期目标,是否需要重新规划,是否有任务只是看起来很忙但没有真实产出。

Why DeepSeek makes this more practical

One reason this kind of system is becoming realistic now is cost.

A persistent orchestrator wakes up many times, reads state, checks progress, starts workers, kills stuck tasks, and reviews direction. If every cycle is expensive, the architecture becomes hard to justify.

DeepSeek v4 being relatively cheap changes the equation. It makes it much more practical to run a long-horizon supervisor loop instead of treating every agent run as a precious one-shot interaction.

I don’t think cheap models automatically solve autonomy. But they do make it possible to build systems that can afford to inspect, retry, reflect, and iterate.

中文简单说:
幸亏 DeepSeek v4 这种模型价格相对便宜,这种“长期运行 + 周期性检查 + 反思 + 迭代”的系统才真正可行。不然每次 wake、检查、总结、重规划都太贵,最后很难长期跑。

Example experiment

One of my test missions was intentionally aggressive:

>Self-iterate and improve ds-code, compare it against Claude as a baseline, reduce tool failure rate below 5%, and keep iterating on around 10 representative complex coding tasks until the CLI is faster / more accurate / more reliable.

Important: API keys were redacted before publishing.

The original mission asked Aura to:

  • Optimize a local ds-code directory using DeepSeek v4-pro.
  • Improve CLI efficiency, accuracy, and speed compared with Claude.
  • Reduce tool failure rate below 5%.
  • Build a benchmark loop against Claude on representative complex tasks.
  • Keep iterating until the metrics are met.

After about 2.5 hours, Aura had produced this progress snapshot:

Wake cycles: 21
Completed tasks: 17
Active tasks: 3
Failed tasks: 0
Blocked tasks: 0
Replans: 0

It decomposed the mission into work like:

  • define 10 representative complex benchmark tasks
  • build a comparison runner for ds_code vs Claude
  • run baseline comparisons
  • analyze prompt bottlenecks
  • analyze tool failure modes
  • profile CLI speed bottlenecks
  • apply CLI speed quick wins
  • deploy an optimized DeepSeek system prompt
  • fix critical tool issues
  • build a tool reliability test suite
  • verify tool failure rate

The tool reliability check reported 0% failure in the tested suite, while the full Claude-vs-ds-code benchmark was still running.

So I’m not claiming “it beat Claude” yet. The point is that Aura kept the experiment structured, measurable, and auditable instead of becoming a giant messy chat log.

中文补充:
这个例子里,是 Aura 自动把一个非常大的目标拆成可验证任务,持续运行 worker,记录每次状态变化的原因和证据,并在 worker 卡住时杀掉、重启或换策略。它更像一个长期项目经理 + 自动化执行监督器。

Compared with Hermes / OpenClaw / 小龙虾 style systems

Aura is inspired by self-evolving agent loops and task-ledger systems, but it is more conservative:

Area Aura approach
Concurrency Max 2 workers by default, quality over swarm size
State Persistent task tree + decision log
Completion Requires evidence from files/logs/artifacts
Watchdog Monitors workers and wakes the loop early
Reflection Periodic review of short-term direction vs long-term mission
Cost Small worker count, cached reads, compact context snapshots
Failure handling Worker health checks, stuck detection, state backups
Goal Long-running project completion, not just broad exploration

Glad to hear from you.

GitHub: https://github.com/erickong/aura-agent

u/Civil-Direction-6981 — 18 days ago