We Spent 28 Hours Trying to Beat Int4 on Qwen3.5-9B Using Spectral Decomposition. The Weights Are Near-Full-Rank.
How this was done: The entire research project was supervised and driven by Aura, an open-source multi-agent research framework. Aura decomposes research tasks into a dependency tree, spawns workers, and verifies results. For this project, the agent lineup was: DeepSeek V4 as the orchestrator (task planning, quality gate verification, replan decisions), Codex (OpenAI) as Pro workers (heavy compute: SVD on real weights, model assembly, final evaluation), and Qwen3.5-27B running locally as Flash workers (light compute: dependency installs, model downloads, scripting). All 53 tasks — including the 14 failures and subsequent rebuilds — were managed automatically by Aura's scheduling and verification loop. No human touched the pipeline during execution.
Bottom Line Up Front
What we attempted: A hybrid compression framework (SVD low-rank + Tensor-Train + 4-bit quantization) targeting 5x compression over FP16 — aiming to beat standard int4 by ~25%.
What we found: Qwen3.5-9B's weight matrices are near-full-rank. Every "smart" decomposition — SVD, Tensor-Train, kernel methods — collapsed to the same ~4x ceiling. The final 3.93x result is mathematically equivalent to plain int4 quantization. After 28.6 hours, 53 tasks, and 14 failures, we proved that standard int4 is optimal under realistic quality constraints for this model.
The headline number: 3.93x compression, ppl=10.60 (baseline 6.71, +58%). No better than GPTQ. No better than AWQ. No better than bitsandbytes. The 5x target is structurally impossible — blocked by the embedding layer's irreducible information content.
Is this a failure? As a compression result, yes. As a research finding, no. We now know with experimental certainty that post-hoc low-rank methods cannot beat int4 on this generation of LLM weights. That's worth publishing.
0. A Quick Note on Qwen3.5-9B
Before diving into the compression story, a few words about the model itself — because its architecture matters for understanding why it resists compression.
Qwen3.5-9B is not a vanilla dense transformer. It has a hybrid attention architecture: 8 layers of full self-attention followed by 24 layers of linear attention (DeltaNet-style). This is Alibaba's bet on efficient long-context inference — replace the O(n²) quadratic attention with O(n) linear attention in most layers.
Notable architectural choices:
| Feature | Qwen3.5-9B | Llama 3.1 8B (for reference) |
|---|---|---|
| Vocabulary | 248,320 | 128,000 |
| Hidden dim | 4,096 | 4,096 |
| FFN intermediate | 12,288 (3.0x) | 14,336 (3.5x) |
| Attention heads (Q) | 16 | 32 |
| KV heads (GQA) | 4 | 8 |
| Full attn layers | 8 | 32 |
| Linear attn layers | 24 | 0 |
| Vision encoder | Yes (ViT) | No |
Three things stand out for compression:
- The vocabulary is massive. 248K tokens at 4096 dimensions = 2.03 GB for the embedding alone. That's 11.4% of the model in two tensors. For comparison, Llama 3.1 8B's embedding is 128K × 4096 = 1.05 GB — half the size. This single design decision makes 5x compression nearly impossible before you touch a single FFN weight.
- The 3.0x FFN expansion ratio is on the aggressive side. With intermediate_size=12288 (vs hidden=4096), each FFN layer's gate/up projections are [12288, 4096] — wide, fat matrices. This is where 54.2% of the parameters live. If these matrices had low-rank structure, the payoff would be enormous. (Spoiler: they don't.)
- The hybrid attention design creates two distinct "species" of weights. The 8 full-attention layers and 24 linear-attention layers have different projection structures — different shapes, different roles, different sensitivity to perturbation. A uniform compression strategy is unlikely to be optimal across both.
The model's FP16 perplexity on wikitext-2 is 6.71 — solid for a 9B model, roughly on par with Llama 3 8B. The hybrid architecture seems to work: you get long-context efficiency without sacrificing short-context quality. But the architectural choices that make it a good model also make it a compression researcher's nightmare.
1. The Ambitious Plan: SAHC Framework
We set out to beat standard 4-bit quantization (~4x compression) using what we called Spectrum-Aware Hybrid Compression (SAHC). The idea: different transformer components have different spectral structure, so compress them differently.
Architecture Breakdown & Attack Plan
| Component | % of Params | Method | Theoretical Ratio |
|---|---|---|---|
| C1: FFN (gate/up/down) | 54.2% | SVD low-rank + 4-bit quant | 25.3x |
| C2: Full Attention (8 layers) | 3.8% | EYM low-rank + head pruning + 4-bit | 8.0x |
| C3: Linear Attention (24 layers) | 11.3% | Kernel SVD + value pruning + 4-bit | 12.6x |
| C4: Emb + LM Head | 22.8% | Tensor-Train + SVD + 4-bit | 20.5x |
| C5: Vision + MTP | 7.9% | Standard 4-bit | ~4x |
Projected total: 10–13x compression with controlled perplexity degradation.
The Theory
Eckart-Young-Mirsky Theorem. For any real matrix W, the optimal rank-r approximation under Frobenius norm is given by truncated SVD:
W_r = U_r Σ_r V_r^T
with error:
||W - W_r||_F / ||W||_F = sqrt( Σ_{i=r+1}^{d} σ_i² / Σ_{i=1}^{d} σ_i² )
The key variable is the singular value decay rate. For exponential decay (σ_i ∝ e^({-λi}),) rank r=512 captures >99.9% of the Frobenius norm even at d=4096. For flat decay, you capture essentially nothing. Everything depends on what the actual weights look like.
Tensor-Train (TT) Decomposition. For the 248,320 × 4,096 embedding matrix, we reshape into a high-order tensor and factorize across modes:
W[i, j] ≈ G₁[i, r₁] · G₂[r₁, r₂] · G₃[r₂, j]
With TT-ranks [1, 32, 32, 1] and an SVD core at rank 1,024, the theoretical compression is ~175x with error bound ε ≤ 0.017. The factorization exploits the fact that vocabulary tokens should be distinguishable in far fewer than 4,096 dimensions — or so the theory goes.
Shannon Rate-Distortion. Information-theoretic lower bounds gave three regimes:
- Independent coding (int4): ~3.3 bits/weight → ~4.8x
- Joint coding (cross-layer correlation): ~1.0–2.0 bits/weight → ~8–16x
- Spectral coding (low-rank structure): ~0.1–0.3 bits/weight → 50–178x
The math was clean. The bounds were tight. Everything was provable on paper.
We were wrong.
2. The First Red Flag: SVD Error 21x Worse Than Theory
We compressed all 32 FFN layers using SVD rank-512 + 4-bit quantization:
Compression ratio: 24.00x (logical) / 7.90x (actual with .pt overhead)
Mean SVD error: 0.834
Mean total error: 0.982
To put those numbers in perspective:
- SVD error 0.83: the rank-512 approximation throws away more signal than noise
- Total error 0.98–1.02: in 7 of 32 layers, the error exceeded 1.0 — the compressed matrix is worse than replacing it with zeros. A zero matrix has error = 1.0 by definition. We achieved worse than nothing.
- Theoretical bound: Eckart-Young-Mirsky predicted ε ≤ 0.039 at r=512
That's a 21x gap. The automated pipeline didn't flag it — it just reported "24.00x compression" and moved on to the next step. The compressed model's perplexity was infinity. Not "bad." Not "degraded." Literally inf — the token probabilities collapsed to zero for legitimate tokens.
The Root Cause
We ran a full SVD on FFN layer 15 to inspect the spectrum:
Singular values of gate_proj [12288, 4096]:
σ₁ = 4.65
σ₂ = 4.05
σ₃ = 3.65
σ₄ = 3.55
...
σ₅₁₂ = 2.03 ← 44% of σ₁ — barely any decay
σ₁₀₂₄ = 1.72 ← 37% of σ₁
σ₂₀₄₈ = 1.18 ← 25% of σ₁
σ₄₀₉₆ = 0.95 ← 20% of σ₁ — the smallest is still substantial
The spectrum is essentially flat. Condition number σ_max/σ_min ≈ 4.9. An exponentially-decaying matrix might have σ_max/σ_min ≈ 10⁶ — that's the regime where SVD compression works.
Minimum rank needed for SVD error < 0.10:
- gate_proj: 3,794 / 4,096 → 92.6% of full rank
- up_proj: 3,814 / 4,096 → 93.1% of full rank
- down_proj: 3,755 / 4,096 → 91.7% of full rank
These matrices are near-full-rank. SVD truncation at any meaningful compression level is mathematically guaranteed to fail.
This is a structural property of the weights, not an implementation bug. The Qwen3.5 training process produced FFN matrices where every dimension carries non-negligible information. The SWiGLU activation with 3.0x expansion likely contributes — with that much capacity, the model uses every degree of freedom.
3. Attention Was Even Worse
The same near-full-rank property holds across all attention projections (Q, K, V, O) in all 32 layers. But attention had an additional failure mode.
Even full-rank SVD + int4 (i.e., keeping all 4,096 singular values, only quantizing the factors) failed our forward-pass smoke test. Cosine similarity between original and reconstructed layer outputs fell below 0.95 for attention layers. Why?
The reconstruction error propagates through the multi-head attention computation differently than through an FFN. In FFN, the error is additive per-layer. In attention, the Q·K^(T) softmax amplifies small perturbations through the exponential, and the weighted sum over V compounds them across heads. The same Frobenius error budget that passes in an FFN causes divergence in attention.
Conclusion: SVD-based compression is unusable for Qwen3.5-9B attention layers. Direct weight-only int4 was the sole surviving option.
4. The Embedding Wall
The embedding matrix is 248,320 × 4,096. At FP16, that's 2.03 GB. The LM head is another 2.03 GB (tied weights in this architecture, so they mirror each other). Together they're 22.8% of the model.
Simple arithmetic: to reach 5x total compression when FFN and Attention are capped at ~4x by int4, Embedding + LM Head must deliver ≥100x compression. That's the only path.
We ran a sweep of 7 TT rank configurations:
TT Config Comp. Ratio Cosine Quality Gate (cos ≥ 0.95)
─────────────────────────────────────────────────────────────────────────
[1, 8, 8, 1] 1,424x 0.761 ❌ FAIL
[1, 16, 16, 1] 389x 0.815 ❌ FAIL
[1, 32, 32, 1] 113x 0.857 ❌ FAIL
[1, 64, 64, 1] 35x 0.895 ❌ FAIL
[1, 128, 128, 1] 12x 0.923 ❌ FAIL
[1, 256, 256, 1] 5x 0.941 ❌ FAIL ← only 0.009 away
[1, 512, 4096, 1] 3.92x 0.990 ✅ PASS ← full rank
Every configuration above 100x had cosine below 0.86 — catastrophic degradation. Even the [1,256,256,1] config at just 4.85x missed the gate by 0.009. Only full-rank (SVD core at 4,096) passed, giving 3.92x — indistinguishable from direct int4.
Why? Because 248,320 vocabulary tokens genuinely need nearly all 4,096 embedding dimensions to be distinguishable. The embedding is not learning a manifold of dimension ≪ 4096 — it's using the full space. This is likely a consequence of the multilingual, multi-script vocabulary (Chinese characters, Latin, code tokens, special symbols all coexist) plus the vision tokens from the ViT encoder. The semantic space is genuinely high-dimensional.
This is the structural blocker. The embedding doesn't compress because it can't compress — not under any decomposition — because its information content is genuinely ~4 bits per parameter. Qwen's decision to use a 248K vocabulary makes this worse than it would be for a comparable model with a 128K vocabulary, but the fundamental issue (no low-rank structure in learned embeddings) likely applies broadly.
5. Final Results
After abandoning SVD and TT decomposition, the SAHC framework collapsed to its baseline:
| Component | Final Method | Compression |
|---|---|---|
| FFN (54.2%) | Row-wise direct int4 | 3.99x |
| Attention (15.1%) | Row-wise direct int4 | 3.99x |
| Emb / LM Head (22.8%) | Full-rank TT + int4 | 3.92x |
| Vision + MTP + Norm (7.9%) | Row-wise int4 | 3.13x–3.89x |
| Total | 3.93x |
Perplexity: 10.60 (baseline 6.71, +58%). Acceptable under our ppl < 100 threshold, but a clear degradation from int4 quantization noise.
Every component converged to the same ~4x ratio. The methods differ in name (TT+SVD+int4 vs direct int4), but the information retained and discarded is mathematically identical — 4 bits per parameter in all cases.
6. What This Tells Us About Qwen3.5-9B (and Modern LLMs Generally)
The Model Itself
Qwen3.5-9B is architecturally interesting. The hybrid full/linear attention split (8+24) is a genuine innovation — it's not just "another dense transformer." The model achieves competitive perplexity (6.71 on wikitext-2) with a design optimized for long-context efficiency. For practitioners, it's a solid choice in the 9B class, particularly for tasks that benefit from long context windows where the linear attention layers shine.
But from a compression perspective, the architecture has structural anti-compression properties:
- The 248K vocabulary is a hard blocker. It's 1.94x larger than Llama 3's 128K. Every byte of compression you gain elsewhere is diluted by these two immovable 2GB tensors. If Alibaba released a Qwen variant with a 128K vocabulary (still large for a multilingual model), the 5x target might actually be reachable.
- The 3.0x FFN expansion creates matrices where rank reduction is impossible. With more intermediate dimensions than necessary, the model fills them all. A smaller expansion (2.5x or 2.0x) would produce matrices with genuine low-rank structure.
- The linear attention layers compress differently from full attention, but both end up at the same ceiling. The architecture is diverse, but the irreducible information content per weight is uniform (~4 bits).
The Bigger Picture
Our bet is that this near-full-rank property is not unique to Qwen. Modern LLMs are trained with:
- AdamW optimizers that don't impose rank regularization
- Large learning rate schedules that explore full parameter space
- Massive data budgets that saturate model capacity
- No structural bottleneck that would force low-rank representations
Every indication suggests that Llama, Mistral, DeepSeek, and other modern LLMs have similarly incompressible weights. If you're working on post-hoc LLM compression, run an SVD on one layer first. Ten minutes of compute will tell you whether your method has any chance.
The most productive compression research directions for modern LLMs are likely:
- Quantization-aware training (QAT) — bake low-bit representations into the training process
- Pruning during pre-training — let the model learn which parameters to drop
- Knowledge distillation — train a smaller model from scratch with the large model as teacher
- Architecture co-design — design the model for compressibility from day one (vocabulary size, FFN ratio, attention head count)
Post-hoc compression of already-trained weights with "smart math" may simply have hit its limit at int4.
7. Practical Lessons for Compression Research
- Run an SVD before you design the pipeline. A single-layer singular value spectrum takes 10 minutes and tells you whether your entire approach is viable. We learned this the hard way after 28 hours.
- Theoretical compression ratios are fantasy without spectral validation. Every component's theoretical bound was 5–200x above reality because the rank assumptions were false. EYM is correct; the premise ("weights are low-rank") is wrong.
- Automatic pipelines need quality gates. Our system initially reported "24.00x compression" for matrices with error worse than the zero baseline. Adding mandatory per-task error reporting, theoretical-vs-actual comparison, and forward-pass smoke tests (cosine ≥ 0.95) caught every subsequent failure before it propagated downstream.
- The 3.9–4.0x ceiling is not an algorithmic limitation — it's an information-theoretic one. Int4 quantization is near-optimal not because we lack clever algorithms, but because modern LLM weights genuinely contain ~4 bits of non-redundant information per 16-bit parameter. Additional compression requires removing information, not just rearranging it.
- Embedding layers are the bottleneck no one talks about. In Qwen3.5-9B, the embedding and LM head eat 22.8% of storage. A 248K vocabulary at 4096 dimensions is 1 billion parameters that refuse any decomposition. Vocabulary design is a compression decision, not just a tokenization decision.
8. Reproducibility
- Model: Qwen3.5-9B (Qwen/Qwen3.5-9B on HuggingFace)
- Environment: 5090 32G Docker workers, Python 3.10, PyTorch 2.5.1
- Evaluation: wikitext-2-raw-v1, 500-token perplexity
- Quality gates enforced: SVD truncation error, quantization error, total reconstruction error (<1.0 baseline), theoretical bound comparison, forward cosine similarity (≥0.95), tensor shape match (775/775)
- Task count: 53 tasks (39 completed, 14 failed), 28.6 hours, 281 wake cycles
- Data available: Full SVD spectra for all FFN and attention layers, TT optimization sweep (7 configurations), per-layer compression reports, quality gate audit log
All artifacts and the compressed model checkpoint (standard safetensors) are archived and reproducible.
9. Open Questions
- Llama 3/4, Mistral, DeepSeek — same flat spectrum? We'd bet yes, but someone needs to check. The methodology is straightforward: load one layer, run
torch.linalg.svd(), plot the singular values. If anyone does this, please tag me. - Do vision-language models have different compressibility? The vision encoder in Qwen3.5-9B showed similar behavior, but a dedicated VLM study would be valuable.
- Can we train models to be compressible? Rank regularization during training, vocabulary pruning during pre-training, or architectural constraints (bottleneck layers) could produce weights with genuine low-rank structure.
- KV-cache compression may be the better target. We only touched weights. If activations and attention states show low-rank dynamics, that space might be more fertile than weight compression.
10. Next: Qwen3.5-27B — Bigger Model, Same Framework
We're not done. The next target is Qwen3.5-27B, using the same Aura pipeline and agent lineup. Here's the logic:
Why 27B matters
The 9B result left one question dangling: does the near-full-rank property scale with model size? Two competing hypotheses:
- Hypothesis A (pessimistic): All modern LLMs, regardless of size, have near-full-rank weights. 27B gives the same result — int4 ceiling, nothing beyond.
- Hypothesis B (optimistic): Larger models develop more structured representations. With 3x the parameters, the model has excess capacity — some dimensions truly become redundant. SVD might actually work.
We need to know which one is true. 27B is large enough to test this meaningfully without being prohibitively expensive.
What's different at 27B
| Factor | 9B | 27B | Implication |
|---|---|---|---|
| FFN layers | 32 | ~48–64 (TBD) | More of the pie in FFN, less relative weight on Embedding |
| Embedding bottleneck (% of params) | 22.8% | ~8–10% (est.) | The 5x wall may be gone |
| FFN intermediate dim | 12,288 | likely larger | SVD spectrum could be different at higher internal dim |
| Full Attn / Linear Attn ratio | 8/24 | likely more full attn | Attention becomes a larger component |
| Total unquantized size | ~19.3 GB | ~55 GB | Compression payoff is much larger if we succeed |
The key change from 9B: the embedding bottleneck should be proportionally smaller. At 9B, Embedding+LM Head = 22.8% of parameters, making 5x structurally impossible. At 27B, FFN should dominate the parameter count, and if FFN can be compressed beyond int4, the total ratio could break 5x even if the embedding stubbornly stays at ~4x.
Final Conclusion
We failed to beat int4. After 28.6 hours of compute, 53 orchestrated tasks, and an exhaustive search over SVD ranks and TT configurations, Qwen3.5-9B maxes out at 3.93x compression under quality constraints — functionally identical to any off-the-shelf 4-bit quantizer.
But the failure is informative. We now have direct experimental evidence that:
- Modern LLM weight matrices are near-full-rank — SVD-based compression is mathematically impossible at useful ratios.
- Large-vocabulary embedding layers are the hardest compression target in a transformer.
- ~4x (int4) is the practical ceiling for post-hoc, quality-preserving LLM weight compression.
- Beautiful theory doesn't survive contact with real weights. Eckart-Young-Mirsky, Tensor-Train, Shannon rate-distortion — all correct, all inapplicable when the singular values refuse to decay.
If you're thinking about compressing an LLM with anything fancier than int4: run the SVD spectrum first. It will either save you 28 hours, or reveal something genuinely interesting. Ours did the former.
Full data, code, and the compressed model are available on request. Questions and replication attempts welcome — especially SVD spectra from other model families.
FYI, aura agent: