Reduction of matrix sizes in signal processing and AI

What do you think about ways to reduce matrix dimensions to cut down memory usage on GPUs? I’ve put together a few options here. A framework that can map and combine these elements would be great and useful.

1. Classic signal processing: matrix reduction

1.1 Standard mathematical procedures

Singular Value Decomposition (SVD)

Decomposition: A = UΣVᵀ
Truncation of small singular values → rank k approximation
Optimal approximation in the Frobenius norm sense (Eckart-Young theorem)
Memory reduction from m×n to k(m+n+1)

Principal Component Analysis (PCA)

Dimensional reduction by projection onto the directions of maximum variance
Closely related to SVD of centered data matrix

QR decomposition with rank revelation (Rank-Revealing QR)

Identification of numerically irrelevant columns/rows

Random Projections (Johnson-Lindenstrauss)

Projection into low-dimensional space with approximate distance preservation
Extremely efficient, theoretically sound

Sparse Coding / Dictionary Learning

Representation as a thin linear combination of basis vectors
A ≈ D X with ||X||₀ minimal

2. Established compression methods for AI weight matrices

2.1 Structural decompositions

Low Rank Factorization

Replace W (m×n) by W ≈ W₁·W₂ with W₁∈ℝᵐˣʳ, W₂∈ℝʳˣⁿ
Parameters: mn → r(m+n)
Variants: SVD-based, NMF (Non-negative Matrix Factorization)

Tensor Decomposition

CP decomposition, Tucker decomposition, tensor train
Particularly relevant for convolution kernels (4D tensors)
Example: A 3×3×512×512 kernel can be dramatically compressed

Kronecker product approximation

W ≈ A ⊗ B (Kronecker product of smaller matrices)
Extreme parameter reduction: mn → (m₁n₁ + m₂n₂) instead of m₁m₂×n₁n₂

Block Diagonal Structures

Force block diagonal shape → reduces interactions between groups
Related to "Grouped Convolutions"

2.2 Pruning (thinning)

Unstructured Pruning

Set individual weights to zero (magnitude pruning)
Lottery Ticket Hypothesis (Frankle & Carlin, 2019)
Storage as a sparse matrix (CSR, CSC, COO format)
Achievable: 90-99% sparsity with minimal loss of accuracy

Structured Pruning

Remove entire filters, channels, attention heads
More hardware friendly as regular die sizes are retained
Criteria: L1 norm, Taylor expansion, sensitivity analysis

Semi-structured pruning (N:M sparsity)

NVIDIA Ampere: 2:4 sparsity (2 out of 4 values are zero)
Direct hardware support, ~2× speedup

2.3 Quantization

Post Training Quantization (PTQ)

FP32 → FP16 → INT8 → INT4 → Binary
GPTQ, AWQ, SqueezeLLM
4-bit quantization is now standard for LLMs

Quantization-Aware Training (QAT)

Training with simulated quantization
Straight-through estimator for gradients using non-differentiable rounding

Mixed Precision

Different shifts/operations with different precision
Sensitivity analysis determines optimal bit widths

Extreme quantization

Binary Networks (XNOR-Net): Weights ∈ {-1, +1}
Ternary Networks: Weights ∈ {-1, 0, +1}
Matrix multiplication becomes bit operations

Vector quantization

Weights are converted into codebook indices
Product Quantization: Division into sub-vectors
Example: 256 codebook entries → 8 bits per sub-vector

2.4 Knowledge Distillation

Large “teacher” network trains small “student” network
Student learns soft probability distributions (soft labels)
Effective: The information of the large matrix is transferred into a smaller one

2.5 Weight Sharing

Hash-based weight sharing

HashedNets: Hash function assigns many positions to the same weight
Drastic reduction of free parameters

Cluster-based weight sharing

k-means clustering of the weights
Storage: Codebook + Index Matrix
Deep Compression (Han et al., 2016): Pruning + Quantization + Huffman Coding

3. Modern and advanced procedures

3.1 Low-Rank Adaptation (LoRA) and variants

LoRA

W = W₀ + ΔW = W₀ + BA with B∈ℝᵐˣʳ, A∈ℝʳˣⁿ
Pre-trained weights remain frozen
Only r(m+n) trainable parameters instead of mn
Typically: r = 4-64 for matrices with m,n > 4096

QLoRA

Combination: 4-bit quantized base weights + LoRA adapter
Allows fine-tuning of 65B models on a GPU

DoRA (Weight-Decomposed Low-Rank Adaptation)

Decomposition into magnitude and direction

AdaLoRA

Adaptive rank per shift based on importance

3.2 Structured State Space Models (as an architectural alternative)

Mamba and similar architectures
Replace dense attention matrices with structured, efficient state space models
Implicit compression through different calculation structure

3.3 Mixture of Experts (MoE)

Not all weights are activated for every input
Effective "conditional" matrix reduction
Example: Mixtral 8×7B has 47B parameters, but only uses ~13B per token

4. Conceivable / Speculative Compression Methods

This is where things get particularly interesting. The following approaches are partly in early phases of research, partly purely conceptual:

4.1 Information theoretical approaches

Kolmogorov complexity-inspired compression

Store weight matrices not as numbers, but as programs that generate them
A matrix with 1 billion parameters could be described by a small program + seed
Related to "hypernetworks" that generate weights

Minimum Description Length (MDL) as a training goal

Not only optimize loss, but at the same time minimize the description length of the model
Automatically results in compressible weight structures

Rate Distortion Theoretical Optimization

Formalization: Find the compression that produces the minimum quality loss for a given bitrate budget
Could be optimized layer by layer or globally

4.2 Generative weight compression

Implicit Neural Representations (INR) for weights

Instead of storing weights explicitly, train a small neural network that generates the weights of the large network as a function of position
W(i,j) = f_θ(i,j) with θ ≪ m n
First work: "Neural Network Bundling" (partially researched)

Fractal / self-similar structures

Observation: Weight matrices from different layers often show similar statistical patterns
Save a "base pattern" + transformation rules
Biologically inspired: DNA encodes trillions of synapses with only ~20,000 genes

Procedural weight generation

Weights are generated from a few parameters using deterministic algorithms
Similar to procedural texture generation in computer graphics
Potential: Extreme compression rates if weight structures are sufficiently regular

4.3 Algebraic structure exploitation

Circulant and Toeplitz matrices

Storage in O(n) instead of O(n²)
Multiplication via FFT in O(n log n)
Enforce this structure when training → "Structured Efficient Linear Layers"
Already partially researched, but not widely used

Butterfly Matrices / Kaleidoscope Matrices

Factorization into products of thin, structured matrices
Generalization of FFT-like structures
W = B₁ · B₂ · ... · Bₖ with only O(n) non-zero entries each
Parameter reduction: O(n²) → O(n log n)
Monarch Matrices (Dao et al., 2022) are a concrete example

Wavelet-based decomposition

Application of wavelet transforms to weight matrices
Preservation of multiscale structures
Thresholding of small coefficients → natural sparsity

4.4 Algebraic geometry and manifold approaches

Weights on low dimensional manifolds

Hypothesis: All "good" weight configurations lie on a low-dimensional manifold in parameter space
Learn the variety, not the individual points
Parameterization using a few intrinsic coordinates

Lie group parameterization

Orthogonal/unitary weight matrices as exponential mapping
W = exp(A) with A antisymmetric
Reduction from n² to n(n-1)/2 parameters + structural advantages

4.5 Biologically Inspired Approaches

Genomic compression

The human brain has ~100 trillion synapses, encoded by ~750MB of DNA
Compression ratio: ~1:10,000,000
Principle: Don't store the weights, but rather the rules that create weights
"Developmental Neural Networks": growth rules instead of explicit weights

Hebbian Reconstruction

Save only the learning rule + training data statistics
Reconstruct weights if necessary
Extreme compression, but high computational effort for reconstruction

4.6 Quantum mechanics-inspired approaches

Tensor network methods (MPS, PEPS, MERA)

From quantum physics: Matrix Product States
Weight tensor is represented as a chain of small tensors
Controllable accuracy via “bond dimension”
Particularly effective with weights with limited “twist”

Holographic Compression

Inspired by the holographic principle: information about a volume is encoded on the surface
Speculative idea: describe 3D weight tensor by 2D representation at the "boundary".

4.7 Dynamic / Adaptive Compression

Input-dependent weight reconstruction

Save a compressed base
A slightly different set of weights is reconstructed for each input
Combination of compression and adaptivity

Progressive decompression

Similar to progressive JPEG
First bits give a rough approximation, further bits refine
Enables Anytime Inference: Better quality with more computing time

Neuromorphic Sparse Coding

Weights are encoded by spike timing patterns
Inherently compressed by temporal sparsity

4.8 Cryptography-inspired approaches

Pseudo-random weight generation

Many weight components are "random enough"
Save: Structured component + PRNG seed for the "random" component
W = W_structured + PRNG(seed, shape)
The structured component contains the “learned information”

4.9 Information geometric compression

Fisher Information Based Compression

Weights with low Fisher information contribute little to model performance
Compress more aggressively along directions of low curvature in the loss landscape
Theoretically optimal, practically complex to calculate

Natural Gradient Compression

Transform into the “natural” parameter space
There more even information distribution → more efficient quantization

5. Combination approaches

The strongest compression comes from a combination:

Deep Compression Pipeline (Han et al., extended):
┌──────────────┐
│ Training     │
└──────┬───────┘
       ↓
┌──────────────┐
│ Pruning      │ → 90% sparsity
└──────┬───────┘
       ↓
┌──────────────┐
│ Low-Rank     │ → Rank reduction
│ Factorization│
└──────┬───────┘
       ↓
┌──────────────┐
│ Quantization │ → 4-bit
└──────┬───────┘
       ↓
┌──────────────┐
│ Weight       │ → Codebook
│ Sharing      │
└──────┬───────┘
       ↓
┌──────────────┐
│ Entropy-     │ → Huffman/ANS
│ Coding       │
└──────────────┘

Total compression: 50-100×

u/DrEric2026