▲ 0 r/deeplearning
Reduction of matrix sizes in signal processing and AI
What do you think about ways to reduce matrix dimensions to cut down memory usage on GPUs? I’ve put together a few options here. A framework that can map and combine these elements would be great and useful.
1. Classic signal processing: matrix reduction
1.1 Standard mathematical procedures
Singular Value Decomposition (SVD)
- Decomposition: A = UΣVᵀ
- Truncation of small singular values → rank k approximation
- Optimal approximation in the Frobenius norm sense (Eckart-Young theorem)
- Memory reduction from m×n to k(m+n+1)
Principal Component Analysis (PCA)
- Dimensional reduction by projection onto the directions of maximum variance
- Closely related to SVD of centered data matrix
QR decomposition with rank revelation (Rank-Revealing QR)
- Identification of numerically irrelevant columns/rows
Random Projections (Johnson-Lindenstrauss)
- Projection into low-dimensional space with approximate distance preservation
- Extremely efficient, theoretically sound
Sparse Coding / Dictionary Learning
- Representation as a thin linear combination of basis vectors
- A ≈ D X with ||X||₀ minimal
2. Established compression methods for AI weight matrices
2.1 Structural decompositions
Low Rank Factorization
- Replace W (m×n) by W ≈ W₁·W₂ with W₁∈ℝᵐˣʳ, W₂∈ℝʳˣⁿ
- Parameters: mn → r(m+n)
- Variants: SVD-based, NMF (Non-negative Matrix Factorization)
Tensor Decomposition
- CP decomposition, Tucker decomposition, tensor train
- Particularly relevant for convolution kernels (4D tensors)
- Example: A 3×3×512×512 kernel can be dramatically compressed
Kronecker product approximation
- W ≈ A ⊗ B (Kronecker product of smaller matrices)
- Extreme parameter reduction: mn → (m₁n₁ + m₂n₂) instead of m₁m₂×n₁n₂
Block Diagonal Structures
- Force block diagonal shape → reduces interactions between groups
- Related to "Grouped Convolutions"
2.2 Pruning (thinning)
Unstructured Pruning
- Set individual weights to zero (magnitude pruning)
- Lottery Ticket Hypothesis (Frankle & Carlin, 2019)
- Storage as a sparse matrix (CSR, CSC, COO format)
- Achievable: 90-99% sparsity with minimal loss of accuracy
Structured Pruning
- Remove entire filters, channels, attention heads
- More hardware friendly as regular die sizes are retained
- Criteria: L1 norm, Taylor expansion, sensitivity analysis
Semi-structured pruning (N:M sparsity)
- NVIDIA Ampere: 2:4 sparsity (2 out of 4 values are zero)
- Direct hardware support, ~2× speedup
2.3 Quantization
Post Training Quantization (PTQ)
- FP32 → FP16 → INT8 → INT4 → Binary
- GPTQ, AWQ, SqueezeLLM
- 4-bit quantization is now standard for LLMs
Quantization-Aware Training (QAT)
- Training with simulated quantization
- Straight-through estimator for gradients using non-differentiable rounding
Mixed Precision
- Different shifts/operations with different precision
- Sensitivity analysis determines optimal bit widths
Extreme quantization
- Binary Networks (XNOR-Net): Weights ∈ {-1, +1}
- Ternary Networks: Weights ∈ {-1, 0, +1}
- Matrix multiplication becomes bit operations
Vector quantization
- Weights are converted into codebook indices
- Product Quantization: Division into sub-vectors
- Example: 256 codebook entries → 8 bits per sub-vector
2.4 Knowledge Distillation
- Large “teacher” network trains small “student” network
- Student learns soft probability distributions (soft labels)
- Effective: The information of the large matrix is transferred into a smaller one
2.5 Weight Sharing
Hash-based weight sharing
- HashedNets: Hash function assigns many positions to the same weight
- Drastic reduction of free parameters
Cluster-based weight sharing
- k-means clustering of the weights
- Storage: Codebook + Index Matrix
- Deep Compression (Han et al., 2016): Pruning + Quantization + Huffman Coding
3. Modern and advanced procedures
3.1 Low-Rank Adaptation (LoRA) and variants
LoRA
- W = W₀ + ΔW = W₀ + BA with B∈ℝᵐˣʳ, A∈ℝʳˣⁿ
- Pre-trained weights remain frozen
- Only r(m+n) trainable parameters instead of mn
- Typically: r = 4-64 for matrices with m,n > 4096
QLoRA
- Combination: 4-bit quantized base weights + LoRA adapter
- Allows fine-tuning of 65B models on a GPU
DoRA (Weight-Decomposed Low-Rank Adaptation)
- Decomposition into magnitude and direction
AdaLoRA
- Adaptive rank per shift based on importance
3.2 Structured State Space Models (as an architectural alternative)
- Mamba and similar architectures
- Replace dense attention matrices with structured, efficient state space models
- Implicit compression through different calculation structure
3.3 Mixture of Experts (MoE)
- Not all weights are activated for every input
- Effective "conditional" matrix reduction
- Example: Mixtral 8×7B has 47B parameters, but only uses ~13B per token
4. Conceivable / Speculative Compression Methods
This is where things get particularly interesting. The following approaches are partly in early phases of research, partly purely conceptual:
4.1 Information theoretical approaches
Kolmogorov complexity-inspired compression
- Store weight matrices not as numbers, but as programs that generate them
- A matrix with 1 billion parameters could be described by a small program + seed
- Related to "hypernetworks" that generate weights
Minimum Description Length (MDL) as a training goal
- Not only optimize loss, but at the same time minimize the description length of the model
- Automatically results in compressible weight structures
Rate Distortion Theoretical Optimization
- Formalization: Find the compression that produces the minimum quality loss for a given bitrate budget
- Could be optimized layer by layer or globally
4.2 Generative weight compression
Implicit Neural Representations (INR) for weights
- Instead of storing weights explicitly, train a small neural network that generates the weights of the large network as a function of position
- W(i,j) = f_θ(i,j) with θ ≪ m n
- First work: "Neural Network Bundling" (partially researched)
Fractal / self-similar structures
- Observation: Weight matrices from different layers often show similar statistical patterns
- Save a "base pattern" + transformation rules
- Biologically inspired: DNA encodes trillions of synapses with only ~20,000 genes
Procedural weight generation
- Weights are generated from a few parameters using deterministic algorithms
- Similar to procedural texture generation in computer graphics
- Potential: Extreme compression rates if weight structures are sufficiently regular
4.3 Algebraic structure exploitation
Circulant and Toeplitz matrices
- Storage in O(n) instead of O(n²)
- Multiplication via FFT in O(n log n)
- Enforce this structure when training → "Structured Efficient Linear Layers"
- Already partially researched, but not widely used
Butterfly Matrices / Kaleidoscope Matrices
- Factorization into products of thin, structured matrices
- Generalization of FFT-like structures
- W = B₁ · B₂ · ... · Bₖ with only O(n) non-zero entries each
- Parameter reduction: O(n²) → O(n log n)
- Monarch Matrices (Dao et al., 2022) are a concrete example
Wavelet-based decomposition
- Application of wavelet transforms to weight matrices
- Preservation of multiscale structures
- Thresholding of small coefficients → natural sparsity
4.4 Algebraic geometry and manifold approaches
Weights on low dimensional manifolds
- Hypothesis: All "good" weight configurations lie on a low-dimensional manifold in parameter space
- Learn the variety, not the individual points
- Parameterization using a few intrinsic coordinates
Lie group parameterization
- Orthogonal/unitary weight matrices as exponential mapping
- W = exp(A) with A antisymmetric
- Reduction from n² to n(n-1)/2 parameters + structural advantages
4.5 Biologically Inspired Approaches
Genomic compression
- The human brain has ~100 trillion synapses, encoded by ~750MB of DNA
- Compression ratio: ~1:10,000,000
- Principle: Don't store the weights, but rather the rules that create weights
- "Developmental Neural Networks": growth rules instead of explicit weights
Hebbian Reconstruction
- Save only the learning rule + training data statistics
- Reconstruct weights if necessary
- Extreme compression, but high computational effort for reconstruction
4.6 Quantum mechanics-inspired approaches
Tensor network methods (MPS, PEPS, MERA)
- From quantum physics: Matrix Product States
- Weight tensor is represented as a chain of small tensors
- Controllable accuracy via “bond dimension”
- Particularly effective with weights with limited “twist”
Holographic Compression
- Inspired by the holographic principle: information about a volume is encoded on the surface
- Speculative idea: describe 3D weight tensor by 2D representation at the "boundary".
4.7 Dynamic / Adaptive Compression
Input-dependent weight reconstruction
- Save a compressed base
- A slightly different set of weights is reconstructed for each input
- Combination of compression and adaptivity
Progressive decompression
- Similar to progressive JPEG
- First bits give a rough approximation, further bits refine
- Enables Anytime Inference: Better quality with more computing time
Neuromorphic Sparse Coding
- Weights are encoded by spike timing patterns
- Inherently compressed by temporal sparsity
4.8 Cryptography-inspired approaches
Pseudo-random weight generation
- Many weight components are "random enough"
- Save: Structured component + PRNG seed for the "random" component
- W = W_structured + PRNG(seed, shape)
- The structured component contains the “learned information”
4.9 Information geometric compression
Fisher Information Based Compression
- Weights with low Fisher information contribute little to model performance
- Compress more aggressively along directions of low curvature in the loss landscape
- Theoretically optimal, practically complex to calculate
Natural Gradient Compression
- Transform into the “natural” parameter space
- There more even information distribution → more efficient quantization
5. Combination approaches
The strongest compression comes from a combination:
Deep Compression Pipeline (Han et al., extended):
┌──────────────┐
│ Training │
└──────┬───────┘
↓
┌──────────────┐
│ Pruning │ → 90% sparsity
└──────┬───────┘
↓
┌──────────────┐
│ Low-Rank │ → Rank reduction
│ Factorization│
└──────┬───────┘
↓
┌──────────────┐
│ Quantization │ → 4-bit
└──────┬───────┘
↓
┌──────────────┐
│ Weight │ → Codebook
│ Sharing │
└──────┬───────┘
↓
┌──────────────┐
│ Entropy- │ → Huffman/ANS
│ Coding │
└──────────────┘
Total compression: 50-100×
u/DrEric2026 — 23 days ago