u/DrEric2026

Reduction of matrix sizes in signal processing and AI

What do you think about ways to reduce matrix dimensions to cut down memory usage on GPUs? I’ve put together a few options here. A framework that can map and combine these elements would be great and useful.

1. Classic signal processing: matrix reduction

1.1 Standard mathematical procedures

Singular Value Decomposition (SVD)

  • Decomposition: A = UΣVᵀ
  • Truncation of small singular values → rank k approximation
  • Optimal approximation in the Frobenius norm sense (Eckart-Young theorem)
  • Memory reduction from m×n to k(m+n+1)

Principal Component Analysis (PCA)

  • Dimensional reduction by projection onto the directions of maximum variance
  • Closely related to SVD of centered data matrix

QR decomposition with rank revelation (Rank-Revealing QR)

  • Identification of numerically irrelevant columns/rows

Random Projections (Johnson-Lindenstrauss)

  • Projection into low-dimensional space with approximate distance preservation
  • Extremely efficient, theoretically sound

Sparse Coding / Dictionary Learning

  • Representation as a thin linear combination of basis vectors
  • A ≈ D X with ||X||₀ minimal

2. Established compression methods for AI weight matrices

2.1 Structural decompositions

Low Rank Factorization

  • Replace W (m×n) by W ≈ W₁·W₂ with W₁∈ℝᵐˣʳ, W₂∈ℝʳˣⁿ
  • Parameters: mn → r(m+n)
  • Variants: SVD-based, NMF (Non-negative Matrix Factorization)

Tensor Decomposition

  • CP decomposition, Tucker decomposition, tensor train
  • Particularly relevant for convolution kernels (4D tensors)
  • Example: A 3×3×512×512 kernel can be dramatically compressed

Kronecker product approximation

  • W ≈ A ⊗ B (Kronecker product of smaller matrices)
  • Extreme parameter reduction: mn → (m₁n₁ + m₂n₂) instead of m₁m₂×n₁n₂

Block Diagonal Structures

  • Force block diagonal shape → reduces interactions between groups
  • Related to "Grouped Convolutions"

2.2 Pruning (thinning)

Unstructured Pruning

  • Set individual weights to zero (magnitude pruning)
  • Lottery Ticket Hypothesis (Frankle & Carlin, 2019)
  • Storage as a sparse matrix (CSR, CSC, COO format)
  • Achievable: 90-99% sparsity with minimal loss of accuracy

Structured Pruning

  • Remove entire filters, channels, attention heads
  • More hardware friendly as regular die sizes are retained
  • Criteria: L1 norm, Taylor expansion, sensitivity analysis

Semi-structured pruning (N:M sparsity)

  • NVIDIA Ampere: 2:4 sparsity (2 out of 4 values are zero)
  • Direct hardware support, ~2× speedup

2.3 Quantization

Post Training Quantization (PTQ)

  • FP32 → FP16 → INT8 → INT4 → Binary
  • GPTQ, AWQ, SqueezeLLM
  • 4-bit quantization is now standard for LLMs

Quantization-Aware Training (QAT)

  • Training with simulated quantization
  • Straight-through estimator for gradients using non-differentiable rounding

Mixed Precision

  • Different shifts/operations with different precision
  • Sensitivity analysis determines optimal bit widths

Extreme quantization

  • Binary Networks (XNOR-Net): Weights ∈ {-1, +1}
  • Ternary Networks: Weights ∈ {-1, 0, +1}
  • Matrix multiplication becomes bit operations

Vector quantization

  • Weights are converted into codebook indices
  • Product Quantization: Division into sub-vectors
  • Example: 256 codebook entries → 8 bits per sub-vector

2.4 Knowledge Distillation

  • Large “teacher” network trains small “student” network
  • Student learns soft probability distributions (soft labels)
  • Effective: The information of the large matrix is transferred into a smaller one

2.5 Weight Sharing

Hash-based weight sharing

  • HashedNets: Hash function assigns many positions to the same weight
  • Drastic reduction of free parameters

Cluster-based weight sharing

  • k-means clustering of the weights
  • Storage: Codebook + Index Matrix
  • Deep Compression (Han et al., 2016): Pruning + Quantization + Huffman Coding

3. Modern and advanced procedures

3.1 Low-Rank Adaptation (LoRA) and variants

LoRA

  • W = W₀ + ΔW = W₀ + BA with B∈ℝᵐˣʳ, A∈ℝʳˣⁿ
  • Pre-trained weights remain frozen
  • Only r(m+n) trainable parameters instead of mn
  • Typically: r = 4-64 for matrices with m,n > 4096

QLoRA

  • Combination: 4-bit quantized base weights + LoRA adapter
  • Allows fine-tuning of 65B models on a GPU

DoRA (Weight-Decomposed Low-Rank Adaptation)

  • Decomposition into magnitude and direction

AdaLoRA

  • Adaptive rank per shift based on importance

3.2 Structured State Space Models (as an architectural alternative)

  • Mamba and similar architectures
  • Replace dense attention matrices with structured, efficient state space models
  • Implicit compression through different calculation structure

3.3 Mixture of Experts (MoE)

  • Not all weights are activated for every input
  • Effective "conditional" matrix reduction
  • Example: Mixtral 8×7B has 47B parameters, but only uses ~13B per token

4. Conceivable / Speculative Compression Methods

This is where things get particularly interesting. The following approaches are partly in early phases of research, partly purely conceptual:

4.1 Information theoretical approaches

Kolmogorov complexity-inspired compression

  • Store weight matrices not as numbers, but as programs that generate them
  • A matrix with 1 billion parameters could be described by a small program + seed
  • Related to "hypernetworks" that generate weights

Minimum Description Length (MDL) as a training goal

  • Not only optimize loss, but at the same time minimize the description length of the model
  • Automatically results in compressible weight structures

Rate Distortion Theoretical Optimization

  • Formalization: Find the compression that produces the minimum quality loss for a given bitrate budget
  • Could be optimized layer by layer or globally

4.2 Generative weight compression

Implicit Neural Representations (INR) for weights

  • Instead of storing weights explicitly, train a small neural network that generates the weights of the large network as a function of position
  • W(i,j) = f_θ(i,j) with θ ≪ m n
  • First work: "Neural Network Bundling" (partially researched)

Fractal / self-similar structures

  • Observation: Weight matrices from different layers often show similar statistical patterns
  • Save a "base pattern" + transformation rules
  • Biologically inspired: DNA encodes trillions of synapses with only ~20,000 genes

Procedural weight generation

  • Weights are generated from a few parameters using deterministic algorithms
  • Similar to procedural texture generation in computer graphics
  • Potential: Extreme compression rates if weight structures are sufficiently regular

4.3 Algebraic structure exploitation

Circulant and Toeplitz matrices

  • Storage in O(n) instead of O(n²)
  • Multiplication via FFT in O(n log n)
  • Enforce this structure when training → "Structured Efficient Linear Layers"
  • Already partially researched, but not widely used

Butterfly Matrices / Kaleidoscope Matrices

  • Factorization into products of thin, structured matrices
  • Generalization of FFT-like structures
  • W = B₁ · B₂ · ... · Bₖ with only O(n) non-zero entries each
  • Parameter reduction: O(n²) → O(n log n)
  • Monarch Matrices (Dao et al., 2022) are a concrete example

Wavelet-based decomposition

  • Application of wavelet transforms to weight matrices
  • Preservation of multiscale structures
  • Thresholding of small coefficients → natural sparsity

4.4 Algebraic geometry and manifold approaches

Weights on low dimensional manifolds

  • Hypothesis: All "good" weight configurations lie on a low-dimensional manifold in parameter space
  • Learn the variety, not the individual points
  • Parameterization using a few intrinsic coordinates

Lie group parameterization

  • Orthogonal/unitary weight matrices as exponential mapping
  • W = exp(A) with A antisymmetric
  • Reduction from n² to n(n-1)/2 parameters + structural advantages

4.5 Biologically Inspired Approaches

Genomic compression

  • The human brain has ~100 trillion synapses, encoded by ~750MB of DNA
  • Compression ratio: ~1:10,000,000
  • Principle: Don't store the weights, but rather the rules that create weights
  • "Developmental Neural Networks": growth rules instead of explicit weights

Hebbian Reconstruction

  • Save only the learning rule + training data statistics
  • Reconstruct weights if necessary
  • Extreme compression, but high computational effort for reconstruction

4.6 Quantum mechanics-inspired approaches

Tensor network methods (MPS, PEPS, MERA)

  • From quantum physics: Matrix Product States
  • Weight tensor is represented as a chain of small tensors
  • Controllable accuracy via “bond dimension”
  • Particularly effective with weights with limited “twist”

Holographic Compression

  • Inspired by the holographic principle: information about a volume is encoded on the surface
  • Speculative idea: describe 3D weight tensor by 2D representation at the "boundary".

4.7 Dynamic / Adaptive Compression

Input-dependent weight reconstruction

  • Save a compressed base
  • A slightly different set of weights is reconstructed for each input
  • Combination of compression and adaptivity

Progressive decompression

  • Similar to progressive JPEG
  • First bits give a rough approximation, further bits refine
  • Enables Anytime Inference: Better quality with more computing time

Neuromorphic Sparse Coding

  • Weights are encoded by spike timing patterns
  • Inherently compressed by temporal sparsity

4.8 Cryptography-inspired approaches

Pseudo-random weight generation

  • Many weight components are "random enough"
  • Save: Structured component + PRNG seed for the "random" component
  • W = W_structured + PRNG(seed, shape)
  • The structured component contains the “learned information”

4.9 Information geometric compression

Fisher Information Based Compression

  • Weights with low Fisher information contribute little to model performance
  • Compress more aggressively along directions of low curvature in the loss landscape
  • Theoretically optimal, practically complex to calculate

Natural Gradient Compression

  • Transform into the “natural” parameter space
  • There more even information distribution → more efficient quantization

5. Combination approaches

The strongest compression comes from a combination:

Deep Compression Pipeline (Han et al., extended):
┌──────────────┐
│ Training     │
└──────┬───────┘
       ↓
┌──────────────┐
│ Pruning      │ → 90% sparsity
└──────┬───────┘
       ↓
┌──────────────┐
│ Low-Rank     │ → Rank reduction
│ Factorization│
└──────┬───────┘
       ↓
┌──────────────┐
│ Quantization │ → 4-bit
└──────┬───────┘
       ↓
┌──────────────┐
│ Weight       │ → Codebook
│ Sharing      │
└──────┬───────┘
       ↓
┌──────────────┐
│ Entropy-     │ → Huffman/ANS
│ Coding       │
└──────────────┘

Total compression: 50-100×

reddit.com
u/DrEric2026 — 23 days ago