u/Friendly-Way-9318

Prototyping a sub-quadratic architecture with double-sided Lie updates to fix the non-Abelian tracking limits of diagonal SSMs

If you’ve been following the recent literature on State Space Models (like the Merrill et al. paper from a couple of months back), there is a well-documented mathematical wall we are hitting: the state transitions of diagonal SSMs like Mamba all commute, so their update dynamics are stuck being Abelian. At finite precision, they just can't cleanly track non-Abelian sequence transformations such as composed permutations.
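To make the commutativity point concrete, here is a minimal sketch in plain Python. It only compares the transition part of the update (the input-injection term is ignored), and the helper names `diag_step` and `perm_step` are made up for this illustration, not taken from any library.

```python
def diag_step(state, gate):
    """Diagonal transition: every state channel is scaled independently."""
    return [s * g for s, g in zip(state, gate)]

def perm_step(state, perm):
    """Permutation transition: new_state[i] = state[perm[i]]."""
    return [state[p] for p in perm]

state = [1.0, 2.0, 3.0]

# Two diagonal gates: the order of application does not matter.
g1, g2 = [0.5, 2.0, 1.0], [4.0, 0.25, 3.0]
diag_ab = diag_step(diag_step(state, g1), g2)
diag_ba = diag_step(diag_step(state, g2), g1)

# Two permutations from S3: the order of application does matter.
swap01, cycle = (1, 0, 2), (1, 2, 0)
perm_ab = perm_step(perm_step(state, swap01), cycle)
perm_ba = perm_step(perm_step(state, cycle), swap01)

print(diag_ab == diag_ba)  # True
print(perm_ab == perm_ba)  # False
```

Since the diagonal gates always commute, a product of them cannot encode which of two permutations came first. That is the gap the rest of this post is about.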

Linear attention variants try to fix this by expanding the state into a full matrix, but their reliance on standard scalar or vector decay gates heavily limits selective retrieval and expressive capacity.

For the past few weeks, I’ve been prototyping an alternative paradigm to bridge this gap: Double-Sided Non-Abelian State Tracking.

The high-level intuition is to break past the row-locked or column-locked representation limits of single-sided updates (which you see in recent quaternion setups like Q-Mamba). Instead, my architecture uses double-sided Lie group actions to independently rotate *both* the rows and columns of highly localized matrix sub-spaces.
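One possible reading of "locked" here, sketched in plain Python: a left-only orthogonal update S -> L S can never change the column Gram matrix SᵀS, while a two-sided update S -> L S R can. The update form, the 2x2 size, and the helper names are assumptions for illustration. The 90-degree rotation is used only because its entries are exact integers.

```python
def matmul(a, b):
    """Plain-Python matrix product for small list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def column_gram(s):
    """S^T S: the inner products between the columns of S."""
    return matmul(transpose(s), s)

rot90 = [[0, -1], [1, 0]]  # exact 90-degree rotation matrix
S = [[1, 2], [3, 4]]

left_only = matmul(rot90, S)                 # S -> L S
two_sided = matmul(matmul(rot90, S), rot90)  # S -> L S R

gram_before = column_gram(S)
gram_left = column_gram(left_only)
gram_two_sided = column_gram(two_sided)

print(gram_left == gram_before)       # True: left action leaves S^T S fixed
print(gram_two_sided == gram_before)  # False: the right action changes it
```

Both updates keep the Frobenius norm of S unchanged, so the extra freedom comes from how the state is oriented and not from rescaling it.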

This unlocks the full rotational degrees of freedom over the hidden state, and here is the best part: the binary operator that composes transitions remains strictly associative. That preserves the parallel prefix scan, so time and memory keep scaling linearly in sequence length at O(Nd).
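The associativity claim can be checked on a toy version. The sketch below assumes each step has the form S -> L S R + B, which is a guess at the general shape and not the exact recurrence from my prototype. The names `combine` and `apply_step` are hypothetical. Integer matrices are used so that the equality checks are exact.

```python
from functools import reduce

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def apply_step(step, s):
    """Apply one assumed update S -> L S R + B."""
    l, r, b = step
    return matadd(matmul(matmul(l, s), r), b)

def combine(first, second):
    """Fuse two steps into one equivalent step (apply first, then second)."""
    l1, r1, b1 = first
    l2, r2, b2 = second
    # L2 (L1 S R1 + B1) R2 + B2 = (L2 L1) S (R1 R2) + (L2 B1 R2 + B2)
    return (matmul(l2, l1), matmul(r1, r2),
            matadd(matmul(matmul(l2, b1), r2), b2))

steps = [
    ([[0, -1], [1, 0]], [[1, 1], [0, 1]], [[1, 0], [0, 0]]),
    ([[1, 2], [0, 1]], [[0, 1], [-1, 0]], [[0, 1], [0, 0]]),
    ([[2, 0], [1, 1]], [[1, 0], [3, 1]], [[0, 0], [1, 1]]),
    ([[1, 0], [0, 1]], [[0, -1], [1, 0]], [[2, 0], [0, 2]]),
]

left_to_right = reduce(combine, steps)  # sequential grouping
tree_order = combine(combine(steps[0], steps[1]),
                     combine(steps[2], steps[3]))  # scan-style grouping

S0 = [[1, 0], [0, 1]]
sequential = S0
for step in steps:
    sequential = apply_step(step, sequential)

print(left_to_right == tree_order)                # True
print(apply_step(tree_order, S0) == sequential)   # True
```

Because the grouping does not change the result, the steps can be fused pairwise in a tree, which is what a parallel prefix scan needs. Each `combine` costs a few small matrix products, so the constant factor depends on the block size.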

The Current Bottleneck:

While the forward-pass math checks out beautifully, training this thing is an absolute beast. Because the group manifold is non-Abelian and non-convex, using a standard Euclidean optimizer like AdamW to learn these transitions creates severe coordinate drift and gradient friction during backpropagation. Training currently stalls in the later phases of convergence.

I'm curious if anyone else here is experimenting with non-Abelian group recurrences or geometric deep learning for sequence models. How are you handling the optimization friction? Are you looking into explicit Lie algebra mappings (like Cayley-map parameterizations) or Riemannian gradient corrections to keep the backward pass stable?
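For anyone unfamiliar with the Cayley-map idea, here is a minimal plain-Python sketch for 3x3 rotations. It is not what my prototype does, and the parameter values and the "gradient" are made up. The point is only that the optimizer updates unconstrained skew-symmetric coordinates, and Q = (I - A)⁻¹(I + A) stays orthogonal after any Euclidean step.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(m)
    aug = [[float(v) for v in row] + eye_row
           for row, eye_row in zip(m, identity(n))]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def skew(a, b, c):
    """3x3 skew-symmetric matrix from three unconstrained parameters."""
    return [[0.0, -c, b], [c, 0.0, -a], [-b, a, 0.0]]

def cayley(params):
    """Cayley map: Q = (I - A)^-1 (I + A) with A skew-symmetric."""
    A = skew(*params)
    eye = identity(3)
    i_minus_a = [[eye[i][j] - A[i][j] for j in range(3)] for i in range(3)]
    i_plus_a = [[eye[i][j] + A[i][j] for j in range(3)] for i in range(3)]
    return matmul(inverse(i_minus_a), i_plus_a)

def orthogonality_error(q):
    """Largest entry of |Q^T Q - I|."""
    qtq = matmul(transpose(q), q)
    return max(abs(qtq[i][j] - (1.0 if i == j else 0.0))
               for i in range(3) for j in range(3))

params = [0.3, -1.2, 2.0]
fake_grad = [5.0, -3.0, 0.7]  # made-up gradient, for illustration only
updated = [p - 0.1 * g for p, g in zip(params, fake_grad)]  # plain SGD step

err_before = orthogonality_error(cayley(params))
err_after = orthogonality_error(cayley(updated))
print(err_before < 1e-9, err_after < 1e-9)
```

One known caveat: the Cayley map cannot produce rotations that have -1 as an eigenvalue (a rotation by exactly 180 degrees), so it covers almost all of SO(3) but not every element. Whether this kind of parameterization actually fixes the stall is the open question.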

Would love to get some thoughts from anyone building past the commutative limit.

reddit.com
u/Friendly-Way-9318 — 18 hours ago

▲ 2 r/CUDA

▲ 2 r/ChatGPTPro+1 crossposts

I have a ChatGPT Business plan. I asked around 10 questions, and now it shows a month-long cooldown. Can someone tell me what's happening?

u/Friendly-Way-9318 — 25 days ago