New result from the Volatile line of GENREG experiments. I figured out how to stack layers in a fully evolved, gradient-free model without losing performance. Accuracy actually improves with depth: from 71.9% (Layer 1's validation accuracy at generation 1) to 87.12% test accuracy across four auto-stacked layers.

Why MNIST again?

I was deep into building GENREG-LLM, specifically the attention mechanism, when I hit a wall. Attention requires depth. Depth requires stacking. And stacking in a gradient-free evolutionary system isn't something you can just bolt on. The LLM work demanded a level of abstraction that I didn't have a reliable foundation for yet.

So I reverted to MNIST. Not because the LLM work failed, but because I needed to understand something more fundamental first: how the neurons themselves function, how layers actually compose, and what happens structurally when you stack evolved components on top of each other. You can't build a skyscraper if you don't understand the load-bearing behavior of individual beams.

There's a whole thread here I'll be posting about separately. Single neurons in GENREG do something interesting: they naturally freeze and soft-freeze in saturation regions without being told to. The neuron discovers its own operating boundaries through evolution. That post will cover the full single-neuron capability mapping, what one evolved neuron can and can't do, and what I learned about the address space of a tanh neuron under evolutionary pressure. But that's its own story. This post is about what came after those lessons: stacking.

What is GENREG-Volatile?

Standard GENREG already operates without backpropagation. You evolve weights, you evolve structure, and a fitness landscape determines what survives. But there are still things you decide as the architect: how many hidden neurons, what activation function to use, the sparsity level, mutation rates. You set the scaffolding and let evolution fill it in.

Volatile throws all of that away.

In G-Volatile, everything is evolved. The number of neurons in a hidden layer? Evolved. The activation functions? Evolved. K-sparsity? Evolved. Mutation rates? Evolved. There is no fixed architecture. The population explores structural space and weight space simultaneously, every generation. What comes out the other side is not something you designed. It's something that survived.
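
To make "everything is evolved" concrete, here is a minimal sketch of what a genome like that could look like. This is not the GENREG code. The `Genome` fields, the activation menu, the clamp range on the mutation rate, and the log-normal self-adaptation rule are all assumptions I picked for illustration.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical activation menu; the real set used by Volatile is not listed here.
ACTIVATIONS = {
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
    "sin": math.sin,
}

@dataclass
class Genome:
    n_in: int
    weights: list      # one row of n_in weights per hidden neuron
    activation: str    # key into ACTIVATIONS
    k: int             # k-sparsity: how many neuron outputs stay active
    mut_rate: float    # the mutation rate is itself a gene

def random_genome(n_in, rng):
    n_hidden = rng.randint(2, 8)
    return Genome(
        n_in=n_in,
        weights=[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        activation=rng.choice(sorted(ACTIVATIONS)),
        k=rng.randint(1, n_hidden),
        mut_rate=0.1,
    )

def mutate(g, rng):
    # Self-adaptive rate: perturb it log-normally, then clamp (assumed range).
    rate = min(0.5, max(0.01, g.mut_rate * math.exp(rng.gauss(0, 0.2))))
    weights = [[w + rng.gauss(0, rate) for w in row] for row in g.weights]
    # Structural mutation: occasionally remove or add a neuron.
    if rng.random() < rate:
        if rng.random() < 0.5 and len(weights) > 1:
            weights.pop(rng.randrange(len(weights)))
        else:
            weights.append([rng.gauss(0, 1) for _ in range(g.n_in)])
    activation = g.activation
    if rng.random() < rate:
        activation = rng.choice(sorted(ACTIVATIONS))
    k = g.k
    if rng.random() < rate:
        k += rng.choice((-1, 1))
    k = min(len(weights), max(1, k))  # keep k valid for the current width
    return Genome(g.n_in, weights, activation, k, rate)

def forward(g, x):
    act = ACTIVATIONS[g.activation]
    out = [act(sum(w * xi for w, xi in zip(row, x))) for row in g.weights]
    # k-sparsity: keep the k largest-magnitude outputs, zero the rest.
    order = sorted(range(len(out)), key=lambda i: abs(out[i]), reverse=True)
    keep = set(order[:g.k])
    return [v if i in keep else 0.0 for i, v in enumerate(out)]
```

The point of the sketch is that width, activation, sparsity and mutation rate all live in the same object that gets mutated, so selection acts on structure and weights at the same time.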

This makes Volatile models inherently unpredictable. The population is constantly shifting shape. One layer might settle on 47 encoder neurons. The next evolves 34. A third picks 45. Different activation functions, different sparsity patterns, different internal structure entirely. But they accomplish the same task. You don't control a Volatile model. You tame it.

The Stacking Problem

Stacking layers in a system like this should be a disaster. In gradient-based deep learning, layers are co-optimized through backprop. Every layer adjusts to every other layer simultaneously. In Volatile GENREG, each layer is evolved independently. There's no joint optimization. No gradient signal flowing backward. And the layers themselves aren't even stable architectures; they're whatever evolution decided to build that generation.

The naive expectation is catastrophic interference. Stack two volatile, independently evolved layers and the second one should thrash against the first.

Freeze and Grow

The solution is a protocol I'm calling auto-stack freeze-and-grow.

Evolve Layer 1 until the population converges on a useful representation. Freeze it.
Layer 1 is now a fixed feature extractor. Evolve Layer 2 on top of frozen L1 output. Let it find whatever structure works. Freeze it.
Repeat. (It's a little more nuanced than this because of "soft" freezing, but that deserves its own post as well.)

Each new layer only has to solve one problem: "given the frozen representation below me, what can I extract?" It never has to worry about the ground shifting underneath it, because the lower layers are locked.
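
The protocol above can be sketched in a few lines. This is a toy under loud assumptions, not the Volatile implementation: a (1+1) hill climb stands in for the evolving population, layer width is fixed instead of evolved, every layer is tanh, the task is XOR instead of MNIST, and "soft" freezing is not modeled at all. All function names are made up for the example.

```python
import math
import random

def layer_forward(layer, x):
    # Each row is [bias, w1, w2, ...] for one tanh neuron.
    return [math.tanh(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
            for row in layer]

def accuracy(layer, readout, data):
    # Fitness: accuracy of a linear readout on top of the layer's output.
    correct = 0
    for x, y in data:
        h = layer_forward(layer, x)
        score = readout[0] + sum(r * hi for r, hi in zip(readout[1:], h))
        correct += int((score > 0) == (y == 1))
    return correct / len(data)

def evolve_layer(data, n_in, n_hidden, rng, generations=200, sigma=0.3):
    # (1+1) hill climb: mutate, keep the child if it is at least as fit.
    layer = [[rng.gauss(0, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    readout = [rng.gauss(0, 1) for _ in range(n_hidden + 1)]
    best = accuracy(layer, readout, data)
    for _ in range(generations):
        cand_layer = [[w + rng.gauss(0, sigma) for w in row] for row in layer]
        cand_readout = [r + rng.gauss(0, sigma) for r in readout]
        fit = accuracy(cand_layer, cand_readout, data)
        if fit >= best:
            layer, readout, best = cand_layer, cand_readout, fit
    return layer, readout, best

def freeze_and_grow(data, n_layers, n_hidden, rng):
    frozen, history, readout = [], [], None
    features = data
    for _ in range(n_layers):
        n_in = len(features[0][0])
        layer, readout, acc = evolve_layer(features, n_in, n_hidden, rng)
        frozen.append(layer)   # freeze: this layer is never mutated again
        history.append(acc)
        # The next layer only ever sees the frozen layer's output.
        features = [(layer_forward(layer, x), y) for x, y in features]
    return frozen, readout, history

rng = random.Random(0)
xor = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
frozen, readout, history = freeze_and_grow(xor, n_layers=3, n_hidden=4, rng=rng)
print(history)  # per-layer accuracy at the moment each layer was frozen
```

The structural point is the last line of the loop: once a layer is appended to `frozen`, the data is re-encoded through it and the next search never touches it.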

What the frames show

The attached gif walks through the full training progression:

Gen 1: Layer 1 is live, 47 enc neurons, 22 hidden. Val acc 71.9%. Everything else is "not yet."

Gen 259: L1 and L2 are frozen. L3 is live and training with 35 enc neurons and 21 hidden. Test acc has climbed to 86.2% even though the current layer's val is only 72.9%. The frozen layers beneath are doing the heavy lifting.

Gen 392: Three layers frozen, L4 is live. Test acc 85.93%. L4 is still early, still volatile.

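Gen 484: All four layers frozen. Val 86.33%, test 87.12%. The full stack stabilized. Test accuracy dipped slightly while L4 was still live (86.2% to 85.93%), but the finished stack ends above every earlier checkpoint. No degradation from adding depth.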

Look at the val acc chart on the right side of each frame. You can see each layer's contribution tracked independently. L1 oscillates in the 70-80% range. L2 lands similarly. L3 starts climbing. Then L4 locks in and the composite result pushes past 87%.

Nobody told this model how many neurons to use. Nobody picked the activations. Nobody set the sparsity or the mutation rates. Evolution did all of that, independently, at every layer. And the layers still compose cleanly when stacked.

You don't train evolutionary models. You tame them.

https://github.com/A1CST/GENREG_LLM_V1

YEAH FUCKING BUDDY.

GENREG LM

A language-model-shaped pipeline trained without gradient descent and without backpropagation. Parameters were discovered by evolutionary search. The n-gram statistics were counted directly from a corpus.
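
For readers unfamiliar with counted n-gram statistics, here is a minimal sketch of the idea. It is not the repo's code: the function names, the whitespace tokenization, and the toy corpus are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def count_ngrams(tokens, n):
    # Map each (n-1)-token context to a Counter of the tokens that follow it.
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_token_probs(counts, context):
    # Relative frequencies for the next token; empty dict for unseen contexts.
    followers = counts.get(tuple(context))
    if not followers:
        return {}
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}

tokens = "the cat sat on the mat the cat ran".split()
bigrams = count_ngrams(tokens, 2)
probs = next_token_probs(bigrams, ["the"])
```

Nothing here is learned or evolved. The probabilities fall straight out of counting, which is why this part of the pipeline needs no gradient.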

It is a research artifact, not a chatbot. Expect short English-shaped fragments.

If you came here expecting GPT-2, you are in the wrong repo.

Output is phrase-level, not sentence-level. You will recognize English words and short phrases. You will not get coherent answers or multi-sentence paragraphs.
Topic drifts every 10 to 20 tokens. The model has no long-range memory.
The evolved attention stack actively makes generation worse when blended with the n-gram cascade. Best outputs come from pure n-gram (/ngram 1.0), which is mostly corpus statistics, not learned parameters.
There is no instruction-following, no dialog, no reasoning. Asking it a question will produce text that sometimes looks like an answer and usually isn't.
Numbers, rare words, proper nouns, and punctuation are all weak spots.
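
As a rough picture of what blending the attention stack with the n-gram cascade could mean, here is a sketch using plain linear interpolation between two next-token distributions. The README does not describe the actual blending rule, so the formula, the `blend` function, and the reading of `/ngram 1.0` as an n-gram weight of 1.0 are all assumptions.

```python
def blend(p_ngram, p_model, ngram_weight):
    # Linear interpolation of two next-token distributions (assumed rule).
    if not 0.0 <= ngram_weight <= 1.0:
        raise ValueError("ngram_weight must be in [0, 1]")
    vocab = set(p_ngram) | set(p_model)
    mixed = {
        tok: ngram_weight * p_ngram.get(tok, 0.0)
        + (1.0 - ngram_weight) * p_model.get(tok, 0.0)
        for tok in vocab
    }
    total = sum(mixed.values())
    # Renormalize in case either input did not sum to exactly 1.
    return {tok: p / total for tok, p in mixed.items()} if total > 0 else mixed

p_ngram = {"cat": 0.75, "mat": 0.25}
p_model = {"cat": 0.1, "dog": 0.9}
pure_ngram = blend(p_ngram, p_model, 1.0)
half = blend(p_ngram, p_model, 0.5)
```

Under this reading, a weight of 1.0 ignores the evolved model entirely: every token the model proposes that the n-gram table has never seen ends up with probability zero.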

If that still sounds interesting: the pipeline runs at all, without a gradient, on a single GPU, in hours of evolution instead of days of backprop.

u/AsyncVibes

GENREG-Volatile(GV): Stacking Evolved Layers Without Performance Degradation
