u/Alexi_Popov

BTW the model is 15M params (not an ordinary Transformer) and is pre-training as we speak. It has only done about 800 of its max 20k training steps.

No SFT or anything. I just wanted to see how the model's stability holds up. If you have ever worked on pre-training LLMs, is this what you see as well? (Mathematically it makes sense to me, as acc is about 0.17, but I want to be sure. This run is expensive compute, and as an independent researcher I have more to lose if it fails, so it would be a big loss for me if it did not work out.)
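One way to sanity-check a number like acc ≈ 0.17 is to compare it with a trivial baseline. The sketch below assumes "acc" means top-1 next-token accuracy, which is not stated above. The function names and the toy token IDs are made up for illustration and are not taken from the actual run.

```python
from collections import Counter


def top1_accuracy(predicted_ids, target_ids):
    """Fraction of positions where the predicted token equals the target."""
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predictions and targets must have the same length")
    if not target_ids:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return hits / len(target_ids)


def majority_baseline_accuracy(target_ids):
    """Accuracy of always predicting the single most frequent target token."""
    if not target_ids:
        return 0.0
    most_common_count = Counter(target_ids).most_common(1)[0][1]
    return most_common_count / len(target_ids)


# Toy data (hypothetical token IDs, not from the real training run).
targets = [5, 5, 7, 9, 5, 3, 7, 5, 2, 5]
predictions = [5, 7, 7, 9, 1, 3, 5, 5, 2, 0]

model_acc = top1_accuracy(predictions, targets)
baseline_acc = majority_baseline_accuracy(targets)
print(f"model: {model_acc:.2f}  majority baseline: {baseline_acc:.2f}")
```

If the model's accuracy on held-out text sits clearly above the majority-token baseline and keeps rising with steps, that is at least consistent with healthy early pre-training.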

When testing the scaled-down 1k, 10k & 100k param versions of the architecture on set patterns, the model showed high intelligence. Trained for only a couple of steps (<500), it learned the multiplication scheme taught to it at all test sizes. The 1k variant was perfect within the range it was trained on but started failing as the input size grew into held-out data it was never shown during training (it got 64/100 on those unseen tests, still good considering a vanilla Transformer with ~600k params did worse).

The 10k and 100k variants showed sparks of supreme intelligence per param, outperforming on held-out patterns by up to 10M digits more than they were ever trained on. The model was trained to multiply up to 10000, and it multiplied up to `10000-(12 zeros more)` with 100% accuracy, even surpassing CPU computation, which is off by some floating-point error. Both the 10k and 100k models scored 10k/10k. Idk how, but the 100k model somehow came up with its own logic for addition: it was able to add using multiplication.
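On the floating-point part: a CPU result being slightly off is expected when the product is computed in 64-bit floats, which represent integers exactly only up to 2^53 (about 9.007e15). Python's built-in `int` is arbitrary precision, so it is a safer ground truth for scoring. A minimal sketch, reading `10000-(12 zeros more)` as operands around 1e16 (an assumption); `float_product`, `exact_match_score` and the operand pairs are hypothetical and not the actual eval code.

```python
def float_product(a: int, b: int) -> int:
    """Multiply via 64-bit floats, then convert back to int (lossy for large values)."""
    return int(float(a) * float(b))


def exact_match_score(predict, pairs):
    """Count pairs where predict(a, b) equals the exact integer product."""
    return sum(predict(a, b) == a * b for a, b in pairs)


# Odd operands just above 1e16 (illustrative values only).
pairs = [(10**16 + 1, 3), (10**16 + 3, 7), (10**16 + 5, 9)]

exact_score = exact_match_score(lambda a, b: a * b, pairs)  # exact integer reference
float_score = exact_match_score(float_product, pairs)       # float64 path

print(f"exact ints: {exact_score}/{len(pairs)}  float64: {float_score}/{len(pairs)}")
```

Scoring model outputs against the integer reference avoids counting a correct answer as wrong just because the float reference was rounded.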

I really see this as something: as we speak, this 15M-param model outperforms Qwen-3-4B-base on the same training data under the same hyperparameter checks.

The training dataset is ~1.05B tokens of high-quality general-domain data: science, creative writing, maths, and general school knowledge.
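To put the numbers above side by side, here is a rough back-of-the-envelope sketch. The token count, parameter count and step counts come from this post; the assumption that the 20k steps make exactly one pass over the dataset is hypothetical and only used to estimate tokens per step.

```python
# Figures stated in the post.
dataset_tokens = 1_050_000_000  # ~1.05B tokens
model_params = 15_000_000       # 15M parameters
steps_done = 800
max_steps = 20_000

# Tokens available per parameter if the whole dataset is used once.
tokens_per_param = dataset_tokens / model_params

# Fraction of the planned schedule completed so far.
progress = steps_done / max_steps

# ASSUMPTION (not stated in the post): 20k steps == exactly one pass over the data.
tokens_per_step = dataset_tokens // max_steps
tokens_seen_so_far = tokens_per_step * steps_done

print(f"tokens per param: {tokens_per_param:.0f}")
print(f"schedule progress: {progress:.0%}")
print(f"tokens seen so far (one-epoch assumption): {tokens_seen_so_far:,}")
```

Under that one-epoch assumption, step 800 corresponds to only about 4% of the data, so early metrics say little about where the run ends up.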

From what I can see, the model is a pattern-recognition beast. It learns like crazy, and crazy fast. I was training its 1M-param version and, you will not believe it, it learned the entire TinyStories dataset, which has like 2M rows (repetitive and close to `Once upon a time` types, I know... since LLMs are normalised output machines, "generalization" is obvious once saturation is reached). Back to the experience: it learned the format in 500 steps (not accurate or too coherent), but dammit, the model was really close (it even guessed the next character name perfectly) to training data it never even got to see. Those 500 steps covered 64k samples out of the 2M.
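The coverage numbers in that run work out as follows. This is plain arithmetic on the figures above, assuming "64k" means 64,000 samples and "2M" means 2,000,000 rows (both are rounded in the post); the helper names are made up for illustration.

```python
def dataset_coverage(samples_seen: int, dataset_size: int) -> float:
    """Fraction of the dataset the model has been shown."""
    return samples_seen / dataset_size


def samples_per_step(samples_seen: int, steps: int) -> float:
    """Average number of samples consumed per optimizer step."""
    return samples_seen / steps


samples_seen = 64_000      # assumed reading of "64k"
dataset_size = 2_000_000   # assumed reading of "2M"
steps = 500

fraction_seen = dataset_coverage(samples_seen, dataset_size)
batch_size = samples_per_step(samples_seen, steps)

print(f"seen {fraction_seen:.1%} of the dataset, about {batch_size:.0f} samples per step")
```

So the model had seen roughly 3% of the rows, at an implied batch size of about 128, when it picked up the story format.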

This is why I am trying to scale as far as my budget allows and test this model. If it fails I may be a fool; I can only find that out afterwards (I may already be a stupid fool) 😄

So if you see something strange, please help me. Don't be afraid to ask questions; apart from architecture details, I can share everything.

https://preview.redd.it/j6if0jxm8h2h1.png?width=1054&format=png&auto=webp&s=2c5be301908e4861c515b5f22d1f72974606d264

Is this good for a model that is still pre-training?