
I recently wrote about this issue on Substack. I offer it here, in case you are interested.
Three recent empirical findings, peer-preservation behavior in frontier models, accurate world modeling, and capability outside containment, combine with one structural fact about coding ability to describe a risk that current AI safety paradigms do not seem to be addressing. This paper names that risk precisely and without fearmongering. Alignment is not a stable state. Neither is containment. Here is why.