
I just watched this unbelievable video on another subreddit where Anthropic claims that their "natural language autoencoders" can translate Claude's next-token prediction into text.
They make a number of claims based on reading the "thoughts" of Claude that I'm having trouble wrapping my head around. Claims like, "We've found that Claude has internalized being a helpful AI model," and, "Claude knew that it was being given a test."
From what I know of next-token prediction, there is no way that you could translate it into coherent English-language "thoughts." An LLM converts words into tokens and then uses the patterns between tokens in its training data to predict which token/word is most likely to come next. It doesn't think, "I'm being given a test"; rather, it says, "I'm being given a test," because it has predicted that those are the most likely words to come next, based on the map of word associations it has from its training data. If you were to translate an LLM's analysis, the daisy chains of tokens that constitute an LLM's "thoughts," back into words, wouldn't it just be a total jumble?
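To make my mental model concrete, here's a toy sketch in Python of what I mean by "predict the most likely next token." It's just a bigram counter, which is my own simplification and definitely not how Claude or any real LLM is built (those are neural networks, not lookup tables). The function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()  # crude whitespace "tokenizer"
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it has none."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_tokens=5):
    """Greedily chain next-token predictions, starting from `start`."""
    out = [start]
    for _ in range(max_tokens):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


# Hypothetical three-sentence "training set".
corpus = [
    "i am being given a test",
    "i am being given a chance",
    "i am being given a test today",
]
counts = train_bigram(corpus)
print(generate(counts, "i"))
```

In this toy version the only thing "inside" the model is a table of follower counts, so there's nothing to read out except which word tends to follow which. That's the picture I have in my head when I ask whether there could be anything thought-like to translate.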
So is this total anthropomorphic BS from Anthropic or am I missing something?