tl;dr LLM performance is inescapably limited by the availability of accessible ground-truth corpora, and unless LLMs demonstrate the ability to do long-horizon agentic work without being given external ground truth, we will see a bifurcated future where many classes of cognitive work become commoditized but others remain in the domain of humans.

#Preface

I’ve been trying to articulate why I feel like a lot of the arguments about how LLMs demonstrate “judgement” and “intelligence” seem incomplete. I spend every day writing software and doing “complex” things with AI, and I have gotten a lot of productivity out of it, but over time I’ve started to get disillusioned with the hand-waving magic of it all.

Neither camp of the main debate appeals to me. The doom lane (we’re not gonna have jobs, AI is rapidly going to do everything we do, only better) and the dismissal lane (stochastic parrots) both seem to miss what is actually happening right now.

I’m walking a third path: the technology is real, and the capability gains are real, but the disruption is going to commoditize structured-input cognitive work and leave the unstructured kind alone.

#“The Loop”

Every modern AI system runs on what I’ll refer to as “the loop”: a training process that ingests data, generates outputs, receives feedback signals about those outputs, and iterates. The feedback signal has to come from one of three places: explicit human labels, formal verification (does the code compile, does the proof check), or unambiguous outcomes (did the move win the game).
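
To make the shape of the loop concrete, here is a minimal toy sketch in Python. It is not how any real model is trained; `run_loop`, `verifier`, `no_signal`, and `jitter` are hypothetical names I made up for illustration, and the "right answer" of 3.0 is an arbitrary assumption. The only point is that the same generate-score-iterate machinery makes progress when a feedback signal exists and goes nowhere when it doesn't.

```python
import random
from typing import Callable


def run_loop(
    propose: Callable[[float, random.Random], float],
    feedback: Callable[[float], float],
    start: float,
    steps: int = 200,
    seed: int = 0,
) -> float:
    """Generate a candidate, score it with an external signal, keep improvements."""
    rng = random.Random(seed)
    best = start
    best_score = feedback(best)
    for _ in range(steps):
        candidate = propose(best, rng)   # generate an output
        score = feedback(candidate)      # receive a feedback signal about it
        if score > best_score:           # iterate: keep only what scored better
            best, best_score = candidate, score
    return best


def verifier(x: float) -> float:
    """Hypothetical ground truth: the right answer is 3.0 and the loop can see it."""
    return -abs(x - 3.0)


def no_signal(x: float) -> float:
    """No accessible ground truth: every output looks equally good."""
    return 0.0


def jitter(x: float, rng: random.Random) -> float:
    """Propose a small random variation of the current best output."""
    return x + rng.uniform(-0.5, 0.5)


with_truth = run_loop(jitter, verifier, start=0.0)
without_truth = run_loop(jitter, no_signal, start=0.0)
print(f"with ground truth: {with_truth:.3f}, without: {without_truth:.3f}")
```

With the verifier, the loop climbs toward the answer. With the flat signal, no candidate ever scores better than the starting point, so the loop returns exactly where it began.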

So far, we have achieved remarkable success in mining our entire corpus of human-generated digital data and finding some incredibly useful patterns in it (honestly sometimes it feels like we’re finding the Names of God), but the problems we have solved all have discoverable regularities in the input data AND can be evaluated against some signal. We are mostly working on ground-truth-rich datasets, and the right answers are accessible to the training loop.

I argue that in order to generalize, an LLM must acquire capabilities in domains where no ground-truth corpus exists, and none can be synthesized.

Right now, the dominant form of LLM progress is in software development capabilities. The software development feedback loop has gotten faster and faster, but speeding up the software loop has never turned it into anything other than a faster software loop. The reason I want to focus on software instead of other domains where LLMs apply is that it’s where the AGI argument holds the most strength: something like, recursive self-improvement (RSI) will lead to the emergence of AGI.

The thing I notice is that software has ALWAYS been improving software. It’s always been tightening the loop. It has never jumped rails to a different domain. Every time software has gone through a round of self-improvement, the generalized capability stayed within the bounds of solving structured-input problems.

Compilers got better. Then they got much better. Then we got compilers that wrote compilers. None of this produced a compiler that could write a contract, or a poem, or a diagnosis. The capability deepened within its native domain and didn’t leak outside it. The same is true of search engines: Google got vastly better at retrieving relevant pages, and PageRank’s descendants now power recommendation systems across the internet, but the loop never produced a search engine that could decide what was worth searching for. Spreadsheets got more powerful. VisiCalc became Excel became cloud-collaborative models that handle billions of cells, and the result was that a job that used to take a week now takes an afternoon, but spreadsheets never became something other than spreadsheets. The internet collapsed the cost of distribution and coordination across every industry simultaneously, which was probably the largest single technological disruption in modern history, and the work humans do on top of the internet looks structurally similar to the work we did before it. These are all vertical disruptions (cratering the price of work closest to the loop) that produced no horizontal generalizations.

AlphaFold is the strongest candidate for a software feedback loop generalizing out of its native domain. It started in machine learning and produced a revolution in structural biology. But if we examine what AlphaFold actually had to work with, it had really rich ground truth: roughly 170,000 solved protein structures from decades of X-ray crystallography and cryo-electron microscopy, paired with the amino acid sequences that produced them. The structure of every protein in the training set was experimentally verified by humans with physical equipment over decades of patient work. AlphaFold did not discover that protein folding was tractable; it exploited a tractability that crystallographers had been demonstrating for fifty years. Notably, AlphaFold did not generalize into clinical medicine, into patient care, or into the lived practice of being a doctor; it stopped at protein folding.

#This (the LLM) loop

I tried to lay out above that, so far, we have seen these loops work super well with structured-input, ground-truth-based problem-solving. I do not see evidence that they have displaced us, or will displace us, in any meaningful capacity in domains where there isn’t obvious structured-input problem-solving.

One of two conditions would have to hold for AGI to emerge from the current paradigm. Either (1) general intelligence is itself a structured-input problem operating over a sufficiently rich corpus, such that scale alone produces it, or (2) the loop must acquire capabilities in domains where ground truth doesn’t exist and can’t be synthesized, which I detailed above as something no software loop has ever done.

The places where LLMs have done the best are canonically the MOST ground-truth-rich domains that exist for cognitive work. Software compilation, tests, and execution steps all provide clear verification. The fact that software is being eaten first is evidence that the loop is operating exactly where you’d expect it to be, not that it’s exhibiting “judgement” or “taste”. If the loop is simply self-reinforcing, RSI just speeds that up and craters the price of software even faster.

Some will object that ground truth can be synthesized through RLHF, constitutional AI, self-play, or model-generated training data. These methods work when there’s an underlying verifiable signal, like AlphaZero playing itself because the rules of Go define a winner. RLHF trains models to be the kind of correct that humans rate highly, which is a different thing than being correct, and the documented issues with sycophancy, specification gaming, and confident hallucination of plausible-sounding falsehoods are exactly what you’d expect from a loop trying to manufacture ground truth it can’t actually access. The synthesis methods extend the loop’s reach into domains adjacent to ones with real ground truth but fall short of breaking out of the paradigm. Recent empirical work supports this: the “Feedback Friction” paper (Ye et al., 2025, arXiv 2506.11930) showed that LLMs plateau below target accuracy even when given access to high-quality external feedback with ground-truth answers, suggesting structural limits to how much the loop can absorb even within ground-truth-rich domains.

#What about ...

There are many domains where people have claimed that LLMs are generalizing to cognitive tasks that don’t fit the structured-input problem-solving condition I’ve set in this piece, but I don’t see it.

Mathematical reasoning is the case worth dwelling on, because it’s where bulls claim to see judgment most clearly and where the technical reality is most divergent from the headline. Recent AI mathematical results, such as DeepMind’s AlphaProof reaching silver-medal performance on the International Mathematical Olympiad, are real and impressive. They’re also, when you read the technical writeups, the product of massive search through combinatorial space against a formal verifier. AlphaProof translates problems into Lean, generates candidate proof steps, checks them against the formal system, and iterates. The proofs are valid but they are not, in the sense mathematicians mean the word, insight. They are stitching: finding combinations of lemmas no human happened to try, exploiting the fact that the model can consider vastly more proof paths than a human in the same time. (Mathematician Carina Letong Hong has framed a similar distinction, contrasting theory-building math like algebraic geometry against problem-solving math that operates in finite search spaces “like Go and chess.”) The ground truth is the verifier, and the corpus is the existing body of formalized mathematics. This is the structured-input paradigm operating beautifully.
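
As a rough illustration of what “search against a verifier” means, here is a toy sketch in Python. To be clear, this is not AlphaProof’s implementation and involves no Lean; the `LEMMAS` table, `verify`, and `search` are hypothetical stand-ins I invented, with integer arithmetic playing the role of a formal system. It only shows the pattern: enumerate combinations of known steps, and accept a candidate only when an independent checker confirms it.

```python
from collections import deque
from typing import Callable, Optional

# Hypothetical "lemmas": named steps that transform a state.
LEMMAS: dict[str, Callable[[int], int]] = {
    "double": lambda n: n * 2,
    "add_three": lambda n: n + 3,
    "decrement": lambda n: n - 1,
}


def verify(start: int, goal: int, steps: list[str]) -> bool:
    """Stand-in for a formal checker: replay every step and confirm the goal."""
    state = start
    for name in steps:
        if name not in LEMMAS:
            return False
        state = LEMMAS[name](state)
    return state == goal


def search(start: int, goal: int, max_depth: int = 8) -> Optional[list[str]]:
    """Breadth-first enumeration of step sequences, accepted only by verify()."""
    queue: deque[tuple[int, list[str]]] = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if verify(start, goal, path):    # the verifier is the ground truth
            return path
        if len(path) == max_depth:
            continue
        for name, step in LEMMAS.items():
            nxt = step(state)
            if nxt not in seen:          # skip states already reached
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None


proof = search(start=1, goal=11)
print(proof)
```

Everything the search can ever return is a recombination of entries already in `LEMMAS`. Nothing in this procedure can add a new kind of step to the table, which is the distinction the next paragraphs turn on.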

Compare this to the move mathematicians actually mean when they talk about insight. Alexander Grothendieck reconstructed algebraic geometry in the 1960s by inventing the theory of schemes: a category-theoretic framework that replaced the classical notion of an algebraic variety with something more abstract, more general, and (at the time) wildly unfashionable. Schemes weren’t in the existing corpus, nor were they a combination of lemmas no one had tried to combine. They were a new category of mathematical object, invented to reframe the foundations of an entire field. Grothendieck’s collaborators famously found his approach disorienting precisely because he wasn’t solving problems within the existing framework; he was constructing a new framework in which the old problems became almost trivial. His own metaphor for this style was the rising sea: rather than attack a hard problem directly, he would slowly raise the surrounding theoretical level (develop the right concepts, the right abstractions, the right language) until the problem dissolved on its own. The water rose, the rock submerged, and what had looked like an obstacle became a feature of the new landscape.

No current AI does anything resembling this move, and the loop has no mechanism to. Constructing a new category of mathematical object isn’t combinatorial search over existing objects. It’s the generation of a frame that isn’t in the training data, justified by considerations that aren’t formalizable until after the frame exists. The verifier can confirm that schemes-based proofs of classical theorems are correct, but no verifier could have told Grothendieck to invent schemes. The judgment that drove the work (that this reframing would pay off, that the abstraction was the right one to pursue, that the years of foundational development would eventually yield results) was the kind of cognitive move that current AI can’t do, hasn’t done, and has no apparent mechanism for doing.

Creative writing is the case where the surface mimicry has gotten genuinely impressive and the underlying gap has gotten harder to articulate. Current models produce fluent prose, coherent stories, and recognizable stylistic imitation. The training corpus is the entirety of human published writing, and the feedback signal is statistical fit to that corpus, refined by human raters telling the model which outputs they prefer. This is enough to clear the bar of competent generic prose and that category of work is in real trouble. What hasn’t fallen is the upper register: writing that’s doing meaning-making rather than meaning-recombination. The signature of AI fiction, even at its current best, is that it’s fluent and structurally hollow. It has the shape of a story without the underlying generative process that produces meaning. Readers can often feel this without being able to articulate it.

The language translation argument dissolves on inspection of the training data. Parallel corpora exist at industrial scale: every United Nations document is published in six languages, the European Parliament publishes proceedings in twenty-four, subtitled films and dubbed media provide billions of aligned sentence pairs, and the open web is full of professionally translated content with the source text adjacent. “This sentence in language A corresponds to this sentence in language B” is one of the most ground-truth-rich training signals that exists for any cognitive task. A more interesting version of the translation case is the decoding of ancient languages, where AI has made real contributions to reading texts no living person could read. The Vesuvius Challenge recovered passages from Herculaneum scrolls that had been unreadable for two millennia. ML-based methods have accelerated cuneiform translation. These are impressive results and they look, on the surface, like decoding rather than translation. But examine the cases and the same pattern appears. The Herculaneum work was image-recovery on damaged text written in known Greek; the underlying language wasn’t unknown, only the visible surface was destroyed, and the ground truth came from passages where ink remained legible plus the entire corpus of classical Greek. Cuneiform translation works because Assyriologists have spent 150 years building scholarly translations that serve as the training corpus. Earlier successes like Linear B and Ugaritic depended on the lucky discovery that the underlying language was related to a known one (Greek for Linear B, Northwest Semitic languages for Ugaritic), which gave the decoder ground truth to align against. The negative case is Linear A, which has a substantial corpus, more than a century of expert attention, and modern computational methods applied to it, and which remains undeciphered. Ancient language work succeeds where ground truth exists in some form (a cognate language, a known underlying language with damaged surface, or an existing scholarly corpus) and fails where it doesn’t.

Medical diagnosis is a harder case and real capability gains are happening. AI systems are now matching or beating specialist doctors on specific tasks: radiology reads for certain cancers, dermatology classification of skin lesions, pathology slide analysis, and retinal scans for diabetic retinopathy to name a few. These results are the parts of medicine with the cleanest ground truth: image classification against confirmed biopsies... which are structured outputs evaluated against structured outcomes. What hasn’t fallen, and shows no signs of falling, is everything that constitutes the practice of being a doctor: integrating ambiguous patient history, weighing how much to trust a self-reported symptom against what the labs show, navigating the conversation about a frightening diagnosis, managing chronic conditions where the right treatment depends on what the patient will actually do, and taking legal and ethical responsibility for the decision. Chunks of medical work will be eaten, but the unstructured territory in medicine is also larger than many acknowledge, and it’s where doctors actually spend most of their time.

Scientific research assistance follows the same pattern as mathematics, with the same boundary in the same place. AI is meaningfully accelerating the structured-input parts of scientific work: literature review, hypothesis generation from existing patterns, experimental design suggestions based on prior experiments, automated analysis of data with known structure, protein design, and materials screening. Every single one of them is a search through combinatorial space against accessible ground truth: proteins fold or they don’t, materials have measurable properties, and hypotheses are restatements of patterns in published work that no human happened to combine. What isn’t being accelerated is deciding which research programs are worth a decade of work, recognizing anomalies that don’t fit existing frameworks and trusting the anomaly over the framework, knowing when to abandon a productive line of inquiry because something more important has appeared, and building the institutional and intellectual conditions that let young researchers do their best work. Kuhn called this paradigm-shifting science, and his core observation was that paradigm shifts are not produced by optimization within the existing paradigm; they’re produced by a different cognitive move entirely, the same move Grothendieck made in mathematics.

Persuasion is the case where the published results are most overstated relative to what the underlying capability actually shows. Recent studies have demonstrated that AI systems can be more persuasive than human controls in specific experimental settings: one-shot text exchanges with strangers, structured debate formats, and A/B tests on political messaging. These findings get cited as evidence that AI is acquiring “social judgment”, but the experimental setting is where the smell is. Persuading a stranger in a single textual exchange is closer to a structured-input problem than it appears: there’s substantial training data on what arguments move which demographics, the success metric (did they update their stated view) is measurable in controlled conditions, and the interaction has no history and no future. Real-world persuasion has none of these properties. Changing a colleague’s mind requires sustained relationship and accumulated specific credibility. Building trust to enable a hard conversation takes years and depends on consistency across hundreds of small choices. Navigating a family conflict requires holding the entire history of the family in mind while engaging the specific moment. None of this is in any training corpus the loop can access, none of it has measurable ground truth, and none of it has been demonstrated by any AI system.

Across every case, AI is producing massive capability gains in the structured-input regions of each domain, and simultaneously showing no signs of acquiring the unstructured capacities that are being predicted. There is not a single demonstration where LLMs have achieved long-horizon agentic work in domains without external ground truth.

If that changes, my position is wrong. Here’s the specific case I’d worry about:

Consider an AI agent given a multi-month project with no clean reward signal: “make this startup successful,” “diagnose what’s wrong with this organization,” or “figure out what research program is worth pursuing.” There’s no corpus of ground truth for “successful startup outcomes” with the structure the loop needs. The agent would have to generate its own ground truth through interaction with reality by taking actions, observing consequences, integrating feedback that emerges from its own choices, and persisting through long horizons without external verification. That capability would be genuinely new. No software feedback loop has ever done it, and the philosophical argument for why the loop is bounded would be in serious trouble if one did. This is also where current AI research is hitting walls. Long-horizon agentic work is the active frontier; the METR doubling graph measures task length in domains with accessible ground truth, and no equivalent measurement exists for domains where ground truth has to be constructed by the agent in real time. If that capability emerges and starts scaling, my position is wrong, and I’ll say so. Right now it hasn’t, and the loop’s nature suggests reasons it might not.

#Doomsday Ice Cream

I claimed that AI can’t develop judgement without a ground-truth corpus (or at least that it hasn’t happened yet), but one might ask: if humans developed judgement with no ground-truth corpus, why can’t AI?

There’s a gag in Futurama where a Renaissance-era Leonardo da Vinci has built a doomsday machine. The doomsday machine has an unexpected feature: it also makes ice cream, but it wasn’t designed to make ice cream. The ice cream is a side effect of the mechanism that was supposed to end the world.

Human judgment is the ice cream.

Evolution was a differential-reproduction process under embodied, mortal, social, and geological conditions, not a system tasked with developing judgement. Human judgment is the side effect - the ice cream that fell out of a machine that was optimizing for other things entirely. (Aside: Stephen Jay Gould and Richard Lewontin called such features “spandrels” - traits that arise as architectural byproducts of selection for other things.) We seem to be trying to manufacture human-like judgement by scaling next-token predictor systems and hoping that judgement falls out, but we don’t actually have the blueprint for how to make a “judgement machine”. We rely on the implicit assumption that judgement is the natural attractor of any sufficiently complex optimization process. But the only existence proof we have for human judgement was produced by a process that wasn’t targeting judgement, while operating under conditions that current AI training shares none of.

What we might get instead is genuinely useful capabilities that are real, valuable, and shaped differently from human judgment. Different ice cream from a different doomsday machine. The current paradigm is producing extraordinary structured-input problem solvers and there is nothing wrong with that. The mistake is calling the outputs “general intelligence” or “judgment” and extrapolating as if you’ve built something that produces those things by design, which we haven’t. We’ve built a specific machine with specific outputs, and the bonus capabilities at the edges are interesting but they’re not what the machine is for.

It may be the case that human judgement is a capacity that only develops in systems with skin in the game: with exposure to consequences over time, in embodied conditions, and where being wrong has some real costs. There are no direct analogues in the LLM software loop we are seeing today, and there isn’t a theoretical or empirical reason to assume that capacity can emerge from a process without those conditions.

That said, “it never happened before” is a bad argument. “No machine will ever fly” was wrong. Many structurally similar arguments have been wrong. The disanalogy is that flight had a clear physical mechanism that humans could observe operating in birds (lift, thrust, the mechanics of wings) and the question was whether humans could engineer the mechanism. The case for AGI doesn’t have an analogous mechanism. It has “scale produces emergence” as a hope, not a theory. The skeptics of flight were wrong because they ignored an observable mechanism, while the skeptics of AGI are pointing out that there isn’t one.

#Implications

If the loop is bounded the way I’ve argued - that is: operating within structured-input domains, and unable to acquire capabilities in domains without accessible ground truth - then the disruption it produces will be a bifurcation where cognitive work splits into two categories and the categories diverge in value.

On one side is work with accessible ground truth, which will become incredibly commoditized. The marginal cost of producing it approaches the marginal cost of running the model:

- Basic legal research where the answer is existing case law
- Financial modelling steps with structured inputs and outputs
- Medical imaging reads where the diagnosis can be confirmed by biopsy
- Commercial floor content production where “good enough” is the bar

The humans who currently do this work as their primary job face severe pressure.

On the other side is work that resists structuring:

- Sustained relationships where trust is the asset and trust is built over years of specific shared history.
- Embodied presence in physical spaces where the work happens.
- Licensed accountability where someone has to sign their name and bear the legal consequence.
- Novel judgment under genuine uncertainty where the ground truth doesn’t yet exist and won’t exist until after the decision is made.
- Paradigm-shifting insight in domains where the right answer requires generating the frame, not optimizing within an existing one.

This work will grow in premium, because the supply of capable humans stays roughly constant while the supply of capable AI substitutes never materializes.

I’m not convinced that the choice is between everything falling and nothing falling. The reality is that two things seem to be happening at once on different curves. The most valuable positions are on the durable side of the bifurcation, and the most valuable bets are against the structured-input side breaking out of its paradigm through scale alone.

(end note: I copied and edited this post over from my Substack but have no reach over there and want to see what other people think about this perspective here).

u/Rcraft
