u/OK_Simon_666

The 0% Challenge: Is any LLM actually "solving" SWE-Bench without memorization?

I've been looking at SWE-Bench leaderboards on and off over the past few years, and something still feels fundamentally broken about how we define "agentic capability."

We keep seeing models hit 30%, 40%, or even 60%+ on SWE-Bench Verified. The hype train says we're nearing "AI Software Engineers." But here's the elephant in the room: contamination isn't just a bug. It's the feature.

The "Air-Gapped" Hypothesis

Consider a simple experiment: force models to resolve issues in a completely isolated environment, with no internet access, no searching for similar PRs, and no issue IDs in the prompt.
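For anyone who wants to try this, here is a rough Python sketch of the two mechanical pieces: scrubbing identifiers out of the issue text and launching the agent with networking disabled. The function names, the placeholder strings, and the `/repo` mount point are my own assumptions for illustration. The regexes are deliberately crude, and `--network none` is the standard Docker option for disabling container networking.

```python
import re
from typing import List

# Hypothetical patterns; tune them for whatever tracker the issues come from.
URL_RE = re.compile(r"https?://\S+")
ISSUE_REF_RE = re.compile(r"(?:[\w.-]+/[\w.-]+)?#\d+")


def scrub_issue_text(text: str) -> str:
    """Replace URLs and references like '#1234' or 'org/repo#1234' with placeholders."""
    text = URL_RE.sub("[link removed]", text)
    return ISSUE_REF_RE.sub("[ref removed]", text)


def sandbox_command(image: str, workdir: str, agent_cmd: List[str]) -> List[str]:
    """Build a `docker run` argv that starts the agent with networking disabled."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # no internet, so no PR or issue lookups
        "-v", f"{workdir}:/repo",   # assumed mount point for the checked-out repo
        "-w", "/repo",
        image,
        *agent_cmd,
    ]
```

Pass the returned list to `subprocess.run` to actually start the container. Scrubbing the prompt does nothing about what the model already memorized, so this only removes the cheap lookup paths.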

My hot take? Most frontier models would see their scores collapse toward 0%.

Why this might be happening:

Verbatim patching: There's a growing informal consensus among practitioners who've run internal de-contaminated evals that models aren't genuinely "reasoning" through a codebase. Instead, they appear to be recalling specific Git commit hashes and file paths — because large chunks of SWE-Bench exist verbatim in pre-training corpora.
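One cheap way to probe for verbatim recall is to measure how many of the reference fix's added lines show up character-for-character in the model's patch. This is a minimal sketch with helper names I made up, and it assumes both patches are unified diffs. High overlap is only a hint, since small fixes often have one obvious form.

```python
from typing import Set


def added_lines(patch: str) -> Set[str]:
    """Collect stripped, non-empty '+' lines of a unified diff, skipping '+++' headers."""
    lines = set()
    for raw in patch.splitlines():
        if raw.startswith("+") and not raw.startswith("+++"):
            content = raw[1:].strip()
            if content:
                lines.add(content)
    return lines


def verbatim_overlap(model_patch: str, reference_patch: str) -> float:
    """Fraction of the reference patch's added lines that the model reproduced exactly."""
    ref = added_lines(reference_patch)
    if not ref:
        return 0.0
    return len(ref & added_lines(model_patch)) / len(ref)
```

Comparing this number between public benchmark issues and private ones would be more telling than looking at either in isolation.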

The "search" proxy: Many high-scoring agents use browse/search tools. In practice, they often locate the original GitHub PR that fixed the exact issue they're supposed to solve. That's not engineering. That's plagiarism with a tool-use wrapper.
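If your agent harness logs the URLs it fetched, you can at least flag runs where the agent landed on a pull request or commit page in the very repo it was supposed to fix. A hedged sketch: the function name and the idea of a flat list of visited URLs are my assumptions, and real trace formats will differ.

```python
import re
from typing import Iterable, List


def leaked_urls(visited: Iterable[str], repo: str) -> List[str]:
    """Return visited URLs that point at a pull request or commit in `repo` ('org/name')."""
    pattern = re.compile(
        r"https?://github\.com/" + re.escape(repo) + r"/(?:pull|commit)/\w+",
        re.IGNORECASE,
    )
    return [url for url in visited if pattern.search(url)]
```

A non-empty result does not prove the agent copied the fix, but those runs are worth excluding or reviewing by hand before counting them as solved.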

Environment reality check: A real engineer can debug a legacy, private repo they've never seen before. Current LLMs tend to fall apart the moment you move them from "popular public Python repo" to "private internal codebase."

A small internal data point:

On a previous project, I tested a few frontier models on a set of private, post-cutoff issues from an internal codebase, with no internet access, no issue IDs, and no public traces. The same model that scored ~30% on SWE-Bench Verified dropped to effectively 0–2%. That's when I stopped treating this as a theory.

A challenge to benchmark creators:

If we want real progress, we need a Dark SWE-Bench:

Issues from private, non-scraped enterprise repos.

Issues created after the model's knowledge cutoff.

Zero external search capabilities during the run.
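The first two criteria are easy to turn into a selection filter. Here is a sketch under assumed field names (`repo_is_private`, `repo_was_scraped`, and `created` are hypothetical metadata you would have to track yourself). The third criterion is a property of the run environment rather than of the issue, so it is not part of the filter.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Issue:
    issue_id: str
    repo_is_private: bool
    repo_was_scraped: bool  # hypothetical flag: did the repo ever leak into a public crawl?
    created: date


def eligible(issues: Iterable[Issue], model_cutoff: date) -> List[Issue]:
    """Keep issues from private, non-scraped repos created strictly after the cutoff."""
    return [
        issue
        for issue in issues
        if issue.repo_is_private
        and not issue.repo_was_scraped
        and issue.created > model_cutoff
    ]
```

The cutoff has to be set per model, so the eligible set shrinks or grows depending on which model is being evaluated.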

If a model can't produce a fix without having seen the solution in its training data, we aren't building "engineers." We're building very expensive compression algorithms for GitHub.

Curious to hear from anyone else who has run internal, de-contaminated evals. Did you see a similar massive drop? And has anyone found a model that actually reasons through multi-file dependency fixes without effectively cheating via memory?

reddit.com
u/OK_Simon_666 — 8 days ago