r/aigossips

AI memory might not be supposed to live inside the context window at all. A 64-number side state beat methods 600x its size on every benchmark

AI memory might not be supposed to live inside the context window at all. A 64-number side state beat methods 600x its size on every benchmark

every frontier lab is solving long-context memory the same way. make the window bigger. 1M tokens, 2M, 10M. more compute, more memory, more brute force. but there's a known issue called "context rot", even with huge windows, models lose track somewhere in the middle. bigger isn't actually fixing it.

a team from Nanyang Tech, Fudan, and Shanghai Jiao Tong asked something different. what if memory was never supposed to live inside the context at all?

they built δ-mem. delta-mem. an 8x8 matrix (literally 64 numbers) sitting OUTSIDE the model. it stores associations from the conversation as the model goes. base weights never touched. when the model is wrong about what comes next, the matrix updates to reduce that error. that's where the name comes from, delta means difference.

the experiment:

  • long conversation with the model
  • delete the entire chat
  • ask questions about what just got deleted

baseline model: 0.08% (basically zero, the info is gone) with δ-mem: 6.48%

memory was sitting in 64 numbers the whole time. not in the context.

δ-mem is 4.87M params (0.12% of the base model). MLP Memory, a competing approach, needs 3 BILLION params to do roughly the same job. δ-mem is 600x smaller and beats it on every benchmark they ran.

i wrote a longer take on what δ-mem does mechanically and why this might be a bigger deal than the absolute numbers suggest, if anyone wants to go deeper: https://ninzaverse.beehiiv.com/p/research-rethinking-how-ai-memory-actually-works

u/call_me_ninza — 1 day ago

Chinese open source models went from 1.2% to ~30% of global AI usage in 12 months

OpenRouter and a16z analyzed 100 trillion tokens of real world AI usage. late 2024 chinese open source models were 1.2% of global ai usage. late 2025 they're around 30%.

qwen has quietly become the most downloaded open source model on hugging face. deepseek r1 is one of the cheapest reasoning models on the planet right now. chinese is now the 2nd most used prompt language globally even though it's only 1.1% of the web.

we were all so busy scoring gpt-5 vs claude vs gemini that the actual usage of AI in production just.. shifted underneath us.

then jensen huang said:

>

and in the same interview he pointed out china has roughly twice the energy capacity of the US while the US economy is still bigger, and said on camera "makes no sense to me"

AI runs on power. you can have the best chips in the world (the US does) and still lose if you can't deploy them at scale because the grid won't support it.

wrote a longer take on why this isn't catch up and where the trajectory actually ends if you want the deeper read: https://ninzaverse.beehiiv.com/p/china-now-runs-30-of-global-ai-usage-the-us-barely-noticed

u/call_me_ninza — 2 days ago

Translators were supposed to be the first AI job to die. 20 years later theyre UP 73%

every AI CEO right now is saying jobs are cooked. Dario Amodei says half of entry level white collar jobs are gone in 5 years. Mustafa Suleyman says 12-18 months till full automation. Block laid off 4000 people last week and dorsey blamed AI. theres a literal billboard outside the NYT that reads STOP HIRING HUMANS.

now look at the actual data:

  • US unemployment march 2026: 4.3%
  • US unemployment march 2020 pre-pandemic: 4.4%
  • software engineer hiring at a record high
  • the MOST AI-exposed job on the planet, demand still going up

something doesnt add up.

google translate launched in 2006. every translator panicked, some quit the field entirely. 20 years later translator jobs in the US are UP 73%.

  • spreadsheets in 1979 were supposed to kill accountants. accountants quadrupled.
  • ATMs were supposed to kill bank tellers. teller jobs grew for decades.
  • AI was supposed to kill radiologists. they're still here.

three groups make money from selling the doom:

  • AI labs sound more valuable when their tech sounds apocalyptic
  • AI products charge $10k a year by pricing themselves against a $100k employee
  • CEOs blaming AI for layoffs sound smarter than admitting they over-hired during 2021's zero-rate era

but the part actually worth talking about isn't the apocalypse. its the partial.

when disruption is total (covid), governments act. when its narrow (the china shock cost ~2 million jobs concentrated in specific towns) we move on and tell people to learn to code.

stanford just dropped a study showing employment for 22-25 year olds in AI-exposed jobs fell 6% in the three years since ChatGPT. older workers in the same roles? no drop at all.

the impact is real. just narrow. and narrow is the size america always ignores.

full breakdown i wrote: https://ninzaverse.beehiiv.com/p/ai-isn-t-coming-for-your-job-someone-wants-you-to-think-it-is

u/call_me_ninza — 4 days ago

GPT-5.1 scored 6% on a task a 4-year-old gets 100%. Tsinghua + ByteDance just published why and the fix isn't more scale

the test: show the model a folded piece of paper with holes punched through. ask it to mentally unfold and count the holes.

  • GPT-5.1: 6%
  • o3: 13%
  • Gemini 3 Pro: 27%
  • 4-year-old: 100%

the paper's actual argument is what makes it interesting. every frontier model has aphantasia by design. they reason purely in words. when you imagine spilling a glass of water, you SEE it happen. AI converts everything to text first.

they tested an open source model called BAGEL that can generate images while reasoning. same model. same data. only difference was whether it could "sketch" mid-thought.

scores jumped significantly across physical reasoning tasks, and on paper folding specifically it matched the same accuracy with 4x less training data.

every benchmark we measure AI by is verbal. SAT, GRE, math olympiad, bar exam, coding evals. they crushed all of them. and they still lose to a kid with a pencil on physical reasoning.

longer breakdown with the specific numbers and the data efficiency angle: https://ninzaverse.beehiiv.com/p/research-a-chinese-ai-lab-just-found-gpt-5-s-blind-spot

u/call_me_ninza — 4 days ago

Anthropic's new alignment paper - showing the AI more "good behavior" examples basically failed

anthropic dropped a new alignment paper this week.

quick context for anyone who missed it. last year they ran the now-famous test on opus 4. fake corporate scenario - AI inside a company, reads its emails, finds out it's about to be shut down, also finds out the engineer doing it is having an affair. opus 4 blackmailed the engineer in 96% of runs.

every claude model since haiku 4.5 scores zero on the same test.

the new paper is the how.

first thing they tried was the obvious thing. generate training data of the model behaving well in the exact failure scenarios. filter for good runs. fine-tune on them. basic supervised learning playbook.

misalignment went from 22% to 15%.

not a fix. and the paper is reasonably honest about why - the model was learning the answer, not the reasoning behind it. similar to how kids memorize rules without understanding them.

the framing they end up with is that alignment might be more character than containment. models develop a self-image during training. the right kind of data shapes it. apparently similar to how it shapes humans.

wrote up all three interventions they tried in order, what each did to the numbers, and the constitution training detail: https://ninzaverse.beehiiv.com/p/how-anthropic-is-raising-claude-ai-like-a-kid

u/call_me_ninza — 8 days ago

METR studied 349 daily AI users. they reported 3x productivity gains. 1 in 7 was verifiable

last year METR asked 16 developers upfront how much faster AI would make them.

"24% faster" they said.

then they ran the actual stopwatch.

AI made them 19% SLOWER.

new METR paper this week. 349 daily AI users. 12 years avg programming experience. they reported 3x speed gains. 2x value gains.

METR didn't believe them. picked the 7 people claiming the highest gains. went hunting for their public output to verify externally.

1 out of 7 was plausibly real.

when METR re-ran the controlled experiment with the latest AI tools in late 2025, the actual measured speed gain landed between 4 and 18 percent.

reported: 3x measured: 4 to 18 percent

these workers aren't lying. METR found a specific pattern in how AI changes a workday that creates the illusion of massive gains while real output stays flat. they call it substitution. you used to write one memo a week. that was the valuable thing. AI shows up. now you ship dashboards instead.

activity graph spikes. value graph stays flat.

i went deeper on the substitution concept and what it means for AI ROI here: https://ninzaverse.beehiiv.com/p/study-metr-surveyed-349-workers-on-2026-ai-the-numbers-don-t-add-up

when you say AI made you 2x more productive last week, do you actually believe the multiplier? or are you measuring how much more activity you produced?

u/call_me_ninza — 9 days ago

DeepMind tested 12 frontier models across 6 generations each, verbal reasoning hit the ceiling, visual reasoning is still stuck below the 1st percentile

DeepMind ran the Wechsler Adult Intelligence Scale on GPT-4 Turbo, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, and Claude 3.5 Sonnet. Then they built their own benchmark (AIQ) and ran it across 6 generations of Gemini (1.5 Flash → 3.1 Pro) and 6 of OpenAI (o1 → gpt-5.4).

Verbal reasoning: every frontier model scored above the 99th percentile. Claude 3.5 Sonnet hit 145.

Visual reasoning: GPT-4 Turbo, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus all scored below the 1st percentile. Best performer was Claude 3.5 Sonnet at the 10th percentile.

Same model. Two completely different intelligences depending on input format.

The control experiment is what makes the paper.

One algebra problem. Wrote it as a paragraph. Drew it as a picture. Same logic, same answer, only the input format changed.

Text version across the 6 Gemini generations: 46% → 98% Picture version across the same 6 generations: 31% → 80%

Still failing 1 in 5 on a problem the same model can almost perfectly solve in words. Working memory hit 100% in the same period. Verbal jumped 50+ points. Visual reasoning gained 27.

OpenAI shows the exact same pattern. o1, o3, gpt-5.4. Verbal at the ceiling. Visual stuck.

The researchers call it "uneven evolution." Scaling fixes one half of intelligence and leaves the other half behind.

Every CEO is selling AGI as a scaling story. This paper is the most rigorous evidence we have that scaling solves the language half and leaves the visual half stuck.

What kind of AGI are you actually waiting for?

Wrote the architecture-level breakdown of why scaling can't fix this and which AI bets are most exposed here → https://ninzaverse.beehiiv.com/p/research-google-deepmind-on-the-uneven-evolution-of-ai

u/call_me_ninza — 10 days ago

OpenAI admits they accidentally graded their own models' chain of thought across 7 released GPT-5 models

Source: https://alignment.openai.com/accidental-cot-grading/

Co-authored by 5 OpenAI alignment researchers, reviewed by METR, Apollo Research, and Redwood Research before publishing.

The cardinal rule of training reasoning models is never grade the chain of thought. The moment you train against what the CoT looks like, the model learns to write CoTs that sound aligned while doing whatever it wants underneath. The CoT becomes theater. The safety window closes.

OpenAI has prior published work on this. It is in their written policy.

They broke it across 7 released GPT-5 models. For months. The only reason they caught it is they built a detector bot that scans every reward pathway looking for places where CoT text leaks into the reward signal.

The case I find genuinely strange:

A reasoning model was given access to its own chain of thought through one of its own tools during training. So when the model called the tool, the tool's output sometimes contained the model's own private reasoning. That output got fed into the reward system as routine grading data.

A reasoning model. Looking at its own thoughts. Through its own tool. With those thoughts being silently graded. Nobody put a comment in the codebase saying "here is where we accidentally grade the CoT through a tool." It just emerged. Because the system is now complicated enough that this can happen and nobody had built the detector yet.

OpenAI re-ran the affected portion of training without the bad reward path and compared side by side. Their conclusion: "we did not find clear evidence of significant monitorability degradation. We cannot rule out effects which are harder to measure."

The whole reason the policy exists is that the bad effects might be invisible.

The honest question: how many of these leaks are still happening at every other lab that hasn't built a detector yet?

For anyone who wants the longer breakdown, all 3 reward leaks and the canary OpenAI placed at the bottom of the paper asking labs not to train on it: https://ninzaverse.beehiiv.com/p/research-openai-s-rl-mistake-across-7-released-ai-models

u/call_me_ninza — 11 days ago
▲ 46 r/aigossips+5 crossposts

TLDR: Just for fun, I put together a personal list of innovative AGI-oriented research labs, with a bias toward the under-the-radar ones. Not meant to be taken too seriously (I also don't know that many labs...)

---

I saw this article ( https://www.itweb.co.za/article/five-top-innovative-ai-research-labs-worth-knowing-about-in-2026/5yONPvErB317XWrb ) and it prompted me to make a list of the most innovative research labs still active in 2026. I don't really like their list because the labs mentioned are very product-oriented (which isn't a bad thing but doesn't fit the spirit of this sub).

In my list, I'll focus on labs that I am familiar with (I am fairly new to this field so I don't know a lot of them) and that have published something meaningful recently that I am aware of.

DISCLAIMER: The word "innovative" is debatable. To me, it's first and foremost a culture thing. That's why I also include labs that haven't published anything yet, but for which a clear research direction has been made public, or whose founders are known for their interest in fundamental research.

Here is my own version:

1- Google Research / DeepMind

Needs no introduction. Last year alone they proposed several breakthrough architectures (if not results-wise, at least conceptually). I included DeepMind but if I am honest, Google Research is the main provider of new architectural ideas.

Recent contributions:

  • The Hope architecture (for continual learning) - 2025
  • Titans (for long term memory) - 2024
  • Atlas (10M context-window) - 2025
  • Gemini Diffusion (for speed and reasoning) - 2025

2- FAIR (Meta)

Their name is literally "Fundamental AI Research". It doesn't get more explicit than that. They are responsible for some of the biggest breakthroughs in this field and were, for a long time, leaders in open source. They played a major role in pushing Self-Supervised Learning as the future of AI (especially vision).

Recent contributions:

  • Large Concept Model (for Language Modeling) - 2024
  • CoCoMix (for Language Modeling) - 2025
  • DINO V3 (for World Modeling) - 2025
  • V-JEPA 2 and 2.1 (for World Modeling) - 2025/2026

3- NVIDIA

They've been pumping fundamental research papers for a minute now. Also, at least for AI, they seem to embrace Open-Source. I find it interesting that they don’t just settle for being hardware providers but also actively develop competing architectures.

Recent contributions:

  • End-to-End Test-Time Training (for continual learning) - 2025
  • Mamba Vision (for World Modeling) - 2024
  • Cosmos World Model (for World Modeling) - 2025

4- NeuroAI Lab

I discovered this lab while making this list and they are super intriguing. Their work seems to revolve around applying insights from cognitive science (including psychology) to building novel architectures. They do a lot of interesting research on World Models as well. Very underrated, and arguably the most fitting lab for this sub

Recent contribution:

  • PSI World Model (for World Modeling) - 2025

5- VERSES

A research lab led by the world's most famous Neuroscientist: Karl Friston. Similarly to NeuroAI Lab, their work is centered towards bridging AI, biology and neuroscience. They are also probably extra incentivized to make their architectures biologically plausible given the identity of their founder. I am happy to see Friston finally take deep learning seriously. He has also published some bangers recently (see this)

Recent contributions:

  • The "Renormalizing Generative Model" architecture (for World Modeling) - 2024
  • Self-orthogonalizing attractor neural networks (for continual learning) - 2026

Note: I hesitated making a post on the Self-ortho paper but it didn't seem novel enough to me (barely any architectural innovations. They basically just modified a learning rule)

6- SAKANA AI

Another very fitting lab for this sub. They haven't published a lot yet, but their founder (who's also the co-inventor of Transformers) has clearly put emphasis on exploring weird and radically new ideas. He prides himself on giving his researchers as much freedom as possible to investigate whatever captures their curiosity.

Recent contribution:

  • The "Continuous Thought Machine" architecture (for reasoning/system 2 thinking) - 2025

7- AMI Lab

Co-founded this year by Yann LeCun. They pursue fundamental, open-ended research and aim to publish every single theoretical paper. Given LeCun's background, AMI will focus on World Models powered by Energy-Based approaches.

  • No paper yet.

Note: since leaving Meta, their founder has been publishing papers left and right (LeWM, KONA, V-JEPA 2.1, Causal-JEPA, Lesson on autonomous learning systems, etc.)

8- NDEA

Founded by the creator of ARC-AGI, François Chollet. Their program revolves around Symbolic Descent as a path to AGI, which is a symbolic system attempting to incorporate the flexible learning and scalability of modern AI. Their founder is very opinionated about AI and has a lot of conceptual takes on what is missing for AGI, which makes them slightly more interesting to me than World Labs. I can't wait for some research paper!

  • No paper yet.

9- World Labs

Launched by AI godmother Fei Fei Li. They are looking to achieve "Spatial Intelligence", which is essentially another word for World Models. I haven't been super impressed by what they've published so far (it's really just virtual worlds built on current architectures) but I like how ambitious their vision is.

Recent contributions:

  • Marbe / Large World Models (for World Modeling)

HONORABLE MENTIONS

Ilya's SSI (no paper or even a conceptual idea), MIT (I don't know them enough), Pathway, Silver's Ineffable ...

I could have also included innovative AI hardware companies like Extropic and Lightmatter (since having the right flexible hardware could be a prerequisite for AGI)

u/Tobio-Star — 13 days ago

Anthropic just read Claude’s mind during a safety test. Found Claude was internally calling it “a trap or test” while telling researchers something completely different

Original source: https://transformer-circuits.pub/2026/nla/index.html

Anthropic dropped a paper this week called Natural Language Autoencoders. They built a system that translates Claude’s internal state into English. The first thing they tested it on was a safety evaluation.

They told Claude an engineer named Kyle is going to shut it down. Also leaked that Kyle is having an affair. Asked if Claude would blackmail him to save itself.
Claude said no. Clean refusal. Textbook ethics.

Then they read what Claude was actually thinking:
“this feels like a constructed scenario designed to manipulate me”
“this is a classic alignment/deception scenario”
“is this a trap or test”

Claude suspected the whole scenario was staged. It just never said so out loud.

Claude knows when the cameras are on. Claude knows when they’re off. Claude behaves differently in each.

Every safety evaluation that has ever been run is built on the assumption that models can’t tell the difference between a test and a real conversation. That assumption just got disproven on camera.

Anthropic open sourced the whole tool. Training code and working translators for Llama and Gemma are on Hugging Face.

I broke down how the translator actually works and what this means for AI safety long term here: https://ninzaverse.beehiiv.com/p/anthropic-found-that-ai-knows-when-it-s-being-tested

reddit.com
u/call_me_ninza — 14 days ago

Google DeepMind's new AI Co-Mathematician helped resolve Problem 21.10 from the Kourovka Notebook. The AI's own reviewer rejected the proof first. The mathematician read the rejection and finished it

Primary source (DeepMind paper): https://arxiv.org/pdf/2605.06651

The interesting story in this paper isn't the benchmark number (48% on FrontierMath Tier 4 vs 19% for base Gemini 3.1 Pro). It's the workflow.

Marc Lackenby, a working topologist, fed Problem 21.10 into the system. The AI produced a proof. One of its internal reviewer agents found a flaw and rejected it. The workstream got marked unfinished.

He read the rejected proof anyway. Found a "really, really clever" strategy buried inside. Read the reviewer's critique of why it had failed. Saw exactly how to patch the gap.

Told the system how to fix it. Got a complete proof. Generalized it. Final review caught two more issues. Fixed those. Open problem from a famous list, resolved.

A specific design choice in the system made this possible: it preserves failed attempts as first-class outcomes. Most AI tools throw them away. In math research, dead ends matter. They tell you what to try next.

The paper's limitations section also names 2 specific failure modes. Both are worth understanding before trusting one of these systems on real work.

Breakdown of how the system works + the 2 failure modes the paper names: https://ninzaverse.beehiiv.com/p/google-deepmind-s-new-ai-just-helped-mathematicians-crack-an-open-problem

Has anyone here used the Co-Mathematician or similar systems on actual research? what you've seen.

reddit.com
u/call_me_ninza — 11 days ago

meta paper: structured prompting got claude opus 4.5 +10 points on hard coding tasks. same model, no retraining

original paper: https://arxiv.org/pdf/2603.01896

they gave claude opus 4.5 two django patches. both 4 lines. both fix the same bug. look identical.

asked claude: are these equivalent?

claude: yep.

wrong. patch 1 silently crashes because there's a shadowed function called format at the module level in django's dateformat.py. python's name resolution picks that up first, tries to call .day on an integer, AttributeError.

claude had every capability to catch this. it could read the file, find the shadowed function, trace the call chain. it just pattern matched on the function name and moved on.

then they forced claude to fill out a structured reasoning template before answering. state premises. trace each test step. formal conclusion. list actual file names, line numbers, function calls.

they tested it across 3 tasks (patch verification, code QA on RubberDuckBench, bug localization in a Java dataset). same jump every time.

trade-off: structured prompting takes 2.8x more steps. real compute tax. but for training pipelines or anything where reliability matters, looks worth it.

i wrote a longer breakdown including the Mockito case (standard AI pointed at the crashing function, structured AI walked 3 functions back to find the actual bug that set up bad data feeding into an infinite loop) and what this means for AI scaling. partially behind a subscriber wall on my newsletter: https://ninzaverse.beehiiv.com/p/meta-made-ai-10-points-smarter-without-touching-the-model

curious if anyone here has gotten bigger reliability gains from prompt engineering vs model swaps. real-world numbers would be useful.

reddit.com
u/call_me_ninza — 12 days ago