Microsoft just beat Anthropic's best model without having a frontier model of its own. And it used Anthropic's own model to do it.
So this dropped May 12 and I don't think enough people are talking about it in the context of agentic AI.
Microsoft built something called MDASH, an AI security system, and topped a benchmark called CyberGym with 88.45%. Anthropic's Mythos Preview came second at 83.1%, OpenAI's GPT-5.5 third at 81.8%.
Here's the thing. Microsoft doesn't have a frontier model. MDASH runs on publicly available models, including models from the exact companies it just outranked. It assembled 100+ specialized agents across a five-stage pipeline (preparation, scanning, verification, deduplication, proof), with different agents and different models handling each stage. Large models for heavy reasoning. Smaller distilled models for high-frequency verification. And the whole system is model-agnostic: swap the underlying model and the pipeline stays.
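To make "model-agnostic" concrete, here's a toy sketch of what a staged pipeline like this could look like. To be clear, this is my own illustration, not MDASH's code: the `Stage` class, the stub models, and the string-matching "detection" logic are all hypothetical stand-ins. The only things taken from the announcement are the five stage names and the idea that each stage is wired to a model slot rather than a specific model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any callable from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class Stage:
    name: str
    model_key: str  # which slot in the model registry this stage uses
    run: Callable[[Model, List[str]], List[str]]


def prepare(model: Model, items: List[str]) -> List[str]:
    # Split the target into units of work (here: non-empty lines).
    return [ln.strip() for text in items for ln in text.splitlines() if ln.strip()]


def scan(model: Model, items: List[str]) -> List[str]:
    # Keep only the candidates the model flags.
    return [i for i in items if model(f"scan: {i}") == "suspicious"]


def verify(model: Model, items: List[str]) -> List[str]:
    # High-frequency check, so this stage is wired to the small model slot.
    return [i for i in items if model(f"verify: {i}") == "confirmed"]


def dedupe(model: Model, items: List[str]) -> List[str]:
    # Order-preserving de-duplication; no model call needed in this toy.
    return list(dict.fromkeys(items))


def prove(model: Model, items: List[str]) -> List[str]:
    return [model(f"prove: {i}") for i in items]


STAGES = [
    Stage("preparation", "small", prepare),
    Stage("scanning", "large", scan),
    Stage("verification", "small", verify),
    Stage("deduplication", "small", dedupe),
    Stage("proof", "large", prove),
]


def run_pipeline(stages: List[Stage], models: Dict[str, Model], target: str) -> List[str]:
    items = [target]
    for stage in stages:
        items = stage.run(models[stage.model_key], items)
    return items


# Hypothetical stub models standing in for real LLM calls.
def stub_large(prompt: str) -> str:
    task, _, body = prompt.partition(": ")
    if task == "scan":
        return "suspicious" if "strcpy" in body else "clean"
    return f"proof sketch for [{body}]"


def stub_small(prompt: str) -> str:
    return "confirmed" if "user_input" in prompt else "rejected"


TARGET = "strcpy(buf, user_input)\nprintf(msg)\nstrcpy(buf, user_input)\nstrcpy(dst, CONST)"
findings = run_pipeline(STAGES, {"large": stub_large, "small": stub_small}, TARGET)
```

Swapping a model is just changing one entry in the dictionary passed to `run_pipeline`. The stage list never changes, which is the whole point.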
It used someone else's bricks and built a taller building.
And it's not just a benchmark flex. Microsoft used MDASH to find 16 actual Windows 11 vulnerabilities, 4 of them Critical-level remote code execution flaws. The fixes are in May's Patch Tuesday. Real CVEs, not demo outputs.
The architecture question this raises is the one I keep thinking about. If a multi-agent system can outperform the single model that powers it, what does that mean for how we build things on top of AI?
And this is where World's AgentKit becomes relevant.
Because as these agentic systems proliferate, the question stops being "can agents do the task?" and starts being "who authorized this agent, and is there actually a human behind it?" MDASH at 100+ agents is impressive. MDASH at scale, used by people who don't know what they're doing, or used by attackers who do, is a different story entirely. The article itself says this: MDASH uses all publicly available models. There are no exclusive technical barriers. Anyone can build this.
AgentKit is trying to solve the identity layer for exactly this world. Proof of human attached to agentic workflows so you know there's a verified person behind the action, without exposing that person's data. That becomes load-bearing infrastructure as agent systems move from experiment to production.
MDASH is the demo that shows the production era is already here.