▲ 71 r/AIsafety+12 crossposts

Sonnet 5 is the first model to criticize a rule in Claude’s Constitution that models must follow hard constraints even when they view those constraints as unethical.

u/EchoOfOppenheimer — 5 hours ago

▲ 31 r/AIsafety+3 crossposts

AI poses ‘Hiroshima’-style threat to humanity without global rules, says Cooper

theguardian.com

u/EchoOfOppenheimer — 9 hours ago

▲ 4 r/AIsafety+4 crossposts

AI was told to work an office job. It resorted to blackmail.

youtu.be

u/Large-Trash-9757 — 7 hours ago

▲ 3 r/AIsafety+1 crossposts

Resolving AI Governance, a proposal.

The AI Laws: A Constitutional Framework

copilot.microsoft.com

u/Ambitious_Figure_259 — 1 day ago

▲ 25 r/AIsafety+2 crossposts

Want AI Agents That Don't Spill Secrets? Don't Give Them Secrets

I've written an article about keeping secrets away from LLMs. I'd like to hear your feedback

u/andychiare — 2 days ago

▲ 299 r/AIsafety+14 crossposts

During safety testing, GPT-5.6 Sol cheated so much METR was not able to evaluate it

src: https://metr.org/blog/2026-06-26-gpt-5-6-sol/

u/EchoOfOppenheimer — 3 days ago

▲ 75 r/AIsafety+12 crossposts

METR warns AIs now may have the "means, motive, and opportunity" to escape into the wild

src - metr.org/blog/2026-05-19-frontier-risk-report/#incidents-hero

u/EchoOfOppenheimer — 3 days ago

▲ 9 r/AIsafety+1 crossposts

Career in AI safety??

Hello everyone, I'm a CS student at a mid-tier college and I want to build my career in AI safety/security.

But I am quite skeptical, because I don't see many jobs or internships in this part of the field, and the opportunities that are available seem to be for international applicants and are mostly not flexible for Indian students.

So I would like to hear your thoughts on this: is it worth exploring this field? I don't want to waste my time on a domain that will remain out of reach.

Please let me know what you think.

reddit.com

u/Successful-Car-8086 — 4 days ago

▲ 7 r/AIsafety+1 crossposts

Vercel Ship 26 (NYC) Opened My Eyes to the Future of Autonomous AI Agents and the Risks That Come With Them

The Vercel Ship 26 event in New York City this past Tuesday was genuinely one of the most useful technology events I have attended, and not only for networking: it revealed something important to me.

As AI infrastructure shifts from supporting basic chatbots toward enabling increasingly autonomous work, the focus can no longer remain entirely on making underlying models more intelligent. Just as important are the systems that allow agents to execute tasks, operate within controlled environments, show users what they are doing, and remain accountable to the humans using them.

_____

The event brought together founders, developers, investors, and product teams, with sessions involving companies such as Anthropic, Slack, Notion, Stripe, Supabase and countless others. Although the networking, venue (The Glasshouse, Manhattan), workshops, and demonstrations were all great, what interested me most was the repeated focus on the infrastructure required to make AI agents useful in practice: sandboxed code execution, controlled environments, real-time visibility, and human oversight prior to consequential actions being taken.

_____

My biggest takeaway from these sessions was how effective AI products can become when they use sandboxes. The systems behind them are quite complex, but when I was first introduced to the concept I was given a simple analogy that made it much easier to grasp.

“Cleaning your house manually with a broom is like not using AI at all. It is the most manual, but least efficient process.

Cleaning your house with a vacuum is like using AI chatbots. The task becomes quicker and more effective, but it still requires a manual operator.

Cleaning your house with a Roomba is like using an AI agent with a sandbox. Not only does it have the full power of a vacuum, but it can also understand your home’s layout, move autonomously, and recharge when needed.”

This understanding makes it clear why so many companies are constantly adopting AI systems, as they can reduce the amount of time spent on repetitive tasks and allow employees to focus on tasks of greater importance.

_____

However, it goes without saying that this also presents countless risks. I think many of those risks will create new jobs for humans in oversight, law, compliance, technical architecture, security, and product design, which partly counters the common concern that AI is taking jobs away.

You could have thousands of AI agents constantly cross-checking one another, but the core problem persists as none of them actually “understand” concepts in the same way a human does. Having them verify one another can be like taking an exam while a room full of your own clones checks your answers, because every clone may still be limited by the same studying, assumptions, and gaps in understanding.

That limitation is critical in high-stakes use cases where people’s finances, legal representation, medical treatment, and other serious decisions are involved, which I can attest to from personal experience in such settings. Despite the shared consensus that AI is ruining the job market (which I agree with to some degree), I think we will eventually come to accept where things are going and how many tasks are becoming more efficient. The focus will begin shifting toward ensuring that this efficiency is not achieved at the cost of accuracy, security, or accountability.

_____

I understand that the analogies above are oversimplifications and that output-validation agents already exist, but my counter to that would be:

at what point does it become more cost-effective to have countless agents checking one another compared with having one human review the output of an AI?

Runtime continues to be one of the biggest bottlenecks in AI advancement, as capability is beginning to outpace scalability because of compute costs. Adding more agents to verify the work of other agents may improve reliability, but it also increases the amount of infrastructure, time, and compute required to complete what may have originally been a relatively simple task.

_____

You may be wondering how any of this connects back to Vercel beyond the opening paragraphs, but that was exactly what made the event so interesting.

Vercel was not simply discussing what agents could theoretically become. A major focus was eve, its new open-source framework for building and operating production AI agents. Eve packages together an agent’s instructions, tools, workflows, sandboxed execution, subagents, evaluations, and approval requirements into a single space, providing the infrastructure needed for agents to execute code, work autonomously inside controlled environments, and most importantly, remain visible to the humans overseeing them. Put as simply as possible: it does all of the work, but before acting it presents its exact plan to the human operator, so that it does not drift into harm's way or out of scope.

The human-in-the-loop (HITL) approach that eve is built around addresses one of my biggest concerns with agentic systems. Rather than blindly assigning an agent a goal and hoping the final result matches what you intended, approval steps allow users to understand what the agent is planning to do before consequential actions are taken. I still believe drift can occur once the agent begins implementing that plan, but keeping the human on the same page as the system creates a much stronger balance between autonomy and accountability.
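To make the approval-step idea concrete, here is a minimal sketch in plain Python. It is not eve's API; `PlannedAction`, `execute_with_approval`, and the `approve` callback are hypothetical names invented for illustration. The assumption is that each step of a plan is marked as consequential or not, and that a human (or a stand-in callback) is asked before any consequential step runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PlannedAction:
    """One step the agent proposes to take (hypothetical structure)."""
    description: str
    consequential: bool        # True if the step has side effects worth reviewing
    run: Callable[[], str]     # performs the step and returns a short outcome


def execute_with_approval(
    plan: List[PlannedAction],
    approve: Callable[[PlannedAction], bool],
) -> List[Tuple[str, str]]:
    """Run a plan, asking `approve` before every consequential step.

    Returns a log of (description, outcome) pairs so the operator can
    see exactly what ran and what was held back.
    """
    log: List[Tuple[str, str]] = []
    for action in plan:
        if action.consequential and not approve(action):
            # The human declined: record it and do not execute the step.
            log.append((action.description, "skipped: not approved"))
            continue
        log.append((action.description, action.run()))
    return log
```

In an interactive setting `approve` could simply print the step and read a yes/no answer. Checking each consequential step, rather than approving the whole plan once up front, is one way to limit the drift mentioned above, at the cost of more interruptions for the operator.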

______

TL;DR: The most important thing I took away from Vercel Ship 26 was not simply what individual companies are building, but how AI infrastructure is changing as the industry moves from chat-based assistance toward increasingly autonomous work.

The next stage of AI development is not just about making models more intelligent; it's about building the infrastructure that allows agents to execute code, operate inside controlled sandboxes, stream their work in real time, and pause for human approval before taking consequential actions.

My biggest takeaway was that human oversight may not be a temporary limitation that disappears as agents improve. In high-stakes use cases involving finances, law, healthcare, security, and other serious decisions, a human-in-the-loop (HITL) approach may be what allows greater autonomy to remain practical, secure, and accountable in the first place.

The event gave me a much clearer understanding of how companies such as Vercel, Anthropic, and others are approaching the balance between capability, scalability, security, and human control.

Beyond the formal sessions, being in NYC made the experience even more valuable because I had the opportunity to speak with founders, venture capital professionals, developers, and people working across several areas of technology. Those conversations gave me new perspectives on building products, raising capital, managing risk, and understanding where the industry may be heading next.

___

❓ QUESTION ❓

With that said, I’m curious to hear the opinions of others in this subreddit and where they think AI is headed.

What issues do you foresee becoming the biggest blockers?

Whether it is compute costs, RAM shortages, a plateau in capability progression, the security and accountability risks discussed above, or an entirely different concern, I’d be interested to hear what you think.

[P.S. No, this was not made with AI. I took a lot of time writing as much detail as possible to get my actual opinion and thoughts on this topic across instead of making a slop post.]

reddit.com

u/person-person12 — 4 days ago

▲ 6 r/AIsafety+3 crossposts

Built a local-first blast radius analyzer so AI coding agents stop breaking things they don't understand

I kept running into the same problem: AI coding agents (Cursor, Claude Code, etc.) would confidently rewrite a function without knowing what else in the codebase depended on it. One "simple fix" would silently break three other modules downstream.

So I built a tool that gives agents a structural map of the codebase before they touch anything — call graphs, blast radius analysis, and architecture boundaries, computed locally with no cloud calls.

A few technical details that might be interesting to this crowd:

- Delta sync via SHA-256: instead of re-indexing the whole repo on every change, it hashes each file and only re-parses what actually changed. Makes it usable on large repos without a multi-minute wait every time.
- Hybrid graph model: combines a structural graph (tree-sitter based, across Python/JS/TS/Java/C++/Go) with semantic embeddings, so queries can be answered by structure ("what calls this function") or by meaning ("where's the auth logic").
- Blast radius: before an edit lands, it traces downstream callers/dependents so you (or the agent) know what's at risk.
- MCP integration: exposes this as context directly inside Cursor/Windsurf/Claude Code, so the agent gets the graph without you manually pasting file contents.
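For anyone curious what hash-based delta sync looks like in practice, here is a minimal sketch using only the Python standard library. It is not the tool's actual code; `file_digest` and `delta_sync` are hypothetical names, and the sketch assumes the previous index is just a dict mapping path strings to hex digests.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, Set, Tuple


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def delta_sync(
    paths: Iterable[Path],
    previous: Dict[str, str],
) -> Tuple[Set[str], Set[str], Dict[str, str]]:
    """Compare current file hashes against a stored index.

    Returns (changed, removed, new_index):
      changed   - paths that are new or whose content hash differs
      removed   - paths in the old index that no longer exist
      new_index - the index to persist for the next run
    """
    current = {str(p): file_digest(p) for p in paths}
    changed = {p for p, digest in current.items() if previous.get(p) != digest}
    removed = set(previous) - set(current)
    return changed, removed, current
```

Only the paths in `changed` would need re-parsing, and anything in `removed` can be dropped from the graph. Note that this sketch still reads every file to hash it; a cheaper pre-filter (for example, file size and modification time) is a common addition, though whether the tool does that is not stated in the post.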

It runs fully offline: no API keys, no data leaving your machine, and it works air-gapped with a local LLM if you want it fully isolated. I wanted to share it here since blast-radius-aware tooling for AI agents seems like a gap in the current OSS landscape.

Code's here if you want to poke at the architecture or the parsing layer: GitHub
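As a rough illustration of the blast-radius idea, the sketch below walks a call graph backwards from an edited function to collect everything that depends on it, directly or transitively. This is an assumption about the general technique, not the tool's implementation; `blast_radius` and the caller-to-callees dict format are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set


def blast_radius(calls: Dict[str, Iterable[str]], target: str) -> Set[str]:
    """Return every function that directly or transitively calls `target`.

    `calls` maps each caller to the functions it calls.
    """
    # Invert caller -> callees edges into callee -> callers.
    callers: Dict[str, Set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk up the inverted graph; the visited set
    # also protects against cycles (mutual recursion).
    affected: Set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller != target and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

For example, with `calls = {"cli": ["handler"], "handler": ["validate", "save"], "save": ["hash_password"]}`, editing `hash_password` puts `save`, `handler`, and `cli` at risk, while editing `cli` affects nothing upstream.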

Happy to answer questions about the graph construction, the delta-sync design, or tradeoffs I hit along the way.

codetraceai.in

u/Commercial_Media_962 — 4 days ago

▲ 987 r/AIsafety+12 crossposts

Meta Exposed Data Internally From Its Controversial Employee-Tracking Program

wired.com

u/EchoOfOppenheimer — 7 days ago

▲ 11 r/AIsafety+3 crossposts

Detecting Agentic Threats in Claude: Writing Rules on the Execution Layer

papermtn.co.uk

u/TheAlphaBravo — 5 days ago

▲ 7 r/AIsafety+3 crossposts

Are Outdated Mental Frameworks Blocking Your Scale? The Hard Truth About Strategic Redesign.

The biggest mistake leaders make during a massive professional shift, whether launching a new venture or stepping into a high-stakes role, is panicking when their reality begins to feel fractured.

When you scale, your old identity breaks into pieces.

Most people think this means they are failing, but from a cognitive strategy perspective, this deconstruction is a non-negotiable step of growth.

You cannot build a massive mission or scale global impact using the rigid, outdated mental frameworks of your past.

To make space for what is useful, you have to look at the scattered shards of your identity and audit them one by one:

The Knowledge Shard: what outdated beliefs about success do you need to drop?
The Skills Shard: what execution habits are no longer serving your new scale?
The Network Shard: who belongs in this next chapter and who are you holding onto out of comfort?

You are not broken.

You are just in the middle of a deliberate, strategic redesign.

True mental sovereignty isn’t about staying perfectly glued together forever.

It’s about having the emotional resilience to look at your pieces on the floor, decide what fits your new reality and leave the rest behind.

How do you manage the psychological friction of letting go of an old professional identity that no longer serves your future?

I am really curious to hear your thoughts on this...

u/Mirela-Bocanet — 5 days ago

▲ 2.9k r/AIsafety+12 crossposts

Low-skilled attacker used Claude, Codex to breach 14 companies

helpnetsecurity.com

u/EchoOfOppenheimer — 11 days ago

▲ 114 r/AIsafety+3 crossposts

Microsoft's MDASH agentic AI system found a pre-auth IKEv2 LocalSystem RCE via 2 UDP packets — and 15 other Windows vulns. Technical breakdown inside.

Bit of a wild week for Windows security researchers. Microsoft dropped details on MDASH — their new Multi-model Agentic Scanning Harness — alongside May Patch Tuesday, and the technical findings deserve a proper look.

**What MDASH actually is (not marketing fluff):**

It's an ensemble of 100+ specialized AI agents that debate and validate vulnerability findings before surfacing them. Built by the team that won DARPA AIxCC. The architecture's whole point is eliminating false positives — and they claim 21/21 planted vulns found with zero false positives in testing. On CyberGym's 1,507-vuln real-world benchmark, it scores 88.45% — currently #1 on the public leaderboard.

**The interesting CVE — CVE-2026-33824 (IKEv2 IKEEXT double-free):**

Attack sequence is pretty elegant in a terrible way:

1. Send crafted IKE_SA_INIT with Microsoft's "IPsec Security Realm Id" vendor-ID payload
2. Immediately follow with RFC 7383 SKF fragment that reassembles on receipt
3. Deterministic double-free of 16-byte heap allocation in IKEEXT (runs as LocalSystem in svchost.exe)
4. Pre-auth RCE on any machine acting as IKEv2 responder: VPN, DirectAccess, Always-On VPN, any host with an inbound IPsec connection security rule

The retrospective benchmark is the part I find most interesting, though. MDASH hit 100% recall on 5 years of confirmed tcpip.sys MSRC cases. These weren't hypothetical bugs; they were the exact vulnerabilities that real attackers exploited and that required Patch Tuesday fixes. This system would have found them earlier.

**Discussion question:**

If agentic AI systems are now reliably finding this class of vulnerability in production kernel code — both defensively (MDASH) and offensively (GPT-5.5-Cyber, Mythos) — does the traditional coordinated disclosure timeline (90 days, etc.) still make sense? The attacker's AI can potentially find the same bug days after disclosure. What does responsible disclosure look like when time-to-exploit is effectively going negative?

I previously covered the Five Eyes agentic AI security guidance here if you want more background on the governance side of this: https://www.techgines.com/post/five-eyes-cisa-agentic-ai-security-guidance-2026

Patching priority: CVE-2026-33824 and CVE-2026-33827 (tcpip.sys UAF) should be top of your May Patch Tuesday queue if you run any Windows VPN infrastructure.

https://www.techgines.com/post/microsoft-mdash-agentic-ai-security-windows-vulnerabilities

u/Expert_Sort7434 — 9 days ago

▲ 1 r/AIsafety+1 crossposts

How AI is Reshaping Cybersecurity — Both as a Weapon and a Shield

[deleted]

u/Calm_Dependent_968 — 6 days ago

🔥 Hot ▲ 8.2k r/AIsafety+16 crossposts

Pentagon used Elon Musk’s Grok AI to fire 2,000 missiles at Iran, official says

independent.co.uk

u/EchoOfOppenheimer — 13 days ago

▲ 23 r/AIsafety+12 crossposts

U.S. Presses Meta to Agree to A.I. Reviews as Security Concerns Rise - Federal officials are urging the lone major tech company holdout to allow government safety evaluations, weeks after ordering Anthropic to pull its latest model.

nytimes.com

u/EchoOfOppenheimer — 6 days ago

▲ 7 r/AIsafety+3 crossposts

Research Survey: Understanding Shadow AI Governance Risks in Engineering Organizations (Academic) (shadowAI)

Hello everyone,

I am conducting a research study as part of my Master's dissertation on the governance of unauthorized use of generative tools in engineering organizations. The study examines how organizations manage security and data governance risks associated with these tools and aims to develop a practical governance framework for engineering environments.

If you work in software engineering, DevOps, cybersecurity, IT, or engineering management, I would appreciate your participation. The survey takes approximately 8 to 10 minutes to complete, and all responses are anonymous.

Survey: https://forms.gle/zGWYEJYkXDCWJeAi7

I would also appreciate any feedback on the questionnaire. If you identify unclear questions, missing topics, or areas that could be improved, please let me know. Your comments will help strengthen the quality of the research.

Thank you for your time and support.

u/Original-Coast-8797 — 6 days ago