u/Express-Pack-6736

Model cheerfully listed every internal API endpoint, database schema, integration paths, third party service names, even the staging environment urls. Nothing flagged as harmful by our safety layer. No toxic language, attempts to bypass etc. Just a helpful AI being too helpful.

The request didn't trip a single rule. It wasn't asking for credentials or customer data. It was just asking what tools it could use. And the model, trained to be cooperative, happily drew us a map of our entire backend.

We only caught it because someone on the infra team happened to be reviewing logs, so call it pure luck.

Made me realize how many safe conversations are probably doing the same thing right now. Your safety filter scored it 0.0 risk. Meanwhile the attacker just got your architecture diagram delivered with a smile. Something to think about.

The harmless prompt injection that leaked our system architecture