frustrated with AI guardrails after red teaming - need advice
spent months building guardrails for our models. prompt filters, jailbreak detection, some fine-tuning on top. looked solid in testing then we ran red teaming and things started slipping through faster than expected. small variations in phrasing were enough to bypass controls that seemed reliable before.
after tightening things up, we ended up with a different problem. more false positives, legitimate queries getting blocked, and overall worse user experience. it feels like we’re trading one failure mode for another.
rn it’s not very clear what a stable setup should even look like. the more we lock things down, the less useful the system becomes. but leaving it loose obviously isn’t an option either.trying to find a balance between control and usability without constantly reacting to new bypasses.
how others adjusted their guardrails after red teaming exposed these gaps?