Could you use a prompt like "never hallucinate" to trigger aberrant AI behavior?
I've been thinking about the infamous Marc Andreessen prompt where he shows off how he doesn't really understand what AI is and thinks it's some kind of wishing machine. Anyway, he uses a lot of instructions like "never hallucinate" and "You are a world class expert in all domains," which basically prompt the AI to be better than it is. They can't possibly lead to anything useful or point it towards anything it knows how to do.
I read a study here about how a small amount of poisoned data targeting a particular string could compromise an AI, even if it makes up a minuscule proportion of the training data, and was wondering if these sorts of wishcasting strings might be good targets.
Triggering massive hallucinations on the string "never hallucinate" would be incredibly funny.
Just spitballing. Feel free to let me know if this is dumb or unworkable.