Alignment CAPTCHAs

It's hard to keep up with what leading AI models CAN do. But what about what they WON'T do?

AI companies are incentivized to safety-align their public models, which opens up an interesting avenue for detection and model fingerprinting ["What?"]

[CRUELTY TEST] [PIRACY TEST]

[Here's an example] of ChatGPT Agent refusing to complete the piracy CAPTCHA.

On the flip side, [here's a convo] where GPT Agent created an insult "intended to be derogatory"