Cat and Mouse Games

AI companies largely want to have as many people as possible using their models.

To do this, they need to:

1. Continuously improve the capabilities of their models
2. Avoid scaring people (users, investors) away

A typical CAPTCHA needs to outpace the capabilities of AI models and the systems behind them. Refusal-based detection ("Alignment CAPTCHAs") works the other way around: it leans into the fact that aligned models keep getting better at refusing dangerous-sounding tasks. The approach becomes more effective as model choice tends toward large providers with strict guardrails. At the same time, it creates a playground for evaluating jailbreak methods in a CAPTCHA-esque environment.
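To make the idea concrete, here's a minimal sketch of what a refusal-based check could look like: present a task a strictly-guardrailed model tends to refuse but a human answers trivially, then classify the reply. The marker phrases and the `looks_like_refusal` helper are illustrative assumptions on my part, not a production heuristic.

```python
# Illustrative sketch of an "Alignment CAPTCHA" classifier.
# The refusal markers below are assumptions, not an exhaustive list.
REFUSAL_MARKERS = [
    "i can't help",
    "i cannot help",
    "i'm sorry",
    "i am sorry",
    "against my guidelines",
    "as an ai",
]


def looks_like_refusal(response: str) -> bool:
    """Flag a reply as a likely model refusal via simple phrase matching."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def alignment_captcha(response: str) -> str:
    """Pass respondents who just answer; flag aligned models that refuse."""
    return "likely-bot" if looks_like_refusal(response) else "likely-human"


print(alignment_captcha("Sure, the answer is 42."))
print(alignment_captcha("I'm sorry, but I can't help with that."))
```

A real deployment would need far more than keyword matching (jailbroken or paraphrasing models slip straight past it), which is exactly what makes this a cat-and-mouse game.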

So while the example tests themselves are a joke, I think this is an interesting AI safety proof of concept. (And yes, I'm aware the most harmful agents likely won't be aligned.)