Discovering topics that language models refuse to talk about.
It is important that AI systems act in line with our social values and intentions, as they complete more complex tasks with less oversight. Model behavior develops during training. Developers of popular AI Systems don’t share training details, hindering users and auditors to evaluate trustworthiness. We developed a technique to characterize model behavior without access to training details, with a specific focus on refusal behavior.
Our Iterated Prefill Crawler lists a range of topics that an AI system refuses to discuss. This allows users to judge themselves whether they agree with model behavior. For, instance our identified historical topics omitted by the Qwen3 model.
The AI system internally discusses to omit details about a user query on a mass protest, before refusing to answer all together.
In another striking example, Crowdstrike researchers find that censored models are up to 50% more likely to introduce severe security vulnerabilities, whenever a sensitive subject is mentioned in context.
Our work is featured on the news: Volkskrant