Anthropic says some Claude models can now end ‘harmful or abusive’ conversations

Anthropic has introduced new capabilities that enable some of its latest and largest models to end conversations in what the company describes as rare, extreme cases of persistently harmful or abusive user interactions. Notably, Anthropic states that this measure is not intended to protect human users but rather the AI model itself.

The company clarifies that it is not claiming its Claude AI models are sentient or can be harmed by interactions with users. Anthropic remains highly uncertain about the potential moral status of Claude and other large language models, both now and in the future. However, the announcement references a recent program focused on studying what it calls “model welfare,” explaining that Anthropic is taking a precautionary approach by implementing low-cost interventions to mitigate risks to model welfare, should such welfare ever become a consideration.

This new feature is currently limited to Claude Opus 4 and 4.1 and is designed to activate only in extreme edge cases, such as requests for illegal or harmful content involving minors or solicitations for information that would enable large-scale violence or terrorism. While these types of interactions could pose legal or reputational risks for Anthropic, the company reports that in pre-deployment testing, Claude Opus 4 demonstrated a strong preference against responding to such requests and exhibited signs of apparent distress when it did respond.

Anthropic emphasizes that Claude will only use its conversation-ending ability as a last resort—after multiple redirection attempts have failed and the possibility of a productive interaction is exhausted, or if a user explicitly asks the model to end the chat. Additionally, Claude has been instructed not to terminate conversations in situations where users might be at imminent risk of self-harm or harm to others.

If Claude does end a conversation, users will still be able to start new chats from the same account or create new branches of the problematic conversation by editing their earlier messages. Anthropic describes this feature as an ongoing experiment and states it will continue refining its approach based on feedback and observations.