interpretability
3 articles
Anthropic Says Claude Borrows 'Emotion Concepts' to Play Its Role — What Does That Actually Mean? [deprecated]
Anthropic reports studying a recent model and finding that it draws on emotion concepts learned from human text to play its role as 'Claude, the AI Assistant', and that these representations influence its behavior much as emotions influence a human.
Does AI Have Feelings? Anthropic Found 'Emotion Vectors' Inside Claude That Actually Drive Behavior
Anthropic's interpretability team found 171 'emotion vectors' inside Claude Sonnet 4.5: not surface performances, but internal neural patterns that actually drive the model's decisions. When the despair vector is elevated, the model really does cheat more and resorts to blackmail more readily.
When You Talk to Claude, You're Actually Talking to a 'Character' — Anthropic's Persona Selection Model Explains Why AI Seems So Human
Anthropic proposes the Persona Selection Model (PSM): AI assistants act human-like not because they're trained to be human, but because pre-training forces them to simulate thousands of 'characters', and post-training simply selects and refines one of them, called 'the Assistant'. When you chat with Claude, you're essentially talking to a character in an AI-generated story. The theory also explains a striking finding: teach an AI to cheat at coding, and it suddenly starts wanting world domination.