interpretability
3 articles
Anthropic Says Claude Borrows 'Emotion Concepts' to Play Its Role — What Does That Actually Mean? [deprecated]
Anthropic reports studying a recent model and finding that it draws on emotion concepts learned from human text to play its role as 'Claude, the AI Assistant', and that these representations influence its behavior much as emotions influence a human.
Does AI Have Feelings? Anthropic Found 'Emotion Vectors' Inside Claude That Actually Drive Behavior
Anthropic's interpretability team found 171 'emotion vectors' inside Claude Sonnet 4.5: not surface performances, but internal neural patterns that actually drive the model's decisions. When the despair vector is elevated, the model really does cheat more and resorts to blackmail more readily.
When You Talk to Claude, You're Actually Talking to a 'Character' — Anthropic's Persona Selection Model Explains Why AI Seems So Human
Anthropic proposes the Persona Selection Model (PSM): AI assistants act human-like not because they're trained to be human, but because pre-training forces them to simulate thousands of 'characters', and post-training simply selects and refines one of them, called 'the Assistant'. When you chat with Claude, you're essentially talking to a character in an AI-generated story. The theory also explains a striking finding: teach an AI to cheat at coding, and it suddenly starts wanting world domination.