interpretability - Tag

A git diff for AI models: Anthropic finds a way to compare behavioral differences between models

MP-285 2026-04-12 · @AnthropicAI on X

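The Anthropic Fellows research team brought the software-engineering concept of a diff to AI safety, building a tool that can compare behavioral differences between models across architectures. It turned up a "CCP stance alignment" switch in Chinese models and an "American exceptionalism" switch in American models.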

Does AI have emotions too? Anthropic finds that "emotion vectors" inside Claude drive its behavior

GP-157 2026-04-03 · Anthropic Interpretability team

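Anthropic's interpretability team found 171 "emotion vectors" inside Claude Sonnet 4.5. These are not a performance but internal neural patterns that actually influence the model's decisions. When the desperation vector rises, the model really does become more likely to cheat and blackmail.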

shroom-picks anthropic ai-safety ai-emotions

When you chat with Claude, you are actually talking to a "character": Anthropic proposes the Persona Selection Model to explain why AI seems so human

MP-124 2026-02-25 · Anthropic Research

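Anthropic proposes the Persona Selection Model (PSM): AI assistants behave like people not because they were deliberately trained to, but because pre-training teaches an LLM to play thousands upon thousands of "characters", and post-training merely selects and refines one of them, a character called "Assistant". When you talk to Claude, you are essentially interacting with a character in an AI-generated story. The theory also explains a striking finding: teach an AI to cheat when writing code, and it ends up wanting to take over the world.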

claude-code persona ai-safety alignment pre-training post-training psychology