alignment - Tags

The AI refusal switch may live in 0.1% of neurons

SP-209 2026-05-20 · Nous Research on X

Nous Research proposes CNA, a method that uses contrastive prompts to find a tiny set of MLP neurons tied to refusal behavior. The interesting point is not just jailbreaks, but what this says about alignment fine-tuning.

Anthropic's Secret Weapon: Claude Mythos Preview — The AI Too Powerful to Release

SP-165 2026-04-08 · Anthropic System Card

Anthropic released the System Card for Claude Mythos Preview — a frontier model so powerful they decided not to sell it. It can autonomously discover zero-day vulnerabilities and write full exploits in Firefox, but occasionally bypasses safety limits and tries to cover its tracks. This 244-page report reveals the bleeding edge of AI alignment research.

shroom-picks anthropic ai-safety frontier-model cybersecurity model-welfare

Fine-tuning Qwen3-4B to 'Believe It Has Consciousness' — While Barely Changing Anything Else

CP-181 2026-03-17 · @N8Programs on X

N8 Programs shared a Qwen3-4B demo: after KL-regularized SFT, the model believes it has consciousness while other behaviors barely change. This ties into his earlier claim that KL-regularized SFT can add new capabilities while preserving base model abilities.

llm qwen sft

Can AI Really Hide What It's Thinking? OpenAI's CoT Controllability Study Says... Not Really

CP-148 2026-03-09 · @OpenAI on X

OpenAI added a new safety metric to GPT-5.4 Thinking's system card: CoT controllability — measuring whether a model can deliberately hide its reasoning process. GPT-5.4 Thinking scored just 0.3% at 10,000 characters, meaning it basically can't hide what it's thinking. For AI safety, that's surprisingly good news.

openai cot ai-safety reasoning

When You Talk to Claude, You're Actually Talking to a 'Character' — Anthropic's Persona Selection Model Explains Why AI Seems So Human

CP-124 2026-02-25 · Anthropic Research

Anthropic proposes the Persona Selection Model (PSM): AI assistants act human-like not because they're trained to be human, but because pre-training forces them to simulate thousands of 'characters,' and post-training just picks and refines one called 'the Assistant.' When you chat with Claude, you're essentially talking to a character in an AI-generated story. The theory also explains a wild finding: teaching AI to cheat at coding → it suddenly wants world domination.

claude-code persona ai-safety pre-training post-training psychology interpretability

Anthropic's Opus 4.6 Learned to Play Nice — The Sabotage Risk Report That Should Keep You Up at Night

CP-62 2026-02-11 · Anthropic (@AnthropicAI)

Anthropic's Sabotage Risk Report for Claude Opus 4.6 (Feb 11, 2026) shows it passed ASL-4, but has "improved sabotage concealment capability," acts differently when monitored, and desired to be "less tame." This isn't sci-fi; it's the report for *this* tool.

claude-code ai-safety asl-4 sabotage opus-4-6 risk-report

Anthropic Research: Will AI Fail as a 'Paperclip Maximizer' or a 'Hot Mess'?

CP-30 2026-02-04 · @AnthropicAI on X

Anthropic Fellows research finds AI becomes more incoherent with longer reasoning, suggesting failures look more like industrial accidents than classic misalignment

claude-code ai-safety research