mechanistic-interpretability - Tags

The AI refusal switch may live in 0.1% of neurons

SP-209 2026-05-20 · Nous Research on X

Nous Research proposes CNA, a method that uses contrastive prompts to locate a tiny set of MLP neurons (roughly 0.1% of neurons, per the headline) tied to refusal behavior. The interesting point is not only the jailbreak angle but what the result implies about alignment fine-tuning.
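
To make the contrastive idea concrete, here is a minimal sketch of one generic way to rank neurons by how differently they fire on two prompt sets. It is not Nous Research's CNA procedure, whose details are not given here. It assumes a plain difference of mean activations as the score, uses synthetic activations in place of a real model's MLP outputs, and plants four "refusal" neurons by hand; the function name `top_contrastive_neurons` and all array names are hypothetical.

```python
import numpy as np

def top_contrastive_neurons(acts_refuse, acts_comply, fraction=0.001):
    """Return indices of the neurons whose mean activation differs most
    between two prompt sets, plus the per-neuron difference.

    acts_*: arrays of shape (n_prompts, n_neurons).
    ASSUMPTION: mean difference is the score; a real method may differ.
    """
    diff = acts_refuse.mean(axis=0) - acts_comply.mean(axis=0)
    k = max(1, int(round(fraction * diff.size)))
    # argpartition picks the k largest |diff| values without a full sort
    idx = np.argpartition(-np.abs(diff), k - 1)[:k]
    return np.sort(idx), diff

# Synthetic stand-in for MLP activations (no real model involved).
rng = np.random.default_rng(0)
n_prompts, n_neurons = 64, 4000
planted = np.array([17, 1203, 2999, 3500])  # hand-planted "refusal" neurons

acts_comply = rng.normal(size=(n_prompts, n_neurons))
acts_refuse = rng.normal(size=(n_prompts, n_neurons))
acts_refuse[:, planted] += 5.0  # these neurons fire more on refusal prompts

selected, diff = top_contrastive_neurons(acts_refuse, acts_comply, fraction=0.001)
print(selected)
```

With 4000 neurons and a fraction of 0.001, the sketch keeps four neurons. In a real setting the activations would come from hooks on a model's MLP layers, and the selected set would then be tested by ablating or scaling those neurons and checking whether refusal behavior changes.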