A new study posted on LessWrong examines how large language models internally represent refusal. The researchers map how a model decides to decline a prompt and why specific wording triggers a refusal. The goal is to move beyond trial-and-error prompting and give developers a technical framework for controlling model alignment and safety guardrails more precisely.