COSMIC: Generalized Refusal Direction Identification in LLM Activations
Accepted to ACL 2025 Findings; KnowFM@ACL 2025 Workshop
COSMIC is an approach to identifying refusal mechanisms in large language models by analyzing activation patterns. The work develops methods for detecting generalized refusal directions within model activations, offering insight into how language models implement safety mechanisms and refuse inappropriate requests. It contributes to both interpretability and safety research for LLMs.
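
To make the idea of a "refusal direction" in activation space concrete, here is a minimal sketch of a generic difference-of-means baseline. It is not COSMIC's procedure, which is described in the paper. The synthetic activations, the function names `refusal_direction` and `ablate`, and the planted direction are all assumptions made for illustration; no real model is involved.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean activation to the harmful mean.

    Hypothetical helper: a generic difference-of-means baseline,
    not the selection method proposed in the paper.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return diff / norm


def ablate(acts, direction):
    """Remove each activation's component along a unit-norm direction."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in for residual-stream activations (assumption: real
# activations would be collected from a model on two prompt sets).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[3] = 1.0  # planted "refusal" axis, for illustration only

harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir

direction = refusal_direction(harmful, harmless)
cosine = float(direction @ true_dir)  # alignment with the planted axis
ablated = ablate(harmful, direction)  # activations with the direction removed
```

In this toy setting the estimated direction should align closely with the planted axis, and the ablated activations have no remaining component along it. With a real model, the choice of layer and token position would also matter, and this sketch does not address either.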

@misc{siu2025cosmicgeneralizedrefusaldirection,
  title={COSMIC: Generalized Refusal Direction Identification in LLM Activations},
  author={Vincent Siu and Nicholas Crispino and Zihao Yu and Sam Pan and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2506.00085},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.00085},
}