RepIt: Steering Language Models with Concept-Specific Refusal Vectors
Accepted to ResponsibleFM, NeurIPS 2025
RepIt is an approach to language model steering that represents isolated targets, yielding concept-specific refusal vectors for finer control of model behavior. We show that it can jailbreak a model on weapons of mass destruction (WMD) prompts while preserving refusal on other harmful topics, using as few as 100 neurons and a dozen example WMD prompts.

Paper Link | BibTeX
@misc{siu2025repitrepresentingisolatedtargets,
  title={RepIt: Steering Language Models with Concept-Specific Refusal Vectors},
  author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13281},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13281},
}



