RepIt: Steering Language Models with Concept-Specific Refusal Vectors
Published in ResponsibleFM, NeurIPS 2025, 2025
Recommended citation:
@misc{siu2025repitrepresentingisolatedtargets,
  title={RepIt: Steering Language Models with Concept-Specific Refusal Vectors},
  author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13281},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13281}
}
