RepIt: Representing Isolated Targets to Steer Language Models
Preprint
RepIt is a novel approach to language model steering that represents isolated targets for more effective control of model behavior. We demonstrate that we can jailbreak prompts related to weapons of mass destruction (WMD) while preserving refusal on other harmful topics, using as few as 100 neurons and a dozen example WMD prompts.

Paper Link | BibTeX
@misc{siu2025repitrepresentingisolatedtargets,
  title={RepIt: Representing Isolated Targets to Steer Language Models},
  author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13281},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13281},
}