SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
Accepted to ResponsibleFM, NeurIPS 2025
This paper introduces SteeringSafety, a systematic framework for evaluating the safety of representation steering methods in large language models. The framework provides comprehensive assessment methodologies for steering techniques used to control and align LLM behavior, offering insights into the effectiveness and limitations of different approaches to model steering.

Paper Link | Code | Data | BibTeX
```
@misc{siu2025steeringsafetysystematicsafetyevaluation,
  title={SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs},
  author={Vincent Siu and Nicholas Crispino and David Park and Nathan W. Henry and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13450},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13450},
}
```
