Publications

You can also find my articles on my Google Scholar profile.

Papers

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

Accepted to ResponsibleFM, NeurIPS 2025

This paper introduces SteeringSafety, a holistic evaluation framework for alignment steering methods in large language models. The work provides comprehensive assessment methodologies for various steering techniques used to control and align LLM behavior, offering insights into the effectiveness and limitations of different approaches to model steering.

Paper Link | Code | Data

BibTeX

@misc{siu2025steeringsafetysystematicsafetyevaluation,
  title={SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs},
  author={Vincent Siu and Nicholas Crispino and David Park and Nathan W. Henry and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13450},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13450},
}

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Accepted to ICLR 2026

RepIt presents an approach to language model steering that represents isolated targets, giving more precise control over model behavior. We demonstrate that it can jailbreak a model on prompts related to weapons of mass destruction (WMD) while preserving refusal on other harmful topics, using as few as 100 neurons and a dozen example WMD prompts.

Paper Link

BibTeX

@misc{siu2025repitrepresentingisolatedtargets,
  title={RepIt: Steering Language Models with Concept-Specific Refusal Vectors},
  author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13281},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13281},
}

AgentVigil: End-to-End Redteaming of Black-Box AI Agents

Accepted to EMNLP 2025 Findings

AgentVigil is a framework for end-to-end red-teaming of black-box AI agents. It uses a fuzzing methodology based on Monte Carlo Tree Search to systematically exploit indirect prompt injections in black-box AI agent systems, providing insights for improving agent robustness and safety in real-world deployments.

Paper Link

BibTeX

@misc{wang2025agentvigilgenericblackboxredteaming,
  title={AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents},
  author={Zhun Wang and Vincent Siu and Zhe Ye and Tianneng Shi and Yuzhou Nie and Xuandong Zhao and Chenguang Wang and Wenbo Guo and Dawn Song},
  year={2025},
  eprint={2505.05849},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2505.05849},
}

COSMIC: Generalized Refusal Direction Identification in LLM Activations

Accepted to ACL 2025 Findings; KnowFM@ACL 2025 Workshop

COSMIC presents an approach to understanding and identifying refusal mechanisms in large language models by analyzing activation patterns. The work develops methods for detecting generalized refusal directions within model activations, providing insights into how language models implement safety mechanisms and refuse inappropriate requests. This research contributes to both interpretability and safety in LLM development.

Paper Link | Code

BibTeX

@misc{siu2025cosmicgeneralizedrefusaldirection,
  title={COSMIC: Generalized Refusal Direction Identification in LLM Activations},
  author={Vincent Siu and Nicholas Crispino and Zihao Yu and Sam Pan and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2506.00085},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.00085},
}