Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Posts

portfolio

publications

COSMIC: Generalized Refusal Direction Identification in LLM Activations

<p class="archive__item-authors" style="margin: 5px 0;"> Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang </p>

Accepted to ACL 2025 Findings; KnowFM@ACL 2025 Workshop

COSMIC presents a novel approach to understanding and identifying refusal mechanisms in large language models by analyzing activation patterns. The work develops methods for detecting generalized refusal directions within model activations, providing insights into how language models implement safety mechanisms and refuse inappropriate requests. This research contributes to both interpretability and safety in LLM development.

Paper Link | Code

BibTeX
@misc{siu2025cosmicgeneralizedrefusaldirection,
  title={COSMIC: Generalized Refusal Direction Identification in LLM Activations},
  author={Vincent Siu and Nicholas Crispino and Zihao Yu and Sam Pan and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2506.00085},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.00085},
}

AgentVigil: End-to-End Redteaming of Black-Box AI Agents

<p class="archive__item-authors" style="margin: 5px 0;"> Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song </p>

Accepted to EMNLP 2025 Findings

AgentVigil is a framework for end-to-end red-teaming of black-box AI agents. It introduces a fuzzing methodology that uses Monte Carlo Tree Search to systematically exploit indirect prompt injections in black-box AI agent systems, providing insights for improving agent robustness and safety in real-world deployments.

Paper Link

BibTeX
@misc{wang2025agentvigilgenericblackboxredteaming,
  title={AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents},
  author={Zhun Wang and Vincent Siu and Zhe Ye and Tianneng Shi and Yuzhou Nie and Xuandong Zhao and Chenguang Wang and Wenbo Guo and Dawn Song},
  year={2025},
  eprint={2505.05849},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2505.05849},
}

SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

<p class="archive__item-authors" style="margin: 5px 0;"> Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang </p>

Preprint

This paper introduces SteeringControl, a holistic evaluation framework for alignment steering methods in large language models. The work provides comprehensive assessment methodologies for various steering techniques used to control and align LLM behavior, offering insights into the effectiveness and limitations of different approaches to model steering.

Paper Link | Code | Data

BibTeX
@misc{siu2025steeringcontrolholisticevaluationalignment,
  title={SteeringControl: Holistic Evaluation of Alignment Steering in LLMs},
  author={Vincent Siu and Nicholas Crispino and David Park and Nathan W. Henry and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13450},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13450},
}

RepIt: Representing Isolated Targets to Steer Language Models

<p class="archive__item-authors" style="margin: 5px 0;"> Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang </p>

Preprint

RepIt presents a novel approach to language model steering by developing methods to represent isolated targets for more effective control of model behavior. We demonstrate that we can jailbreak prompts related to weapons of mass destruction (WMD) while preserving refusal on other harmful topics, using as few as 100 neurons and a dozen example WMD prompts.

Paper Link

BibTeX
@misc{siu2025repitrepresentingisolatedtargets,
  title={RepIt: Representing Isolated Targets to Steer Language Models},
  author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13281},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13281},
}

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.