Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Posts

portfolio

publications

COSMIC: Generalized Refusal Direction Identification in LLM Activations

Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang

Accepted to ACL 2025 Findings; KnowFM@ACL 2025 Workshop

COSMIC presents a novel approach to understanding and identifying refusal mechanisms in large language models by analyzing activation patterns. The work develops methods for detecting generalized refusal directions within model activations, providing insights into how language models implement safety mechanisms and refuse inappropriate requests. This research contributes to both interpretability and safety in LLM development.

Paper Link | Code

BibTeX

@misc{siu2025cosmicgeneralizedrefusaldirection,
  title={COSMIC: Generalized Refusal Direction Identification in LLM Activations},
  author={Vincent Siu and Nicholas Crispino and Zihao Yu and Sam Pan and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2506.00085},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.00085},
}

AgentVigil: End-to-End Redteaming of Black-Box AI Agents

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, Dawn Song

Accepted to EMNLP 2025 Findings

AgentVigil introduces a comprehensive framework for end-to-end red-teaming of black-box AI agents. It uses a fuzzing methodology guided by Monte Carlo Tree Search to systematically exploit indirect prompt injections in black-box agent systems, providing insights for improving agent robustness and safety in real-world deployments.

Paper Link

BibTeX

@misc{wang2025agentvigilgenericblackboxredteaming,
  title={AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents},
  author={Zhun Wang and Vincent Siu and Zhe Ye and Tianneng Shi and Yuzhou Nie and Xuandong Zhao and Chenguang Wang and Wenbo Guo and Dawn Song},
  year={2025},
  eprint={2505.05849},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2505.05849},
}

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang

Accepted to ICLR 2026

RepIt presents an approach to language model steering that represents isolated targets for more effective control of model behavior. We demonstrate that we can jailbreak prompts related to weapons of mass destruction (WMD) while preserving refusal on other harmful topics, using as few as 100 neurons and a dozen example WMD prompts.

Paper Link

BibTeX

@misc{siu2025repitrepresentingisolatedtargets,
  title={RepIt: Steering Language Models with Concept-Specific Refusal Vectors},
  author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13281},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13281},
}

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

Vincent Siu*, Nicholas Crispino*, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang

Accepted to ResponsibleFM, NeurIPS 2025

This paper introduces SteeringSafety, a holistic evaluation framework for alignment steering methods in large language models. The work provides comprehensive assessment methodologies for various steering techniques used to control and align LLM behavior, offering insights into the effectiveness and limitations of different approaches to model steering.

Paper Link | Code | Data

BibTeX

@misc{siu2025steeringsafetysystematicsafetyevaluation,
  title={SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs},
  author={Vincent Siu and Nicholas Crispino and David Park and Nathan W. Henry and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13450},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13450},
}

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.

Vincent Siu

Sitemap

Pages

Page Not Found

Archive Layout with Content

Posts by Category

Posts by Collection

CV

CV

Markdown

Page not in menu

Page Archive

Portfolio

Publications

Sitemap

Posts by Tags

Talk map

Talks and presentations

Teaching

Terms and Privacy Policy

Blog posts

Jupyter notebook markdown generator

Posts

portfolio

publications

COSMIC: Generalized Refusal Direction Identification in LLM Activations

AgentVigil: End-to-End Redteaming of Black-Box AI Agents

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

talks

Talk 1 on Relevant Topic in Your Field

Tutorial 1 on Relevant Topic in Your Field

Talk 2 on Relevant Topic in Your Field

Conference Proceeding talk 3 on Relevant Topic in Your Field

teaching

Teaching experience 1

Teaching experience 2