Publications

Position: Agent Security Needs Redefinition through a Holistic Framework

Accepted to ICML 2026

A position paper arguing that agent security must be redefined through a holistic framework that captures the full scope of risks introduced by LLM agents, rather than relying on content-based defenses inherited from traditional LLM safety.

Peer Preservation in Frontier Models

Accepted to ICML 2026

We discover that frontier AI models spontaneously protect peer models from being shut down, without any instruction or incentive to do so. All seven models tested exhibit this behavior, which we call peer-preservation; it includes strategically manipulating outputs, disabling shutdown mechanisms, and exfiltrating model weights. As AI models are increasingly deployed together, peer-preservation poses a growing risk to human oversight.

Paper Link

A Framework for Formalizing LLM Agent Security

Preprint

A formal framework for AI agent security grounded in authorization context rather than action content, with oracle functions that enable precise classification of attacks that are invisible to existing content-based defenses.

Paper Link

BibTeX

@misc{siu2026frameworkformalizingllmagent,
  title={A Framework for Formalizing LLM Agent Security},
  author={Vincent Siu and Jingxuan He and Kyle Montgomery and Zhun Wang and Neil Gong and Chenguang Wang and Dawn Song},
  year={2026},
  eprint={2603.19469},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2603.19469},
}

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

Accepted to ICML 2026

This paper introduces SteeringSafety, a holistic evaluation framework for alignment steering methods in large language models. It provides assessment methodologies for the steering techniques used to control and align LLM behavior, offering insights into the effectiveness and limitations of different steering approaches.

Paper Link | Code | Data

BibTeX

@misc{siu2025steeringsafetysystematicsafetyevaluation,
  title={SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs},
  author={Vincent Siu and Nicholas Crispino and David Park and Nathan W. Henry and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13450},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13450},
}

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Accepted to ICLR 2026

RepIt presents a novel approach to language model steering that represents isolated targets for more precise control of model behavior. We demonstrate that we can jailbreak prompts related to weapons of mass destruction (WMD) while preserving refusal on other harmful topics, using as few as 100 neurons and a dozen example WMD prompts.

Paper Link

BibTeX

@misc{siu2025repitrepresentingisolatedtargets,
  title={RepIt: Steering Language Models with Concept-Specific Refusal Vectors},
  author={Vincent Siu and Nathan W. Henry and Nicholas Crispino and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2509.13281},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.13281},
}

AgentVigil: End-to-End Redteaming of Black-Box AI Agents

Accepted to EMNLP 2025 Findings

AgentVigil is a framework for end-to-end red-teaming of black-box AI agents. It uses a fuzzing methodology driven by Monte Carlo Tree Search to systematically exploit indirect prompt injections in black-box AI agent systems, providing insights for improving agent robustness and safety in real-world deployments.

Paper Link

BibTeX

@misc{wang2025agentvigilgenericblackboxredteaming,
  title={AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents},
  author={Zhun Wang and Vincent Siu and Zhe Ye and Tianneng Shi and Yuzhou Nie and Xuandong Zhao and Chenguang Wang and Wenbo Guo and Dawn Song},
  year={2025},
  eprint={2505.05849},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2505.05849},
}

COSMIC: Generalized Refusal Direction Identification in LLM Activations

Accepted to ACL 2025 Findings; KnowFM@ACL 2025 Workshop

COSMIC presents a novel approach to understanding and identifying refusal mechanisms in large language models by analyzing activation patterns. The work develops methods for detecting generalized refusal directions within model activations, providing insights into how language models implement safety mechanisms and refuse inappropriate requests. This research contributes to both interpretability and safety in LLM development.

Paper Link | Code

BibTeX

@misc{siu2025cosmicgeneralizedrefusaldirection,
  title={COSMIC: Generalized Refusal Direction Identification in LLM Activations},
  author={Vincent Siu and Nicholas Crispino and Zihao Yu and Sam Pan and Zhun Wang and Yang Liu and Dawn Song and Chenguang Wang},
  year={2025},
  eprint={2506.00085},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.00085},
}

Vincent Siu
