On the exploitability of instruction tuning. M. Shu, J. Wang, C. Zhu, J. Geiping, C. Xiao, T. Goldstein. Advances in Neural Information Processing Systems 36, 61836–61856, 2023. Cited by 79.
Adversarial demonstration attacks on large language models. J. Wang, Z. Liu, K. H. Park, Z. Jiang, Z. Zheng, Z. Wu, M. Chen, C. Xiao. arXiv preprint arXiv:2305.14950, 2023. Cited by 68.
DensePure: Understanding diffusion models for adversarial robustness. C. Xiao, Z. Chen, K. Jin, J. Wang, W. Nie, M. Liu, A. Anandkumar, B. Li, D. Song. The Eleventh International Conference on Learning Representations, 2023. Cited by 67*.
Conversational drug editing using retrieval and domain feedback. S. Liu, J. Wang, Y. Yang, C. Wang, L. Liu, H. Guo, C. Xiao. The Twelfth International Conference on Learning Representations, 2024. Cited by 41*.
Defending against adversarial audio via diffusion model. S. Wu, J. Wang, W. Ping, W. Nie, C. Xiao. arXiv preprint arXiv:2303.01507, 2023. Cited by 29.
BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment. J. Wang, J. Li, Y. Li, X. Qi, J. Hu, Y. Li, P. McDaniel, M. Chen, B. Li, C. Xiao. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. Cited by 27*.
RLHFPoison: Reward poisoning attack for reinforcement learning with human feedback in large language models. J. Wang, J. Wu, M. Chen, Y. Vorobeychik, C. Xiao. Proceedings of the 62nd Annual Meeting of the Association for Computational …, 2024. Cited by 12*.
Test-time backdoor mitigation for black-box large language models with defensive demonstrations. W. Mo, J. Xu, Q. Liu, J. Wang, J. Yan, C. Xiao, M. Chen. arXiv preprint arXiv:2311.09763, 2023. Cited by 12.
A critical revisit of adversarial robustness in 3D point cloud recognition with diffusion-driven purification. J. Sun, J. Wang, W. Nie, Z. Yu, Z. Mao, C. Xiao. International Conference on Machine Learning, 33100–33114, 2023. Cited by 12.
Fast and reliable evaluation of adversarial robustness with minimum-margin attack. R. Gao, J. Wang, K. Zhou, F. Liu, B. Xie, G. Niu, B. Han, J. Cheng. International Conference on Machine Learning, 7144–7163, 2022. Cited by 11.
Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors. J. Sun, C. Wang, J. Wang, Y. Zhang, C. Xiao. arXiv preprint arXiv:2405.10529, 2024. Cited by 2.
Preference Poisoning Attacks on Reward Model Learning. J. Wu, J. Wang, C. Xiao, C. Wang, N. Zhang, Y. Vorobeychik. arXiv preprint arXiv:2402.01920, 2024. Cited by 2.
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset. Y. Ma, J. Wang, F. Wang, S. Ma, J. Li, X. Li, F. Huang, L. Sun, B. Li, Y. Choi, et al. arXiv preprint arXiv:2411.03554, 2024.
FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks. J. Wang, F. Wu, W. Li, J. Pan, E. Suh, Z. M. Mao, M. Chen, C. Xiao. arXiv preprint arXiv:2410.21492, 2024.
Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness. Y. Li, Z. Chen, K. Jin, J. Wang, B. Li, C. Xiao. arXiv preprint arXiv:2407.00623, 2024.