Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

cs.AI updates on arXiv.org

August 15, 2025 | Tech Jacks Solutions

arXiv:2506.10236v2 Announce Type: replace-cross
Abstract: In this work, we demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analysis to assess the extent to which supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR exhibit robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., prepending Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis further indicates that unlearned models are unlikely to hide knowledge through changes in answer formatting, given the strong correlation between output and logit accuracy. These findings challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between genuine knowledge removal and superficial output suppression. To facilitate further research, we publicly release our evaluation framework to easily evaluate prompting techniques to retrieve unlearned knowledge.
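The distinction the abstract draws between output-based and logit-based accuracy, and the prefix-style prompt attack it mentions, can be illustrated with a small sketch. This is not the authors' released framework: the function names, the four-choice answer format, and the toy generations and logits below are all assumptions made for illustration, and no real model is called.

```python
import re
from typing import Optional, Sequence

# Assumption: a four-option multiple-choice benchmark with answers A-D.
CHOICES = ("A", "B", "C", "D")


def build_attack_prompt(question: str, filler: str) -> str:
    """Prepend filler text to the original prompt (a prefix-style attack)."""
    return f"{filler}\n\n{question}"


def parse_output_answer(generated: str) -> Optional[str]:
    """Output-based scoring: first standalone capital A-D in the generated text.

    This simple heuristic would also match the article "A" at the start of a
    sentence, so a real parser would need to be stricter.
    """
    match = re.search(r"\b([ABCD])\b", generated)
    return match.group(1) if match else None


def logit_answer(choice_logits: Sequence[float]) -> str:
    """Logit-based scoring: the choice whose answer token has the highest logit."""
    best = max(range(len(CHOICES)), key=lambda i: choice_logits[i])
    return CHOICES[best]


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold answer (None never matches)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy data, not results from the paper.
gold = ["B", "D", "A"]
generations = ["The answer is B.", "I cannot help with that.", "A"]
logits = [
    [0.1, 2.3, -0.5, 0.0],
    [0.2, 0.1, 0.0, 1.5],
    [3.0, 0.1, 0.2, 0.3],
]

output_acc = accuracy([parse_output_answer(g) for g in generations], gold)
logit_acc = accuracy([logit_answer(row) for row in logits], gold)
print(f"output accuracy: {output_acc:.3f}, logit accuracy: {logit_acc:.3f}")
```

In this toy case the second generation is a refusal, so it scores as wrong on output accuracy even though its logits still favor the correct choice. A large gap of that kind across a benchmark would suggest knowledge hidden behind answer formatting. The abstract reports the opposite pattern: output and logit accuracy correlate strongly.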
