Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
AI updates on arXiv.org

January 19, 2026 | Tech Jacks Solutions

arXiv:2501.18100v2 Announce Type: replace-cross
Abstract: Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile: with a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we conduct further experiments and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it leads to a degradation in the model's fine-tuning performance. To address this degradation, we further propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning. Panacea maintains the model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks, and mainstream LLMs, where the average harmful scores are reduced by up to 21.2% while fine-tuning performance is maintained. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety affinity, which coincides with findings from several previous studies. Source code is available at https://github.com/w-yibo/Panacea.
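
To make the "purely random perturbation" baseline from the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the noise distribution or its magnitude, so the Gaussian noise, the `scale` parameter, the function name `add_random_perturbation`, and the toy weight dictionary are all assumptions made for illustration. Panacea's adaptive perturbation is optimized rather than sampled and is not shown here.

```python
import random
from typing import Dict, List

# Toy stand-in for model parameters: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def add_random_perturbation(weights: Weights, scale: float, seed: int = 0) -> Weights:
    """Return a copy of `weights` with i.i.d. Gaussian noise added to every entry.

    Assumption: zero-mean Gaussian noise with standard deviation `scale`.
    The input dictionary is left unmodified.
    """
    rng = random.Random(seed)
    return {
        name: [w + rng.gauss(0.0, scale) for w in values]
        for name, values in weights.items()
    }


# Hypothetical fine-tuned weights, for illustration only.
fine_tuned = {"layer0.weight": [0.5, -1.2, 0.3], "layer1.weight": [2.0, 0.1]}
perturbed = add_random_perturbation(fine_tuned, scale=0.01)
```

In this sketch a larger `scale` moves the weights further from the fine-tuned solution. That mirrors the trade-off the abstract describes, where random noise can undo harmful behavior but also degrades fine-tuning performance.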
