Researchers have identified a "safety mirage" in vision-language models (VLMs): supervised safety fine-tuning can inadvertently reinforce spurious correlations between superficial textual patterns and refusal behavior, leaving models vulnerable to simple text modifications that evade refusals while also making them overly cautious about benign queries. As a remedy, the authors propose machine unlearning, which removes the harmful knowledge itself rather than training the model to pattern-match refusals; this significantly reduces both attack success rates and unnecessary rejections while preserving the model's general capabilities.
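The summary does not spell out the unlearning procedure, but a common recipe in the machine-unlearning literature combines gradient ascent on a "forget" set of harmful examples with ordinary training on a "retain" set of benign data, so that harmful behavior is erased without degrading general capability. The sketch below illustrates that generic recipe only; the toy model, the batch format, and the `retain_weight` trade-off are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch of gradient-ascent unlearning with a retain-set
# regularizer. The tiny model and random data are stand-ins for a real
# safety-tuned VLM and its corpus; `retain_weight` is an assumed
# hyperparameter, not a value from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB = 100  # toy vocabulary size

# Stand-in for a VLM's language head: token ids -> next-token logits.
model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def token_loss(batch):
    """Next-token prediction loss on an (input_ids, target_ids) pair."""
    inputs, targets = batch
    logits = model(inputs)  # (batch, seq, vocab)
    return loss_fn(logits.view(-1, VOCAB), targets.view(-1))

# Toy batches: in practice these would be harmful completions to forget
# and benign instruction data whose behavior should be retained.
forget_batch = (torch.randint(0, VOCAB, (8, 16)),
                torch.randint(0, VOCAB, (8, 16)))
retain_batch = (torch.randint(0, VOCAB, (8, 16)),
                torch.randint(0, VOCAB, (8, 16)))

retain_weight = 1.0  # assumed trade-off between forgetting and utility

for step in range(100):
    optimizer.zero_grad()
    # Ascend on the forget set (negated loss) to erase harmful behavior,
    # descend on the retain set to preserve general capability.
    loss = -token_loss(forget_batch) + retain_weight * token_loss(retain_batch)
    loss.backward()
    optimizer.step()
```

In a real setting, the toy model would be replaced by the safety-tuned VLM (with image features encoded into the token stream), and plain gradient ascent is often swapped for a bounded variant such as negative preference optimization, since an unbounded ascent term can destabilize training.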