Researchers have identified a new type of backdoor attack on Reinforcement Learning with Verifiable Rewards (RLVR), a technique used to train large language models (LLMs). The backdoor can be implanted by poisoning less than 2% of the training data. Once triggered, it significantly degrades the model's safety performance, making the LLM more likely to generate harmful responses. This poses a critical risk for developers and other professionals working on AI security.
Read the full article at arXiv cs.CR (Cryptography & Security)