Researchers have found that web-scraped datasets used to train vision-language AI models often ignore data owners' wishes regarding consent, raising ethical concerns and legal risks such as copyright infringement lawsuits. Analyzing DataComp, a dataset of 12.8 billion text-image pairs, the study finds that many samples come from sites whose terms prohibit scraping and that many show signs of copyright notices or watermarks, suggesting that current AI data collection practices need to improve to respect user consent.
Read the full article at arXiv cs.CR (Cryptography & Security)