
Deleting unethical datasets is not good enough


The researchers’ analysis also shows that Labeled Faces in the Wild (LFW), a dataset introduced in 2007 and the first to use face images scraped from the internet, has changed multiple times over nearly 15 years of use. Although it started out as a resource for evaluating facial recognition models for research purposes only, it is now used almost exclusively to evaluate systems intended for real-world use, despite a label on the dataset’s website cautioning against exactly that.

More recently, the dataset has been reworked into a derivative called SMFRD, which adds face masks to each of the images to advance facial recognition during the pandemic. The authors note that this may pose new ethical challenges. Privacy advocates have criticized such practices for fueling surveillance, and in particular for enabling governments to identify masked protesters.

“This is a really important paper, because people’s eyes are often not open to the complexities and potential harms and risks of datasets,” says Margaret Mitchell, an AI ethics researcher and a leader in responsible data practices, who was not involved in the study.

For a long time, the culture within the AI community has been to assume that data exists to be used, she adds. This paper shows how that assumption can lead to problems down the line. “It’s really important to think through the various values that a dataset encodes, as well as the values that having a dataset available encodes,” she says.

A fix

The study’s authors make several recommendations for how the AI community should move forward. First, dataset creators should communicate more clearly about the intended use of their datasets, both through licenses and through detailed documentation. They should also place tougher restrictions on access to their data, perhaps by requiring researchers to sign terms of use or fill out an application, especially if they plan to create a derivative dataset.
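To make that first recommendation concrete, here is a minimal sketch in Python of what machine-readable intended-use documentation could look like. The DatasetCard structure and its fields (intended_use, prohibited_uses, requires_signed_agreement) are hypothetical illustrations, not anything the study or the LFW maintainers actually publish.

    from dataclasses import dataclass, field

    @dataclass
    class DatasetCard:
        # Hypothetical documentation that travels with the data itself.
        name: str
        license: str                              # e.g. "research-only"
        intended_use: str                         # what the creators designed it for
        prohibited_uses: list = field(default_factory=list)
        requires_signed_agreement: bool = False   # gate access and derivatives

    lfw_card = DatasetCard(
        name="LFW",
        license="research-only",
        intended_use="evaluating face recognition models in research settings",
        prohibited_uses=["evaluating commercial systems meant for deployment"],
        requires_signed_agreement=True,
    )

    def use_is_permitted(card, proposed_use):
        # Refuse any use the documentation explicitly rules out.
        return proposed_use not in card.prohibited_uses

A warning label buried on a website is easy to ignore; metadata that travels with the data, and that downstream tools can check before granting access, is harder to.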

Second, research conferences should establish norms for how data should be collected, labeled and used, and create incentives for responsible dataset creation. NeurIPS, the largest AI research conference, already includes a checklist of best practices and ethical guidelines.

Mitchell suggests taking this a step further. As part of the BigScience project, a collaboration among AI researchers to develop an artificial intelligence model that can parse and generate natural language under a strict ethical standard, she is experimenting with the idea of creating dataset management organizations: teams that not only handle the compilation, maintenance and use of the data, but also work with lawyers, activists and the general public to ensure that it complies with legal standards, is collected only with consent, and can be removed if someone chooses to withdraw personal information. Such management organizations will not be needed for every dataset, but they certainly will be for scraped data that may contain biometric or personally identifiable information or intellectual property.
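As one illustration of that last requirement, here is a minimal sketch in Python of a consent-withdrawal step, assuming the dataset is a simple list of records keyed by a subject identifier. The function withdraw_consent and the subject_id field are hypothetical, not drawn from BigScience’s actual tooling.

    def withdraw_consent(records, subject_id):
        # Drop every record tied to a person who revokes consent.
        return [r for r in records if r.get("subject_id") != subject_id]

    dataset = [
        {"subject_id": "a1", "image": "face_001.jpg"},
        {"subject_id": "b2", "image": "face_002.jpg"},
    ]

    # Person b2 withdraws consent; a steward re-publishes the filtered set.
    dataset = withdraw_consent(dataset, "b2")

In practice a management organization would also need to propagate such removals to copies and derivative datasets, which is precisely why the authors frame stewardship as an ongoing job rather than a one-time cleanup.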

“Dataset collection and monitoring is not a one-time task for one or two people,” she says. “If you do it responsibly, it breaks down into a number of different tasks that require deep thinking, deep expertise, and a variety of different people.”

In recent years, the field has moved more and more toward the belief that more carefully curated datasets will be key to tackling many of the industry’s technical and ethical challenges. It is now clear, though, that creating more responsible datasets on its own is not nearly enough. Those working in AI must also make a long-term commitment to maintaining them and using them ethically.



