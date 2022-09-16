Digital artists, painters, and photographers are beginning to wonder whether the billions of images used to train image synthesis AI models like DALL-E 2 or Stable Diffusion (so fashionable in recent months) include their own creations, previously published on the Internet and selected by the creators of these AIs to be part of their respective ‘datasets’.





In response, the artist couple Mat Dryhurst and Holly Herndon, themselves veterans of neural network training, have built a website on top of one of those datasets: LAION-5B, which was used to train Stable Diffusion, Midjourney and Google’s Imagen, and which contains 5.8 billion images. They note that content from other data sources will be added in the future.

The site is called ‘Have I Been Trained?’, and it lets us search as we would with Google Images, either by uploading a reference image for a reverse search or by entering a search term. That way we can look for our own images to find out whether they were used in this dataset, or simply explore its contents.



Reverse lookup example.



When we run a reverse lookup on an image present in the dataset, the lookup URL shows us which website it was pulled from.
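As a rough illustration of why the source URL matters, the sketch below groups dataset entries by the website they were pulled from and picks out those hosted on a given domain. It assumes each entry is available as an (image URL, caption) pair; the sample records, the domain names and the helper functions are invented for this example and are not part of ‘Have I Been Trained?’.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample records: (image URL, caption) pairs invented for illustration.
records = [
    ("https://i.pinimg.com/originals/ab/cd/example1.jpg", "watercolor landscape"),
    ("https://images.example-portfolio.com/art/piece.png", "portrait study in oil"),
    ("https://i.pinimg.com/736x/ef/gh/example2.jpg", "digital fantasy castle"),
]


def source_host(url: str) -> str:
    """Return the lowercase host name of an image URL ('' if there is none)."""
    return (urlparse(url).hostname or "").lower()


def images_from(records, domain):
    """Keep records hosted on the given domain or on one of its subdomains."""
    domain = domain.lower()
    return [
        (url, caption)
        for url, caption in records
        if source_host(url) == domain or source_host(url).endswith("." + domain)
    ]


# How many sample images came from each host.
host_counts = Counter(source_host(url) for url, _ in records)

# Records that were pulled from a (hypothetical) personal portfolio site.
mine = images_from(records, "example-portfolio.com")
```

An artist who publishes on their own domain could use this kind of filter to list every entry that points back to their site.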

This website does not let us see, while exploring LAION-5B, what metadata the dataset’s creators attached to the images. For that we can use another website (Laion-aesthetic-6pls), which does not support reverse searches but offers much more information about the images included in the dataset (or, at least, about a small sample of them).



Laion-aesthetic-6pls

Metadata is a very important part of using images for AI training, because the correspondence between image and data determines the quality of the results when we enter terms (known as ‘prompts’) into the image generators. Those prompts determine which real images the AI draws its patterns from (patterns, not ‘image pieces’), and those patterns are then applied to the generated works.

A debate about what is ethically acceptable

The aim of Dryhurst and Herndon is to promote a debate on the boundary between what is ethical and what is technologically possible, by denouncing the use without consent of images taken from major internet platforms such as Pinterest, Getty Images, ArtStation or DeviantArt.

They are also promoting (through the Spawning organization) the development of a standard called Source+, designed as a mechanism for artists to allow or deny the use of their images (and text, and audio) as training data.

“I am very optimistic about the possibility and utility of creating a verified database of artists’ opt-in and opt-out wishes.”
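To make the idea of an opt-in and opt-out database concrete, here is a minimal sketch of how such a registry might be consulted when assembling training data. The article does not describe the actual Source+ format, so the `Consent` type, the `registry` contents, the `may_train_on` function and the URLs are all hypothetical stand-ins.

```python
from enum import Enum


class Consent(Enum):
    OPT_IN = "opt-in"
    OPT_OUT = "opt-out"


# Hypothetical registry of artists' wishes, keyed by image URL (invented data).
registry = {
    "https://example.com/art/a.png": Consent.OPT_IN,
    "https://example.com/art/b.png": Consent.OPT_OUT,
}


def may_train_on(url, registry, default_allowed):
    """Return True if the image may be used as training data.

    Works with a recorded wish follow that wish; works with no entry
    fall back to `default_allowed`.
    """
    wish = registry.get(url)
    if wish is None:
        return default_allowed
    return wish is Consent.OPT_IN


candidates = [
    "https://example.com/art/a.png",  # opted in
    "https://example.com/art/b.png",  # opted out
    "https://example.com/art/c.png",  # no recorded wish
]

# Opt-out regime: unlisted works are usable unless the artist objects.
permissive = [u for u in candidates if may_train_on(u, registry, default_allowed=True)]

# Opt-in regime: only works with explicit permission are usable.
strict = [u for u in candidates if may_train_on(u, registry, default_allowed=False)]
```

The `default_allowed` flag is the design choice that matters here: it decides what happens to the many works whose authors have expressed no wish either way.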

The project does not take up the question of whether using these images for training is legal, since the creators of the AIs operate mainly from the United States, where the laws clearly support such use.

Spawning’s goal is not to prevent Stable Diffusion users from typing “house in Rembrandt style” as an AI prompt, because that artist is dead and his work is already in the public domain. Its organizers are instead more concerned that the distinctive style of living artists may be used without their permission.

And even those artists, they believe, will in time not be overwhelmingly opposed to having their works incorporated into datasets:

“I think ultimately more will choose to participate than not, but first we have to establish mutual respect.”

