6. What kinds of copyright-protected training materials are used to train AI models, and how are those materials collected and curated?

6.1. How or where do developers of AI models acquire the materials or datasets that their models are trained on?

To what extent is training material first collected by third-party entities (such as academic researchers or private companies)?

In answering the question “what kinds of copyright-protected training materials are used to train AI models”, I felt it would be easiest to look at the images themselves. The data's native .parquet format, however, is not something everyone can readily open and browse, hence the creation of this website.

Stable Diffusion

The images found on this website were used in the training of Stable Diffusion 1 and 2, as evidenced by the ‘Training’ sections of their Model Card files:

Currently six Stable Diffusion checkpoints are provided, which were trained as follows.

stable-diffusion-v1-1: 237,000 steps at resolution 256x256 on laion2B-en. 194,000 steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).

stable-diffusion-v1-2: Resumed from stable-diffusion-v1-1. 515,000 steps at resolution 512x512 on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to images with an original size >= 512x512, estimated aesthetics score > 5.0, and an estimated watermark probability < 0.5. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an improved aesthetics estimator).

stable-diffusion-v1-3: Resumed from stable-diffusion-v1-2 - 195,000 steps at resolution 512x512 on "laion-improved-aesthetics" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling.

stable-diffusion-v1-4 Resumed from stable-diffusion-v1-2 - 225,000 steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling.

stable-diffusion-v1-5 Resumed from stable-diffusion-v1-2 - 595,000 steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling.

stable-diffusion-inpainting Resumed from stable-diffusion-v1-5 - then 440,000 steps of inpainting training at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and in 25% mask everything.

We currently provide the following checkpoints:

512-base-ema.ckpt: 550k steps at resolution 256x256 on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with punsafe=0.1 and an aesthetic score >= 4.5. 850k steps at resolution 512x512 on the same dataset with resolution >= 512x512.

Stable Diffusion's training methodology demonstrates a bias towards images of high aesthetic quality. That is to say, the quality and capabilities of the resulting AI model are a direct result of the quality of the training data. For this reason, this site focuses on images with an aesthetic score of 4.5 or higher in the LAION dataset.

(Stability AI’s latest release, Stable Diffusion XL, no longer discloses the training data used: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)

The Data

This Site

This site is built on a modestly resourced database with effective indexes for domain and aesthetic score. It holds all 1,379,851,932 URL entries contained in the ‘improved_aesthetics_4.5plus’ dataset.

Domains

Each URL entry was parsed to determine its domain, which was added as a data point. Tallies of images per domain are available on the Totals page, which may also be used as a means of navigation.

In total there are 9,489,747 domains that host image content referenced by this subset.
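The parsing and tallying step described above can be sketched in Python roughly as follows. This is an illustration only, not the code behind this site: the function names are hypothetical, and it assumes that "domain" means the full hostname of each URL (so "www.example.com" and "example.com" would be counted separately).

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased hostname of a URL, or None if it has none."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs (e.g. bad IPv6).
        return None
    return host or None


def tally_domains(urls):
    """Count how many URLs point at each hostname (assumed meaning of 'domain')."""
    counts = Counter()
    for url in urls:
        domain = domain_of(url)
        if domain is not None:
            counts[domain] += 1
    return counts
```

Calling `tally_domains(urls).most_common()` would then give a list ordered by image count, similar in spirit to the Totals page.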

Aesthetic Score

Entries in the Gallery are presented in order from highest to lowest aesthetic score. Low and high threshold values may be set to focus on particular bands of the data.
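A threshold query of this kind maps naturally onto an indexed score column. The sketch below uses Python's built-in sqlite3 module purely for illustration; the table name, column names, index names, and sample rows are all hypothetical, and the database actually used by this site may differ.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not this site's actual tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    "url TEXT NOT NULL, domain TEXT NOT NULL, aesthetic REAL NOT NULL)"
)
# Indexes on the two columns the site is described as indexing.
conn.execute("CREATE INDEX idx_images_domain ON images (domain)")
conn.execute("CREATE INDEX idx_images_aesthetic ON images (aesthetic)")

# Made-up sample rows for demonstration only.
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [
        ("https://example.com/a.jpg", "example.com", 6.2),
        ("https://example.com/b.jpg", "example.com", 4.7),
        ("https://example.org/c.jpg", "example.org", 5.1),
    ],
)


def gallery(conn, low=4.5, high=10.0, domain=None):
    """Return (url, aesthetic) rows with low <= score < high, best first."""
    sql = "SELECT url, aesthetic FROM images WHERE aesthetic >= ? AND aesthetic < ?"
    params = [low, high]
    if domain is not None:
        sql += " AND domain = ?"
        params.append(domain)
    sql += " ORDER BY aesthetic DESC"
    return conn.execute(sql, params).fetchall()
```

For example, `gallery(conn, low=5.0, high=6.5)` would restrict the results to a single band of scores, and passing `domain="example.com"` would narrow them to one host.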

Ranges

Range	Total URLs
4.5 - 5.0	766,601,696
5.0 - 5.5	481,529,790
5.5 - 6.0	111,823,637
6.0 - 6.5	11,461,253
6.5 - 7.0	621,877
7.0 +	13,683

Domain Detail

Viewing the info page for any specific domain reveals the totals for each aesthetic range outlined above, as well as their percentage contribution to those ranges.

The percentages are accompanied by a graph normalized to the domain’s highest contributing percentage. At a glance, this graph may show which aesthetic range a site’s images tend to contribute to the most.

weboptout

Some domain entries include data from weboptout, an open-source Python library that automates scanning websites for Terms of Service language prohibiting external crawling and indexing of their data.

If such language is found, the info page for that domain will contain a link to the Terms of Service, as well as the language that was detected.
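Returning to the Domain Detail page described earlier, the per-range percentages and the normalized graph can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, "percentage contribution" is taken to mean a domain's share of all URLs in a given range, and the per-range totals are copied from the Ranges table.

```python
# Total URLs per aesthetic range, copied from the Ranges table.
RANGE_TOTALS = {
    "4.5 - 5.0": 766_601_696,
    "5.0 - 5.5": 481_529_790,
    "5.5 - 6.0": 111_823_637,
    "6.0 - 6.5": 11_461_253,
    "6.5 - 7.0": 621_877,
    "7.0 +": 13_683,
}


def contribution(domain_counts):
    """Map each range to (percentage of that range's URLs, bar length in 0..1).

    domain_counts maps a range label to the number of URLs a single domain
    has in that range; missing ranges count as zero. The bar length is each
    percentage divided by the domain's highest percentage (assumed meaning
    of the "normalized" graph).
    """
    pct = {
        label: 100.0 * domain_counts.get(label, 0) / total
        for label, total in RANGE_TOTALS.items()
    }
    peak = max(pct.values())
    return {
        label: (p, p / peak if peak else 0.0)
        for label, p in pct.items()
    }
```

Under these assumptions, the range whose bar length is 1.0 is the one where the domain's share is largest, which is what the graph is meant to show at a glance.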

Questions

The intention of this site is to allow researchers to answer some questions for themselves. For any given host found in LAION, consider the following:

Laion Aesthetic Gallery

Welcome