This site is an exploration of the dataset improved_aesthetics_4.5plus, a subset of the LAION-5B dataset. It was created in response to the United States Copyright Office's request for comments on Generative AI, which asks (among other questions) the following:
6. What kinds of copyright-protected training materials are used to train AI models, and how are those materials collected and curated?
6.1. How or where do developers of AI models acquire the materials or datasets that their models are trained on?
To what extent is training material first collected by third-party entities (such as academic researchers or private companies)?
My feeling was that the easiest way to answer the question “what kinds of copyright-protected training materials are used to train AI models” is to look at the images themselves. Viewing that data in its native .parquet form is not accessible to everyone, hence the creation of this website.
The images found within this website were used in the training of Stable Diffusion 1 and 2, as evidenced in the ‘Training’ sections of their Model Card files:
https://huggingface.co/runwayml/stable-diffusion-v1-5
From the SD 1.5 readme file:
Currently six Stable Diffusion checkpoints are provided, which were trained as follows.
stable-diffusion-v1-1: 237,000 steps at resolution 256x256 on laion2B-en. 194,000 steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).
stable-diffusion-v1-2: Resumed from stable-diffusion-v1-1. 515,000 steps at resolution 512x512 on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to images with an original size >= 512x512, estimated aesthetics score > 5.0, and an estimated watermark probability < 0.5. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an improved aesthetics estimator).
stable-diffusion-v1-3: Resumed from stable-diffusion-v1-2 - 195,000 steps at resolution 512x512 on "laion-improved-aesthetics" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling.
stable-diffusion-v1-4 Resumed from stable-diffusion-v1-2 - 225,000 steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling.
stable-diffusion-v1-5 Resumed from stable-diffusion-v1-2 - 595,000 steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling.
stable-diffusion-inpainting Resumed from stable-diffusion-v1-5 - then 440,000 steps of inpainting training at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and in 25% mask everything.
https://huggingface.co/stabilityai/stable-diffusion-2-1
From the SD 2.1 readme file:
We currently provide the following checkpoints:
512-base-ema.ckpt: 550k steps at resolution 256x256 on a subset of LAION-5B filtered for explicit pornographic material, using the LAION-NSFW classifier with punsafe=0.1 and an aesthetic score >= 4.5. 850k steps at resolution 512x512 on the same dataset with resolution >= 512x512.
Stable Diffusion's training methodology demonstrates a bias towards images with high aesthetic scores. That is to say, the quality and capabilities of the resulting AI model are a direct result of the quality of the training data. For this reason, this site focuses on images with an aesthetic score of 4.5 or higher in the LAION dataset.
(Stability AI’s latest release, Stable Diffusion XL, no longer discloses the training data used: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
Entries in improved_aesthetics_4.5plus contain the following columns:
This site is built on a modestly resourced database with effective indexes for domain and aesthetic score. It holds all 1,379,851,932 URL entries contained in the ‘improved_aesthetics_4.5plus’ dataset.
Each URL was parsed to determine its domain, which was stored as an additional data point. Tallies of images by domain are available on the Totals page, which may be used as a means of navigation.
In total there are 9,489,747 domains that host image content referenced by this subset.
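The domain tally described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this site: the helper names `domain_of` and `tally_domains` are hypothetical, and it assumes that "domain" means the full host name in the URL (for example `cdn.example.org`), which may differ from how the site groups subdomains.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL, or '' if none can be parsed.

    Assumption: the full host name is treated as the domain.
    """
    try:
        host = urlsplit(url).hostname  # lower-cased, port and credentials removed
    except ValueError:  # raised for some malformed URLs, e.g. a broken IPv6 host
        return ""
    return host or ""


def tally_domains(urls):
    """Count how many URLs point at each host, skipping unparseable entries."""
    return Counter(d for d in map(domain_of, urls) if d)
```

Calling `tally_domains(...).most_common()` on a list of image URLs would give a ranking comparable in spirit to the Totals page.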
Entries in the Gallery are presented in order of highest to lowest ranked aesthetic score. Low / high threshold values may be set to focus on particular bands of the data.
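As a rough sketch of that ordering, assuming each entry is a hypothetical `(url, score)` pair and that both thresholds are inclusive (the site's actual boundary handling is not documented here):

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical record shape: (image URL, aesthetic score)
Entry = Tuple[str, float]


def gallery_order(
    entries: Iterable[Entry],
    low: Optional[float] = None,
    high: Optional[float] = None,
) -> List[Entry]:
    """Keep entries whose score lies within [low, high], highest score first."""
    kept = [
        e for e in entries
        if (low is None or e[1] >= low) and (high is None or e[1] <= high)
    ]
    return sorted(kept, key=lambda e: e[1], reverse=True)
```

Leaving both thresholds unset returns every entry, sorted from highest to lowest score.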
Aesthetic scores were pre-tallied into the following ranges:
Range | Total URLs |
---|---|
4.5 - 5.0 | 766,601,696 |
5.0 - 5.5 | 481,529,790 |
5.5 - 6.0 | 111,823,637 |
6.0 - 6.5 | 11,461,253 |
6.5 - 7.0 | 621,877 |
7.0 + | 13,683 |
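A tally like the one in this table could be produced as in the sketch below. The names `EDGES`, `LABELS`, `range_label`, and `tally_ranges` are hypothetical, and the sketch assumes each range includes its lower bound and excludes its upper bound (so a score of exactly 5.0 falls in "5.0 - 5.5"); the table itself does not say how boundary values were assigned.

```python
import bisect
from collections import Counter

# Upper edges of every range except the open-ended last one.
EDGES = [5.0, 5.5, 6.0, 6.5, 7.0]
LABELS = ["4.5 - 5.0", "5.0 - 5.5", "5.5 - 6.0", "6.0 - 6.5", "6.5 - 7.0", "7.0 +"]


def range_label(score: float) -> str:
    """Map an aesthetic score (>= 4.5) to its range label."""
    if score < 4.5:
        raise ValueError("scores below 4.5 are not part of this subset")
    # bisect_right counts how many edges are <= score, which is the label index.
    return LABELS[bisect.bisect_right(EDGES, score)]


def tally_ranges(scores):
    """Count scores per range label."""
    return Counter(range_label(s) for s in scores)
```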
Viewing the info page for any specific domain reveals that domain’s totals for each aesthetic range outlined above, as well as its percentage contribution to each range.
The percentages are accompanied by a graph normalized to the domain’s highest contributing percentage. At a glance, this graph may show which aesthetic range a site’s images contribute to the most.
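The arithmetic behind those figures can be sketched as follows. The range totals are taken from the table above; the function name `contribution` and the per-domain counts passed to it are hypothetical, and the sketch assumes a domain's percentage for a range is its URL count in that range divided by the range's total.

```python
# Total URLs per aesthetic range, as listed in the table above.
RANGE_TOTALS = {
    "4.5 - 5.0": 766_601_696,
    "5.0 - 5.5": 481_529_790,
    "5.5 - 6.0": 111_823_637,
    "6.0 - 6.5": 11_461_253,
    "6.5 - 7.0": 621_877,
    "7.0 +": 13_683,
}


def contribution(domain_counts):
    """Return (percentages, bar_heights) for one domain.

    percentages: the domain's share of each range, in percent.
    bar_heights: the same values scaled so the largest share becomes 1.0.
    """
    pct = {
        label: 100.0 * domain_counts.get(label, 0) / total
        for label, total in RANGE_TOTALS.items()
    }
    peak = max(pct.values())
    bars = {label: (p / peak if peak else 0.0) for label, p in pct.items()}
    return pct, bars
```

Because the bars are scaled to the domain's own peak, a small site and a large site can show the same shape even when their absolute percentages differ by orders of magnitude.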
Some domains contain data from weboptout, an open source Python library that automates scanning websites for Terms of Service language that prohibits external crawling and indexing of their data.
If such language is found, the info page for that domain contains a link to the Terms of Service, along with the language that was detected.
The intention of this site is to allow researchers to answer some questions for themselves. For any given host found in LAION, consider the following:
https://laion.ai/blog/laion-aesthetics/#laion-aesthetics-v2 - LAION’s work on which this site is based.
https://laion-aesthetic.datasette.io/ - a full SQL search engine for the top 12 million aesthetic scored images in this dataset.
https://haveibeentrained.com/ - a service which provides text, CLIP neighbor (visual recognition based) and domain based searches, and maintains ‘opt-out’ requests made by license holders.
For more information on how to use the site please visit the Help page.
Copyright in the images made visible through this site is reserved by their respective owners.
This website hosts no image content of its own; it provides deep links to images hosted on their original domains.
Though this site filters NSFW content above a particular ‘punsafe’ threshold, NSFW content may still be found when searching.
This site is an independent research project intended to provide further visibility into AI training data.