The Future of AI Relies on a High School Teacher’s Free Database

On the outskirts of Hamburg, Germany, there's a house with a mailbox that bears a single word, "LAION", written in pencil. This residence belongs to the mastermind behind a widespread data collection operation that's at the core of the AI phenomenon taking the world by storm.

Christoph Schuhmann, a high school teacher, is the individual responsible for LAION, short for "Large-scale AI Open Network." When he is not teaching physics and computer science to German teenagers, he works with a small group of volunteers to build the world's most extensive free AI training dataset. This dataset has already been utilised by prominent text-to-image generators like Stable Diffusion and Google's Imagen.

AI text-to-image generators rely on datasets such as LAION for the vast amounts of visual content that they deconstruct in order to produce new images.

The launch of these products in 2022 was a game-changer that intensified the AI arms race within the tech industry and gave rise to numerous ethical and legal concerns.

Within a few months, copyright infringement lawsuits were filed against generative AI firms like Stability AI and Midjourney, while critics called attention to the violent, sexual, and otherwise problematic imagery present in their datasets. These critics argued that the biases introduced by such content were nearly impossible to eliminate.

Two years ago, Schuhmann, a 40-year-old teacher and trained actor, co-founded LAION with a group of AI enthusiasts from a Discord server. The release of OpenAI's DALL-E, which utilises deep learning to generate digital images based on language prompts, had sparked both his interest and his worry that it would lead to big tech firms monopolising data.

As a result, he was motivated to create a platform that could democratise access to image data, so that models able to turn a language prompt into a custom image, for instance a pink chicken relaxing on a couch, could be built without exclusive reliance on tech giants' proprietary data.

Schuhmann explained that he knew immediately that if the technology were monopolised by a small group of corporations, it would have detrimental consequences for society. Consequently, he and a group of collaborators from the AI enthusiast Discord server decided to construct an open-source dataset that would facilitate the training of text-to-image diffusion models. The process, which lasted several months, was similar to teaching a language using millions of flashcards.

The group utilised raw HTML code collected by Common Crawl, a California-based non-profit, to locate images on the web and pair them with descriptive text. The dataset was not curated by humans.
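The pairing step described above can be illustrated with a minimal sketch. It assumes that the descriptive text comes from each image's `alt` attribute and that the HTML is already available as a string. The class and function names are hypothetical and are not taken from LAION's own tooling, which the article does not describe.

```python
from html.parser import HTMLParser


class ImageAltCollector(HTMLParser):
    """Collects (image URL, alt text) pairs from <img> tags.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        # A valueless or missing alt attribute yields None, so fall back to "".
        alt = (attributes.get("alt") or "").strip()
        # Keep only images that have both a source URL and some descriptive text.
        if src and alt:
            self.pairs.append((src, alt))


def extract_image_text_pairs(html):
    """Return a list of (src, alt) tuples found in the given HTML string."""
    collector = ImageAltCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


if __name__ == "__main__":
    sample_html = (
        '<p><img src="chicken.jpg" alt="A pink chicken relaxing on a couch">'
        '<img src="spacer.gif" alt=""></p>'
    )
    for src, alt in extract_image_text_pairs(sample_html):
        print(src, "|", alt)
```

Images without usable text are skipped here, which reflects the absence of manual curation: the only quality signal in this sketch is whether the page itself supplied a description.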

In a matter of weeks, Schuhmann and his team managed to accumulate 3 million image-text pairs. Three months later, their efforts had produced 400 million pairs, a figure that has since grown to over 5 billion, making LAION the most extensive open-source dataset of its kind. As LAION gained popularity, Schuhmann and his team dedicated their time to the project without pay, receiving only a one-off donation from the machine learning company Hugging Face in 2021.

However, one day a retired hedge fund manager, Emad Mostaque, entered their Discord chat. In 2022, Mostaque launched Stability AI and utilised LAION's dataset to create Stable Diffusion, the company's primary AI image generator. Mostaque also hired two researchers from the LAION organisation. A year later, the company is seeking a valuation of $4 billion, with LAION's data being a significant contributor to this success.

Notably, Schuhmann has deliberately not profited from LAION and is uninterested in doing so, despite the success of companies that have utilised its dataset. Still a high school teacher, he has rejected multiple job offers, determined to keep LAION an independent entity.

