The AI feedback loop: Researchers warn of ‘model collapse’ as AI trains on AI-generated content
Generative AI Takes the Spotlight as Leading Global Companies Embrace the Future
The age of generative AI has arrived with remarkable speed. Reports indicate that up to half the workforce in some organisations now relies on generative AI, while many other companies are racing to market with products that have generative AI capabilities built in.
This rapid adoption marks a significant shift in how businesses operate and underscores the growing influence of artificial intelligence across sectors. OpenAI's ChatGPT, with its exceptional language generation capabilities, has spearheaded the trend, captivating professionals across industries. Employees are leveraging the technology to automate repetitive tasks, streamline customer interactions and even assist in complex decision-making.
A multitude of companies are aiming to capitalise on this potential. By integrating the technology into their products and services, businesses hope to unlock new levels of creativity, interactivity and personalisation. From chatbots and virtual assistants to content creation and data analysis, the applications of generative AI are far-reaching and diverse.
The advent of generative AI also brings with it a set of challenges and considerations. As companies increasingly rely on AI-powered systems, concerns around data privacy, security, and ethics come to the forefront. It becomes crucial for organisations to develop robust frameworks and guidelines to ensure responsible and ethical use of generative AI, safeguarding both individuals and businesses from potential risks.
As the age of generative AI continues to unfold, it holds immense promise for reshaping industries and pushing the boundaries of innovation. The rapid adoption by leading global companies signifies a resounding vote of confidence in this groundbreaking technology. With each passing day, more organisations are joining the generative AI revolution, harnessing its power to unlock new possibilities, drive growth, and navigate the dynamic landscape of the future.
Amidst the growing popularity of generative AI, however, concerns have arisen about the data used to train these models. Products like ChatGPT are built on large language models (LLMs), while Stable Diffusion and Midjourney rest on image diffusion models; all of them initially drew their training data from human sources such as books, articles and photographs, content created without the assistance of artificial intelligence.
As AI-generated content becomes more prevalent on the internet and finds its way into training sets, a pressing question emerges: what happens when models are trained largely on AI-generated rather than human-generated content?
A team of researchers from the United Kingdom and Canada recognised the significance of this issue and recently posted a paper on their findings, "The Curse of Recursion: Training on Generated Data Makes Models Forget", to the open-access preprint server arXiv. Their work sheds light on a troubling aspect of current generative AI technology: "We find that use of model-generated content in training causes irreversible defects in the resulting models."
Focusing on probability distributions in text-to-text and image-to-image AI generative models, the researchers concluded that "learning from data produced by other models causes model collapse—a degenerative process whereby, over time, models forget the true underlying data distribution... this process is inevitable, even for cases with almost ideal conditions for long-term learning."
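The mechanism is easy to see in miniature. The sketch below is an illustrative toy, not the researchers' experiment: it fits a one-dimensional Gaussian to data, samples a fresh "corpus" from the fit, and repeats. The truncation step, which drops the 5% least probable samples, stands in for a generative model's tendency to under-sample rare events; the sample size, generation count and 5% threshold are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data drawn from the true underlying distribution.
data = rng.normal(loc=0.0, scale=1.0, size=5_000)

for gen in range(10):
    # "Train" a model: fit a Gaussian by estimating mean and std.
    mu, sigma = data.mean(), data.std()
    print(f"generation {gen}: mean = {mu:+.3f}, std = {sigma:.3f}")

    # "Generate" the next corpus by sampling from the fitted model.
    samples = rng.normal(mu, sigma, size=5_000)

    # Generative models over-produce likely outputs and under-produce rare
    # ones; mimic that by discarding the 5% of samples farthest from the mean.
    keep = np.abs(samples - mu) < np.quantile(np.abs(samples - mu), 0.95)
    data = samples[keep]
```

In a typical run the fitted standard deviation shrinks by roughly 13% per generation, leaving about a quarter of the true spread after ten generations: the tails of the distribution, i.e. the rare events, are the first thing the chain of models forgets.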
Ilia Shumailov, one of the paper's lead authors, expressed surprise at how rapidly model collapse occurs: "Models can rapidly forget most of the original data from which they initially learned... Over time, mistakes in generated data compound and ultimately force models that learn from generated data to misperceive reality even further."
In essence, the more AI-generated data a model is trained on, the worse its performance becomes over time: it produces more errors in its responses and content, and the variety of its non-erroneous outputs shrinks.
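The loss of variety can be made concrete with another toy sketch, again an illustrative assumption rather than the paper's setup: a first-order Markov chain stands in for a language model, each generation trains only on the previous generation's output, and the vocabulary is tracked. Rare words that happen not to be sampled in one generation can never reappear in a later one.

```python
import random
from collections import defaultdict

random.seed(0)

# A tiny "human" corpus; the words and sizes are arbitrary.
corpus = ("the cat sat on the mat the dog sat on the log "
          "a bird flew over the tall green tree near the river").split()

def train(tokens):
    # First-order Markov "language model": map each token to its successors.
    model = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        model[a].append(b)
    return model

def generate(model, start, n):
    # Sample n tokens, restarting from `start` at a dead end.
    out, cur = [start], start
    for _ in range(n - 1):
        successors = model.get(cur)
        cur = random.choice(successors) if successors else start
        out.append(cur)
    return out

tokens = corpus
for gen in range(8):
    print(f"generation {gen}: vocabulary = {len(set(tokens))} distinct words")
    model = train(tokens)
    # The next generation trains only on the previous generation's output.
    tokens = generate(model, tokens[0], len(corpus))
```

Because every generated token must already appear in the training data, the vocabulary count can only fall or stay flat from one generation to the next, and in practice it falls: the model's world steadily narrows to its most common phrases.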
Ross Anderson, a professor of security engineering at Cambridge University and the University of Edinburgh, further elaborated on the implications of this research in a blog post. He likened the impending flood of AI-generated content to the ocean's accumulation of plastic trash and the atmosphere's saturation with carbon dioxide. Anderson pointed out that this surge in subpar content will make it harder to train newer models through web scraping, giving an advantage to companies that have already accumulated training data or control access to human interfaces on a large scale. He also noted that AI startups are already exerting pressure on the Internet Archive for training data.
Ted Chiang, the renowned sci-fi author who has worked as a technical writer at Microsoft, contributed to the discussion with a New Yorker article, "ChatGPT Is a Blurry JPEG of the Web". He argued that AI copies of copies entail a degradation of quality, much as visible artefacts accumulate when a JPEG image is copied again and again.
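Chiang's analogy is straightforward to reproduce. Here is a minimal sketch using the Pillow imaging library, where the file name, copy count and quality setting are placeholders:

```python
from io import BytesIO
from PIL import Image  # Pillow

# "photo.jpg" is a placeholder; any JPEG input will do.
img = Image.open("photo.jpg").convert("RGB")

for _ in range(100):
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=75)  # lossy re-encode
    buf.seek(0)
    img = Image.open(buf).convert("RGB")      # the copy becomes the original

img.save("copy_of_copies.jpg")
```

Most of the visible damage appears within the first few re-encodings; repeating a JPEG encode at identical settings is close to idempotent, while varying the quality between passes degrades the image faster.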
Another analogy comes from the 1996 sci-fi comedy "Multiplicity", starring Michael Keaton: a man clones himself and then clones the clones, and each successive copy turns out less intelligent and more absurd than the last.
As the age of generative AI progresses, these challenges surrounding the quality and integrity of AI-generated data necessitate careful consideration. Striking a balance between the benefits of AI-driven content generation and the potential risks of model collapse is crucial to ensure the responsible and sustainable development of this transformative technology.