Stable Diffusion and AI Image Generation

Summary

In 2022, generating pictures from text went from research demo to global phenomenon in a single summer. OpenAI’s DALL·E 2 and Google’s Imagen proved that text prompts could conjure photorealistic images, but both kept their models behind corporate servers. Then, on August 22, 2022, Stability AI and a German university group released Stable Diffusion with weights that anyone could download and run on a consumer gaming GPU. Within weeks an explosion of fine-tunes, plugins, and tools turned image generation into a participatory medium, and set off one of the fiercest copyright and ethics fights in the history of AI, as artists discovered that their life’s work had been scraped to train a machine that now competed with them. The underlying technology, the diffusion model, became the dominant approach not just for images but also for video, audio, 3-D, and even protein structure.

How Diffusion Works: Sculpting from Noise

Earlier image generators relied on GANs (generative adversarial networks; see The Generative AI Revolution), which were powerful but notoriously unstable to train. Diffusion models took a different, more robust route, rooted in physics.

The idea: take a real image and gradually add random noise over many steps until only pure static remains. This forward process is simple to define mathematically. Then train a neural network to reverse it, predicting and removing a little noise at each step. Once trained, the network starts from pure random noise and denoises it step by step into a coherent image. A text prompt, embedded by a text encoder such as CLIP, steers the denoising so that the picture that emerges matches the words. The mathematics traces to a 2015 paper by Sohl-Dickstein et al. and was made practical by Ho et al.’s DDPM (2020).
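The forward (noising) half of this recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind any released model: the closed-form noising rule and the linear schedule (1,000 steps, variances from 1e-4 to 0.02) follow the DDPM setup, while the function names and the four-value toy "image" are assumptions made here for illustration. The last helper inverts the noising using the exact noise that was added; a trained network would have to estimate that noise instead.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (values assumed from the DDPM setup)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """alpha_bar[t]: running product of (1 - beta), the signal variance left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump directly to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps


def recover_x0(xt, eps, t, alpha_bars):
    """Undo add_noise given the true noise; a denoising network only approximates eps."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [(x - noise_scale * e) / signal for x, e in zip(xt, eps)]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)

rng = random.Random(0)            # fixed seed so the sketch is repeatable
image = [0.5, -0.25, 1.0, 0.0]    # toy "image" of four pixel values (hypothetical)
noisy, eps = add_noise(image, 500, alpha_bars, rng)
restored = recover_x0(noisy, eps, 500, alpha_bars)
```

Under this schedule, less than a tenth of the original signal variance survives at step 500, and `alpha_bars[-1]` is close to zero, which is why sampling can begin from pure noise.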

The Latent-Diffusion Breakthrough

Pure pixel-space diffusion is brutally expensive: every denoising step touches every pixel of the image, and sampling takes many steps (the original DDPM used 1,000). The decisive efficiency idea came from the CompVis group at Ludwig Maximilian University of Munich (LMU), led by Björn Ommer: High-Resolution Image Synthesis with Latent Diffusion Models (Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Ommer; arXiv Dec 2021, CVPR 2022).

The trick: run diffusion not on raw pixels but in a compressed latent space produced by an autoencoder, roughly 48× smaller in Stable Diffusion’s configuration. The model denoises the small latent, then decodes it once into a full image. This latent diffusion model (LDM) cut the compute enough that generation became feasible on a single consumer GPU. That efficiency is exactly what made an open, run-it-at-home model possible.
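As a rough check on the 48× figure, the sketch below counts values before and after encoding. The shapes are assumptions not stated in the text: a 512×512 RGB input and a latent with 8× spatial downsampling and 4 channels, as commonly cited for Stable Diffusion v1. The helper names are illustrative only.

```python
def element_count(height, width, channels):
    """Number of scalar values in a height x width x channels array."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent dimensions for an autoencoder with the assumed downsampling factor."""
    return height // downsample, width // downsample, latent_channels


pixel_values = element_count(512, 512, 3)                 # raw RGB image
latent_values = element_count(*latent_shape(512, 512))    # compressed latent
compression = pixel_values / latent_values
print(f"{pixel_values} pixel values -> {latent_values} latent values "
      f"({compression:.0f}x smaller)")
```

Every denoising step then operates on the smaller array, and the decoder runs only once at the end.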

August 2022: The Weights Go Public

Stable Diffusion was released on August 22, 2022, as a collaboration among Stability AI (led by Emad Mostaque, which funded the compute), the LMU CompVis group (the method), Runway (Patrick Esser), LAION (the dataset), and others. It was trained largely on subsets of LAION-5B, a dataset of ~5 billion image–text pairs scraped from the open web.

The radical move was the release itself: publicly downloadable weights under the CreativeML OpenRAIL-M license, which is permissive but attaches use-based restrictions. Where DALL·E 2 and Imagen stayed on their makers’ servers, Stable Diffusion was a file you could download and run yourself. This single decision reshaped the field:

An ecosystem exploded. Within months: web UIs (AUTOMATIC1111), fine-tuning methods (DreamBooth, LoRA), precise structural control (ControlNet), and a sprawling community marketplace (Civitai) of custom models for any style imaginable.
The competitive landscape. DALL·E (Jan 2021) and DALL·E 2 (Apr 2022) from OpenAI, Google’s Imagen (May 2022), and the subscription service Midjourney (open beta July 2022) defined the closed, polished end; Stable Diffusion defined the open, hackable end. Later versions — SDXL (2023) and Stable Diffusion 3 (2024) — improved quality and text rendering.

Dead End / Reckoning: The Copyright and Ethics Firestorm

Open weights plus web-scraped training data lit a fire that still burns.

Artists vs. the scrapers. LAION-5B indexed billions of web images, many of them copyrighted and collected without consent or payment, including the portfolios of living, named artists whose styles the model could now imitate on demand. Andersen v. Stability AI (a class action by artists) and Getty Images v. Stability AI (Getty alleged that ~12 million of its photos were used, with some outputs even reproducing a garbled Getty watermark) became landmark test cases for whether training on copyrighted work is infringement or fair use. The legal question remains unsettled and is the defining IP fight of the generative era.
Deepfakes and abuse. Because the weights were downloadable and could be modified freely, the safety filters Stability shipped could simply be removed. The technology was quickly used to generate non-consensual sexual imagery and child sexual abuse material (CSAM); researchers later found that CSAM had contaminated the LAION dataset itself, forcing its temporary withdrawal. The same openness that democratized creativity also stripped away the guardrails that hosted services could enforce.
The business didn’t hold. Stability AI gave away its crown jewel and struggled to monetize it; amid mounting losses and departures, founder Emad Mostaque resigned as CEO in March 2024. The lab that defined open image generation became a cautionary tale about whether “open and free” is a viable strategy at the AI frontier — the same tension faced by open-weight LLM labs.

Beyond Images

Diffusion turned out to be a general-purpose generative engine. The same denoising principle now powers text-to-video (OpenAI’s Sora, Google’s Veo, Runway), audio and music generation, 3-D asset creation, and, in a striking cross-pollination, the molecular-structure generator inside AlphaFold 3. The image models of 2022 were the visible tip of a method that quietly became one of the two dominant paradigms of generative AI, alongside the transformer-based language model.

Fun Fact: A University Lab Out-Shipped Big Tech

The architecture behind the world’s most-downloaded image generator was not born at OpenAI or Google but at a public university computer-vision lab in Munich. Google’s Imagen and OpenAI’s DALL·E 2 were arguably higher quality at launch — but they stayed locked indoors. The CompVis group’s latent-diffusion paper made generation cheap enough to set free, and Stability AI’s decision to release the weights meant a German academic method, not a Silicon Valley product, became the foundation of the open image-generation world. Big Tech had the better demos; the university lab had the bigger impact.