Fei-Fei Li and ImageNet: The Dataset That Fed the Revolution

Summary

Fei-Fei Li arrived in the United States from China at sixteen with no English and little money, worked in a dry cleaning shop to help her family pay their mortgage, and went on to create ImageNet, a dataset of fourteen million labeled images that became the training ground for the deep learning revolution. The ImageNet Large Scale Visual Recognition Challenge, which she organized from 2010 onward, produced the 2012 AlexNet result that redirected computer vision research toward deep neural networks. Her subsequent career (Stanford professor, Chief Scientist of AI/ML at Google Cloud, co-founder of the Stanford Institute for Human-Centered Artificial Intelligence) made her one of the most influential figures in both the technical development and public understanding of artificial intelligence.

Immigration and the Dry Cleaning Shop

Li Feifei was born on July 3, 1976, in Beijing. Her parents were both scientists; her father worked in physics, her mother in accounting. In 1992, when she was sixteen, the family emigrated to Parsippany, New Jersey. The transition was difficult in the ways immigrant transitions typically are, and in some particular ones: her parents struggled to find work in their fields, and the family’s finances were precarious.

Li got a part-time job at a dry cleaning shop — cleaning, pressing, working the counter — to help her family make mortgage payments. She has described the experience in terms of the specific anxiety of financial precariousness: not the drama of poverty, but the grinding arithmetic of household expenses and what happens if the next payment cannot be made. The work continued through high school and into her early college years.

She was admitted to Princeton and graduated summa cum laude in 1999 with a degree in physics, writing her senior thesis on the neural correlates of visual attention. She spent a gap year in Ganzi, Tibet, studying traditional Tibetan medicine, an experience she has described as broadening: a year doing something completely different from academic research before committing to it fully. She completed her PhD in electrical engineering at the California Institute of Technology in 2005, working with Pietro Perona on computational models of visual recognition.

The Idea That Seemed Obvious

Li joined the faculty at the University of Illinois Urbana-Champaign in 2005, then moved to Princeton before settling at Stanford in 2009, where she built the Stanford Vision Lab. Her central research question was how computers could recognize visual objects: given a photograph, could a machine identify what it contained?

The problem was old. Efforts at machine vision dated to the late 1960s. By the early 2000s, there were many proposed approaches (hand-crafted visual features, support vector machines, bag-of-words models) that worked reasonably well on controlled benchmark datasets. But Li observed a fundamental limitation: the datasets themselves were tiny. The standard benchmark, Caltech 101, had 9,144 images across 101 categories. A computer vision algorithm trained on Caltech 101 had seen, on average, about ninety images per category.

Her intuition was straightforward and, in retrospect, correct: human visual recognition is robust because humans have seen millions of images in thousands of categories across enormous variation in lighting, angle, occlusion, and context. Any algorithm hoping to match human-level visual recognition needed to be exposed to a comparable richness of data. The community had been trying to develop better algorithms; she wanted to build better data.

Info

The insight underlying ImageNet, that the scale of training data matters as much as algorithmic architecture, was not yet conventional wisdom in 2006. The dominant approach to improving AI performance was algorithm development: better features, better classifiers, better optimization methods. The idea that simply providing more data would drive improvements equal to or greater than algorithmic advances was contested. It took ImageNet and AlexNet together to persuade the field.

Building ImageNet

Li presented the idea at a Princeton faculty meeting in 2006 and encountered skepticism. A colleague told her the idea was “stupid” — building a dataset was not research, it was engineering labor. She proceeded anyway.

The architectural backbone was WordNet, a lexical database of English developed by George Miller and colleagues at Princeton starting in the 1980s, which groups English words into sets of synonyms called synsets and links those synsets into semantic hierarchies. Li used WordNet’s structure to define the categories for ImageNet: each node in the WordNet hierarchy (from specific objects like “golden retriever” up through intermediate categories like “dog,” “mammal,” “animal,” “living thing”) would have a corresponding set of images. The target was roughly 500–1,000 images per synset, across all 80,000+ WordNet noun synsets.
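The hierarchy described above can be sketched with a small parent table. This is an illustrative toy, not ImageNet's actual storage format: the five synsets come from the example chain in the text, while the image file names and the helper functions `hypernym_chain` and `images_under` are hypothetical.

```python
# Toy hypernym table: each synset maps to its parent (None marks the root).
# The five entries follow the example chain in the text; the real WordNet
# noun hierarchy is vastly larger.
HYPERNYM = {
    "golden retriever": "dog",
    "dog": "mammal",
    "mammal": "animal",
    "animal": "living thing",
    "living thing": None,
}

# Hypothetical image lists attached directly to some synsets.
IMAGES = {
    "golden retriever": ["gr_001.jpg", "gr_002.jpg"],
    "dog": ["dog_001.jpg"],
}


def hypernym_chain(synset):
    """Return the path from a synset up to the root, most specific first."""
    chain = []
    while synset is not None:
        chain.append(synset)
        synset = HYPERNYM[synset]
    return chain


def images_under(synset):
    """Collect images attached to a synset and to every synset beneath it."""
    found = list(IMAGES.get(synset, []))
    for child, parent in HYPERNYM.items():
        if parent == synset:
            found.extend(images_under(child))
    return found


if __name__ == "__main__":
    print(hypernym_chain("golden retriever"))
    print(images_under("mammal"))
```

Under this assumed layout, asking for the images under "mammal" also returns the dog and golden retriever images, which is one way a general category can draw on its more specific descendants.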

The labor problem was formidable. Collecting 14 million images required download, filtering for explicit content, and labeling — each image needed to be verified as belonging to its category. Early in the project, Li and her students attempted to hire Princeton undergraduates to label images at $10 per hour. The cost was prohibitive.

The solution was Amazon Mechanical Turk, Amazon’s crowdsourcing platform that connected requesters with “workers” who completed small online tasks for small payments. Li’s team developed protocols for using Mechanical Turk to label images at scale: workers were shown an image and asked whether it depicted a specific concept, with multiple workers confirming each label to ensure quality. The cost per labeled image was approximately one cent. The project ran from 2007 to 2009, involving workers from 167 countries and producing, ultimately, more than 14 million images across 21,841 categories — the largest labeled image dataset ever assembled.
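The idea of having several workers confirm each label can be illustrated with a simple vote aggregator. This is a minimal sketch under assumed rules, not the protocol Li's team used: the function name `aggregate_votes`, the minimum of three votes, and the two-thirds agreement threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def aggregate_votes(votes, min_votes=3, min_agreement=2 / 3):
    """Decide whether an image depicts a concept from yes/no worker votes.

    votes: list of booleans, one per worker (True means "yes, it depicts it").
    Returns True (accept), False (reject), or None (no clear consensus yet).
    The thresholds are illustrative assumptions, not ImageNet's real settings.
    """
    if len(votes) < min_votes:
        return None  # too few answers to decide either way
    counts = Counter(votes)
    yes_share = counts[True] / len(votes)
    if yes_share >= min_agreement:
        return True
    if 1 - yes_share >= min_agreement:
        return False
    return None  # workers are split; the image would need more votes


if __name__ == "__main__":
    print(aggregate_votes([True, True, False]))         # accepted
    print(aggregate_votes([True, False]))               # undecided: too few votes
    print(aggregate_votes([True, True, False, False]))  # undecided: split
```

Returning an explicit "undecided" value rather than forcing a decision reflects the general point in the text: quality comes from agreement among workers, so an image without agreement is better sent back for more votes than labeled on a coin flip.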

An initial version of the dataset was presented at the CVPR 2009 conference and made available, free of charge, to any researcher who wanted it.

The ILSVRC and the Competition

In 2010, Li and her colleagues launched the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition in which teams trained algorithms on a subset of ImageNet (1.2 million training images across 1,000 categories) and were evaluated on their ability to correctly classify a test set of 150,000 images. The metric was top-5 error: the fraction of test images for which the correct category did not appear among the algorithm’s five highest-confidence predictions.
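To make the metric concrete, here is a minimal sketch of how a top-5 error rate could be computed from per-category confidence scores. The function name `top_k_error` and the toy scores are assumptions for illustration; this is not the challenge's official evaluation code.

```python
def top_k_error(scores, true_labels, k=5):
    """Fraction of examples whose true label is missing from the top-k guesses.

    scores: list of per-example score lists, one score per category.
    true_labels: list of correct category indices, one per example.
    """
    misses = 0
    for row, truth in zip(scores, true_labels):
        # Rank category indices from highest to lowest confidence.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if truth not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)


if __name__ == "__main__":
    # Two toy examples over six categories (ILSVRC used 1,000).
    scores = [
        [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],  # true class 4 is ranked last
        [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],  # true class 0 is ranked first
    ]
    print(top_k_error(scores, [4, 0]))  # one miss out of two examples
```

A prediction counts as correct if the right category appears anywhere in the five guesses, which is more forgiving than top-1 accuracy and suits a benchmark with many visually similar categories.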

The first ILSVRC in 2010 was won by a team from NEC Labs with a top-5 error rate of 28%. The 2011 winner improved to 25.8%. The competition attracted teams from major research labs worldwide; it was, by design, a public leaderboard that measured progress annually and made comparison across approaches straightforward.

The 2012 competition produced a result that, more than any other single event, catalyzed the deep learning revolution. AlexNet, entered by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton from the University of Toronto, achieved a top-5 error rate of 15.3%. The second-place entry achieved 26.2%. That was a gap of nearly eleven percentage points on a benchmark where the winning error had fallen by only about two points the year before. The winning system was a deep convolutional neural network trained on two NVIDIA GPUs using techniques (dropout regularization, ReLU activations, data augmentation) that the team had refined specifically for this problem.

Every year after 2012, the ILSVRC winners were deep neural networks. By 2015, the winning top-5 error rate had dropped to 3.57% — better than the generally cited human error rate of approximately 5% on the same test. The competition had run its course; in 2017, the final ILSVRC was held, with the organizing committee noting that the state of the art had advanced beyond what the benchmark could meaningfully distinguish.

Warning

The comparison to “human error rate” on ImageNet requires interpretation. The human baseline of approximately 5% top-5 error was measured on a specific protocol — a human working through the same 1,000-category task with the same constraints — and represents performance on a difficult, fine-grained recognition task (distinguishing 120 breeds of dog, for example) that most humans have never been asked to do. It is not a claim that machines in 2015 matched human visual cognition generally; it is a claim that they matched human performance on this specific benchmark. The difference is important.

See Geoffrey Hinton and Deep Learning for the full story of AlexNet and its implications.

Google, Stanford, and Human-Centered AI

In 2017, Li took a leave from Stanford to join Google as Chief Scientist of AI/ML at Google Cloud, a role that gave her visibility into how AI systems were being deployed at commercial scale — not in research papers, but in products used by millions. The experience shaped her thinking about the gap between AI research and AI deployment: the problems that mattered most in research (benchmark performance) were not always the problems that mattered most in practice (reliability, fairness, interpretability, impact on workers).

She returned to Stanford in 2018. In 2019, she co-founded the Stanford Institute for Human-Centered Artificial Intelligence (HAI) with John Etchemendy, the former Stanford provost. HAI’s mission was to ensure that AI development was guided by human needs, values, and welfare, and that the people most affected by AI deployment had representation in the conversations about how it was designed and governed.

The institute’s founding reflected a position that Li had developed over years: that the technical and ethical questions in AI were not separable. Building systems that recognized images well was not sufficient if those systems failed systematically on images of certain groups of people, or if the dataset used to train them reflected the biases of the people who collected it.

The Dataset and Its Biases

ImageNet’s scale and influence also made it a test case for the emerging field of dataset ethics. Researchers examining ImageNet found that its images of people reflected demographic skews that produced predictable failures: recognition systems trained on ImageNet performed worse on darker-skinned faces, on women in professional contexts, and on cultural contexts underrepresented in the labeled data.

More specifically, ImageNet’s person categories, which used nouns from WordNet to label images of people, included offensive and derogatory terms that had been in the WordNet vocabulary. A 2020 paper by Vinay Uday Prabhu and Abeba Birhane, “Large image datasets: A pyrrhic win for computer vision?”, documented these problems systematically. Li’s team responded by working to audit and clean the person-related categories in ImageNet.

The episode illustrated a broader problem with large datasets assembled at speed using crowdsourced labor: the categories that data encodes, and the images that workers select as representative of those categories, reflect the perspectives and blind spots of the people involved. A dataset assembled primarily by American workers using American conceptions of what objects look like will differ systematically from one assembled globally, and training on it will produce systems that fail in predictable ways for users whose experience differs from the training distribution.
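One common way to surface this kind of failure is to report error rates separately for each group instead of a single aggregate number. The sketch below assumes a hypothetical record format of (group, predicted label, true label); the function name `error_rate_by_group` and the toy records are invented for illustration and are not measurements from ImageNet.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Compute the error rate separately for each group.

    records: iterable of (group, predicted_label, true_label) tuples.
    Returns a dict mapping each group to its fraction of wrong predictions.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, truth in records:
        totals[group] += 1
        if predicted != truth:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Invented toy data: group "A" is well represented in training,
    # group "B" is not. These are not real results.
    records = [
        ("A", "stove", "stove"),
        ("A", "soap", "soap"),
        ("A", "stove", "stove"),
        ("A", "soap", "stove"),
        ("B", "stove", "stove"),
        ("B", "basket", "stove"),
    ]
    print(error_rate_by_group(records))
```

In this toy data the aggregate error rate is two in six, which hides the fact that group "B" fails twice as often as group "A"; breaking the metric down by group is what makes the gap visible.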

Li became one of the more articulate voices on this problem precisely because she had built the most influential dataset in the field. She was not defending the status quo; she was arguing, from direct experience, that the way data was collected, labeled, and evaluated shaped the AI systems built on it in ways that needed to be understood and addressed.

AI4ALL, a nonprofit Li co-founded in 2017 with Olga Russakovsky and Rick Sommer, runs summer programs to expose high school students — particularly students from groups underrepresented in AI — to AI research and careers. The programs operate at Stanford, Carnegie Mellon, Princeton, Boston University, and other institutions. The premise is that increasing the diversity of people who build AI is not only a matter of fairness; it is necessary for building AI systems that work well for diverse populations.

The Woman Who Fed the Revolution

There is something specific about Fei-Fei Li’s position in the deep learning story that deserves attention. The technical heroes of the AlexNet moment are Hinton, Krizhevsky, Sutskever, LeCun, and Bengio — the algorithm architects. Li built the arena in which the race was run. Without ImageNet, the 2012 moment would have happened differently, or later, or on a benchmark that would have made the improvement less legible.

The central insight, that data scale rather than algorithmic sophistication was the bottleneck, was not self-evident in 2006. It required someone who looked at the field from a systems perspective and saw what was missing. Li’s combination of computer vision expertise, interest in cognitive science, and willingness to spend years on an infrastructure project that by itself produced no publications was exactly what the task needed.

She was named a recipient of the 2025 Queen Elizabeth Prize for Engineering, alongside Geoffrey Hinton, Yoshua Bengio, Yann LeCun, John Hopfield, Jensen Huang, and Bill Dally, for contributions to machine learning. Her TED talk “How we’re teaching computers to understand pictures” (2015) has been viewed more than 3 million times and remains one of the clearest accessible explanations of how visual recognition AI works.

For the AI research context in which ImageNet was deployed, see The Rise of Artificial Intelligence and The GPU Revolution. For the role of datasets in shaping AI capabilities and limitations, see Andrew Ng and AI Education.