AI Ethics and Algorithmic Bias

Summary

Artificial intelligence systems encode the data they are trained on, including the historical biases, structural inequalities, and measurement errors embedded in that data. By the mid-2010s, AI systems were making or informing decisions about parole, hiring, lending, medical care, and facial recognition at scale, in ways that systematically disadvantaged Black people, women, and other marginalized groups. The field of AI ethics emerged as a combination of technical research (measuring and mitigating bias in ML systems), policy work (advocating for regulation), and institutional struggle (researchers inside major tech companies conflicting with corporate priorities). Key milestones include the COMPAS recidivism prediction controversy (2016), Gender Shades (Joy Buolamwini and Timnit Gebru, 2018), and Google’s firing of Timnit Gebru (2020). Running through all of them is an ongoing debate about what fairness in automated decision-making actually means.

The Problem: Algorithms Reflect Data History

Machine learning models learn patterns from historical data. If the historical data reflects discrimination — if loan approval rates, hiring rates, criminal conviction rates, or medical diagnosis rates have been systematically different across demographic groups because of historical bias rather than actual underlying differences — then a model trained on that data will reproduce those patterns.

This is not a bug that can be fixed by making models more accurate. It is a structural property of learning from biased data. A credit model trained on historical lending data in the United States can learn to score Black applicants as higher risk, because historical lending discrimination made Black borrowers more likely to face financial stress that correlated with loan default, independent of creditworthiness. A hiring model trained on historical hiring decisions will learn the patterns of those decisions, including their discriminatory elements.

The problem extends beyond historical bias in training data. Measurement bias occurs when the proxies used to represent the quantity of interest differ systematically across groups. A healthcare model predicting patient risk might use healthcare spending as a proxy for health need — but if Black patients receive less care for equivalent conditions (as documented in the medical literature), spending will underestimate need for Black patients and the model will allocate less care to a sicker population.
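The spending-as-proxy effect can be illustrated with a toy calculation. The sketch below is not drawn from any real system: the `Patient` class, the helper functions, the need values, and the 0.65 access factor are all hypothetical, chosen only to show how ranking on a proxy that is depressed for one group changes who gets selected.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    group: str
    need: float      # true health need (not observed by the model)
    spending: float  # observed proxy that the model ranks on


def make_patients(needs, access):
    """Build patients whose spending is need scaled by a per-group access factor."""
    return [
        Patient(group, need, need * factor)
        for group, factor in access.items()
        for need in needs
    ]


def select_top(patients, k, key):
    """Pick the k patients with the highest value of `key`."""
    return sorted(patients, key=key, reverse=True)[:k]


def count_by_group(selected):
    """Count how many selected patients belong to each group."""
    counts = {}
    for patient in selected:
        counts[patient.group] = counts.get(patient.group, 0) + 1
    return counts


# Hypothetical numbers: both groups have the same distribution of need,
# but group B generates only 65% of the spending for the same need.
needs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
patients = make_patients(needs, access={"A": 1.0, "B": 0.65})

by_spending = count_by_group(select_top(patients, 6, key=lambda p: p.spending))
by_need = count_by_group(select_top(patients, 6, key=lambda p: p.need))

print("selected by spending:", by_spending)
print("selected by need:    ", by_need)
```

Under these assumed numbers, ranking by true need selects three patients from each group, while ranking by spending selects five from group A and one from group B, even though the two groups are equally sick by construction.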

Representation bias occurs when training datasets underrepresent certain groups, leading models to perform worse on those groups. A facial recognition system trained primarily on light-skinned faces will perform worse on dark-skinned faces, not because darker skin is inherently harder to recognize but because the model has seen too few such faces to learn the relevant features.

Gender Shades: Measuring Face-Based AI

Joy Buolamwini, while a graduate student at the MIT Media Lab, noticed that a facial recognition system could not detect her face until she put on a white mask. This observation led to her master’s thesis research and the Gender Shades paper (co-authored with Timnit Gebru, 2018), one of the most influential AI ethics studies of the decade.

Gender Shades evaluated commercial facial analysis systems from Microsoft, IBM, and Face++ on a task of binary gender classification (man/woman). The study used the Pilot Parliaments Benchmark, a dataset of photographs of national parliamentarians from six countries that the authors assembled and labeled by gender and skin type, and measured accuracy across intersections of gender and skin tone (darker-skinned females, darker-skinned males, lighter-skinned females, lighter-skinned males).

The results were stark. All three systems performed significantly worse on darker-skinned faces, and significantly worse on female faces, with the worst performance on darker-skinned females:

Lighter-skinned males: error rate as low as 0.0%
Darker-skinned females: error rate as high as 34.7%

The paper demonstrated that commercial AI systems had substantial performance disparities across demographic groups and that vendors’ marketing claims of high accuracy were based on benchmark results that did not reflect performance on underrepresented groups. The study triggered rapid responses from all three companies, which improved their systems before a 2019 follow-up study.

Gender Shades established the methodology of algorithmic audit — systematic measurement of AI system performance across demographic subgroups using independently collected evaluation data — as a standard tool in AI ethics research.
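The core computation in an audit of this kind is a disaggregated error rate: the same metric reported per subgroup rather than once for the whole test set. The following is a minimal sketch of that idea, not the code used in the paper. The function names and the record layout are assumptions, and the toy records are made up for illustration and are not the study’s data.

```python
from collections import defaultdict


def error_rates_by_subgroup(records):
    """Return {(skin_tone, gender): error_rate} for a list of audit records.

    Each record is a tuple (skin_tone, true_gender, predicted_gender).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for skin_tone, true_gender, predicted_gender in records:
        key = (skin_tone, true_gender)
        totals[key] += 1
        if predicted_gender != true_gender:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


def overall_error_rate(records):
    """Return the single aggregate error rate over all records."""
    wrong = sum(1 for _, true_gender, predicted in records if predicted != true_gender)
    return wrong / len(records)


# Made-up toy records: 10 faces per intersectional subgroup.
records = (
    [("lighter", "male", "male")] * 10
    + [("lighter", "female", "female")] * 9
    + [("lighter", "female", "male")] * 1
    + [("darker", "male", "male")] * 9
    + [("darker", "male", "female")] * 1
    + [("darker", "female", "female")] * 7
    + [("darker", "female", "male")] * 3
)

rates = error_rates_by_subgroup(records)
overall = overall_error_rate(records)

print(f"overall error rate: {overall:.3f}")
for subgroup, rate in sorted(rates.items()):
    print(subgroup, f"{rate:.3f}")
```

In this toy data the aggregate error rate is 12.5%, which hides a range from 0% for lighter-skinned males to 30% for darker-skinned females. That gap between the headline number and the worst subgroup is what a disaggregated audit is designed to surface.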

The Regulatory Consequence

Gender Shades contributed to a wave of facial recognition regulation. San Francisco banned city use of facial recognition in 2019. The EU AI Act (2024) restricted the use of real-time facial recognition in public spaces. In 2020, Amazon announced a moratorium on police use of its Rekognition facial recognition product, which it later extended, and IBM exited the facial recognition market entirely. Buolamwini’s work, alongside organizing by civil liberties groups, directly influenced regulatory outcomes.

COMPAS: Algorithmic Recidivism Prediction

In May 2016, ProPublica published “Machine Bias,” an investigation into COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), an algorithmic risk assessment tool used in criminal sentencing in several US states. Defendants were given risk scores (1–10) predicting likelihood of re-arrest; these scores influenced bail, sentencing, and parole decisions.

ProPublica’s analysis found that COMPAS was more likely to falsely label Black defendants as high-risk (false positives) and more likely to falsely label White defendants as low-risk (false negatives). The false positive rate for Black defendants was approximately twice that for White defendants.
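The two error types in this comparison are computed over different populations: the false positive rate is taken over people who did not reoffend, and the false negative rate over people who did. The sketch below shows the calculation for two groups. The counts are invented for illustration and are not ProPublica’s data, and the helper names `error_rates` and `expand` are hypothetical.

```python
def error_rates(outcomes):
    """Compute false positive and false negative rates.

    `outcomes` is an iterable of (predicted_high_risk, reoffended) booleans.
    FPR = FP / (FP + TN): share of non-reoffenders labeled high-risk.
    FNR = FN / (FN + TP): share of reoffenders labeled low-risk.
    """
    tp = fp = tn = fn = 0
    for predicted_high_risk, reoffended in outcomes:
        if reoffended:
            if predicted_high_risk:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_high_risk:
                fp += 1
            else:
                tn += 1
    return {"fpr": fp / (fp + tn), "fnr": fn / (fn + tp)}


def expand(tp, fp, tn, fn):
    """Turn confusion-matrix counts into a list of individual outcomes."""
    return (
        [(True, True)] * tp
        + [(True, False)] * fp
        + [(False, False)] * tn
        + [(False, True)] * fn
    )


# Invented counts for two groups, for illustration only.
group_a = error_rates(expand(tp=70, fp=40, tn=60, fn=30))
group_b = error_rates(expand(tp=50, fp=20, tn=80, fn=50))
fpr_ratio = group_a["fpr"] / group_b["fpr"]

print("group A:", group_a)
print("group B:", group_b)
print("FPR ratio A/B:", round(fpr_ratio, 2))
```

With these invented counts, group A has twice the false positive rate of group B (0.4 versus 0.2) and a lower false negative rate (0.3 versus 0.5), the same qualitative pattern the investigation reported.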

The COMPAS controversy produced an important technical dispute about what “fairness” means:

Northpointe (COMPAS’s developer) responded that COMPAS was “calibrated”: a risk score of 7 predicted approximately the same recidivism rate regardless of race. This is a different definition of fairness than ProPublica’s (equalized false positive rates). Researchers Chouldechova (2017) and Kleinberg et al. (2016) showed formally that these different fairness definitions cannot all be satisfied simultaneously when base rates differ across groups. If Black defendants are arrested at higher rates than White defendants (due to over-policing of Black communities, independent of actual crime rates), then a calibrated risk tool cannot also equalize both false positive and false negative rates across groups, except in the degenerate case of perfect prediction.
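The incompatibility follows from simple arithmetic on the confusion matrix. For a thresholded score, with base rate p, positive predictive value PPV, and false negative rate FNR, the false positive rate is fixed by the identity FPR = p / (1 - p) × (1 - PPV) / PPV × (1 - FNR), which is the relationship Chouldechova’s argument rests on. The sketch below evaluates it for two hypothetical groups. Equal PPV is used here as the binary stand-in for calibration, which is a simplifying assumption, and the function name `implied_fpr` and all numbers are illustrative.

```python
import math


def implied_fpr(base_rate, ppv, fnr):
    """False positive rate implied by base rate, PPV, and FNR.

    Derivation: TP = (1 - fnr) * base_rate and FP = TP * (1 - ppv) / ppv
    (both as fractions of the population), so
    FPR = FP / (1 - base_rate).
    """
    if not (0 < base_rate < 1 and 0 < ppv <= 1 and 0 <= fnr <= 1):
        raise ValueError("base_rate must be in (0, 1), ppv in (0, 1], fnr in [0, 1]")
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical setting: the tool has the same PPV and the same FNR in both
# groups, but the groups differ in their measured base rate of re-arrest.
ppv, fnr = 0.6, 0.3
fpr_high_base = implied_fpr(0.5, ppv, fnr)
fpr_low_base = implied_fpr(0.3, ppv, fnr)

print(f"base rate 0.5 -> FPR {fpr_high_base:.3f}")
print(f"base rate 0.3 -> FPR {fpr_low_base:.3f}")
```

Holding PPV and FNR equal, the group with the higher base rate ends up with a false positive rate of about 0.47 against 0.20 for the other group. The only ways to close that gap are to let PPV or FNR differ between groups, which is the trade-off the formal results describe.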

The COMPAS debate revealed that “fairness” is not a single technical criterion but a family of conflicting mathematical definitions that reflect different normative choices. Choosing which fairness criterion to optimize is a political and ethical decision that technical systems cannot make themselves.

Timnit Gebru, Emily Bender, and “Stochastic Parrots”

Timnit Gebru was co-lead of Google’s Ethical AI team. In December 2020, she was fired — or, as Google characterized it, resigned — under circumstances that became one of the highest-profile controversies in the history of AI research.

The immediate cause was a paper: “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” co-authored by Gebru, Emily Bender (University of Washington), Angelina McMillan-Major, and Shmargaret Shmitchell. The paper argued that large language models (LLMs) had significant costs and risks that were understated in the research community:

Environmental cost: Training LLMs required enormous computational resources with associated carbon emissions.
Training data: LLMs trained on internet data encode the biases, stereotypes, and hateful content present in that data.
Stochastic parrots: LLMs produce text that is statistically coherent with training data but without genuine comprehension — a risk because the fluency of generated text may be mistaken for correctness or understanding.

Google asked Gebru to either remove her name from the paper or withdraw it. When she asked for an explanation of the specific concerns, she was fired. Google’s stated reason was that the paper had not gone through internal review processes; Gebru and supporters argued that the firing was retaliation for research that criticized Google’s core business.

The episode catalyzed organizing within the AI research community. Over 2,600 Google employees signed a letter of protest. Gebru founded the Distributed AI Research Institute (DAIR) in 2021. Margaret Mitchell, the other co-lead of Google’s Ethical AI team, who appeared on the paper under the pseudonym “Shmargaret Shmitchell,” was fired in February 2021 after, according to Google, sending documents related to Gebru’s case to outside parties. Google restructured its AI ethics research under closer management.

The “Stochastic Parrots” paper itself was highly cited and influenced subsequent LLM safety and ethics research. “Stochastic parrots” became standard terminology for critiques of LLM fluency without understanding.

The Structural Problem: Corporate AI Ethics

Gebru’s case illustrated a broader structural problem in AI ethics: the tension between research that identifies harms in corporate AI systems and the commercial interests of corporations that fund that research.

Internal AI ethics teams at major tech companies — Google, Facebook/Meta, Microsoft — were created in response to public and regulatory pressure. They produced valuable research and contributed to policy debates. They also operated within companies whose core products were the AI systems being studied, creating pressure — sometimes explicit, sometimes through incentives and organizational structure — to limit research that might damage commercial products or attract regulation.

Researchers who published findings critical of their employers’ products risked retaliation. Researchers who found no problems faced questions about objectivity. The institutional structure of corporate AI ethics created inherent conflicts between research independence and corporate employment.

Independent AI ethics organizations — DAIR, the Algorithmic Justice League (Buolamwini’s organization), the AI Now Institute, the Partnership on AI — provided alternative institutional homes. These organizations were less financially constrained by corporate interests but had less access to proprietary systems and data.

Fairness, Accountability, and Transparency (FAccT)

The academic community organized around these problems through the FAccT conference (Fairness, Accountability, and Transparency, formerly FAT*), which became a major venue for technical and social science research on AI bias and accountability. FAccT papers established empirical results about algorithmic discrimination in hiring, lending, advertising, and criminal justice; developed technical methods for measuring and mitigating bias; and produced theoretical frameworks for thinking about when algorithmic decision-making is appropriate.

The policy outputs of this research community contributed to the EU AI Act (2024), which classifies certain AI applications as high-risk and requires bias testing and documentation; the NIST AI Risk Management Framework (2023); and proposed regulations in several US states and cities targeting specific applications (facial recognition, tenant screening, hiring algorithms).