Yoshua Bengio and the Montreal School
Summary
Yoshua Bengio spent fifteen years publishing deep learning research that almost nobody cited, in a city not known as a technology hub, at a university not associated with AI. He kept working because he thought the approach was right and the questions were important. The 2012 ImageNet revolution proved him correct, and the decade that followed made him one of the most cited computer scientists in history. He co-founded MILA — which grew into the world’s largest academic AI research institute — transformed Montreal into a global AI center, and shared the 2018 Turing Award with Hinton and LeCun. Then ChatGPT arrived, and one of the principal architects of modern AI became one of the field’s most serious institutional critics.
Montreal, and the Freedom of Peripheral Distance
Yoshua Bengio was born on March 5, 1964, in Paris, France, to a Moroccan Jewish family. He grew up in Montreal and earned his bachelor's, master's, and doctoral degrees at McGill University, completing his PhD in 1991. After postdoctoral work at MIT and at AT&T Bell Labs in New Jersey, he returned to Montreal and accepted a faculty position at the Université de Montréal in 1993. He has been there ever since, a career choice that was in some respects accidental and in others determinative.
The peripheral geography mattered. Montreal in the 1990s was a good research university town, a culturally distinctive city, and emphatically not Silicon Valley or Boston or even Toronto. The pressure to align with fashionable research directions — pressure that, at MIT or Stanford, arrived in the form of funding cycles, hiring competition, and student preferences — was lighter. Bengio could work on what he thought was important, which meant neural networks at a time when the field had moved on to other approaches.
His early collaborations were with the small group that was keeping connectionist research alive: Geoffrey Hinton in Toronto, Yann LeCun at Bell Labs, and several others scattered across a few institutions. They were sometimes dismissed by the mainstream of the field, sometimes simply ignored, and occasionally aware that they were making progress the mainstream had not yet recognized. Bengio’s contribution to this community was partly technical and partly institutional: he was building something in Montreal that would outlast any individual result.
Recurrent Networks and the Problem of Long-Term Dependencies
Bengio’s early research focused on recurrent neural networks (RNNs), networks whose hidden state at each time step feeds into the next, allowing them to process sequences. RNNs were the natural architecture for language, speech, and time series, but they had a fundamental problem: the vanishing gradient.
When training an RNN with backpropagation through time, the gradient signal had to travel backward through many time steps. In practice, it either vanished — becoming too small to meaningfully update weights far in the past — or exploded — growing uncontrollably. This made it effectively impossible for RNNs to learn relationships between events separated by many time steps, severely limiting their usefulness for natural language where long-distance dependencies were common.
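A minimal sketch of why this happens, assuming the simplest possible case: a scalar linear recurrence h_t = w * h_{t-1} with no input and no nonlinearity. The function name `gradient_through_time` and the chosen weights are illustrative assumptions and are not taken from the 1994 paper. By the chain rule, the gradient of the final state with respect to the initial state picks up one factor of w per time step, so it shrinks or grows geometrically.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        # Chain rule: each step back in time multiplies the gradient by w.
        grad *= w
    return grad


if __name__ == "__main__":
    # Hypothetical recurrent weights just below, at, and just above 1.
    for w in (0.9, 1.0, 1.1):
        value = gradient_through_time(w, 100)
        print(f"w={w}: gradient after 100 steps = {value:.3e}")
```

With a weight slightly below 1 the gradient after 100 steps is tiny, and with a weight slightly above 1 it is very large. Real RNNs use weight matrices and nonlinearities, where the per-step Jacobians play the role that w plays here.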
In 1994, Bengio, Simard, and Frasconi published a careful theoretical analysis of this problem, “Learning Long-Term Dependencies with Gradient Descent is Difficult.” The paper was precise about why the problem occurred and pessimistic about gradient-based solutions. It helped motivate Sepp Hochreiter and Jürgen Schmidhuber’s development of Long Short-Term Memory (LSTM) networks in 1997. LSTM used gating mechanisms to control what a persistent memory cell stored and exposed (an explicit forget gate was added in later work), providing a structural solution to the gradient problem rather than an optimization one.
The 1994 paper is characteristic of Bengio’s approach: careful theoretical analysis of why something is hard, even when the finding is primarily negative. His contributions to deep learning theory, aimed at understanding why it works and not just demonstrating that it does, would mark his research program throughout.
The Neural Language Model: Word Embeddings Before Word Embeddings
In 2003, Bengio, Ducharme, Vincent, and Jauvin published “A Neural Probabilistic Language Model” in the Journal of Machine Learning Research. It is one of the papers most frequently described as “ahead of its time,” a characterization that is accurate but incomplete. It was not just early; it was foundational in ways that became clear only when the approaches it introduced became standard.
The paper trained a feedforward neural network to predict the next word in a sequence given the previous words. This was a language modeling task — the same task that n-gram models, then the standard approach in natural language processing, performed by counting word sequences in large corpora. The neural approach outperformed n-gram models on perplexity benchmarks, but the improvement was modest and the computational cost was much higher. The field did not immediately convert.
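The prediction step described above can be sketched roughly as follows. This is a simplified illustration and not the paper's exact architecture: it omits bias terms and training, uses random untrained weights, and the names (`init_model`, `next_word_probs`, `perplexity`), the five-word vocabulary, and the layer sizes are all hypothetical. It shows the basic flow: look up an embedding for each context word, concatenate them, pass the result through a tanh hidden layer, and apply a softmax over the vocabulary.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_model(vocab_size, embed_dim, context, hidden_dim, seed=0):
    """Create small random weights (an untrained model, for illustration only)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "C": rand_matrix(vocab_size, embed_dim),            # one embedding row per word
        "H": rand_matrix(hidden_dim, context * embed_dim),  # concatenated input -> hidden
        "U": rand_matrix(vocab_size, hidden_dim),           # hidden -> vocabulary scores
    }


def next_word_probs(model, context_ids):
    """Return a probability for every vocabulary word given the context word ids."""
    x = [value for i in context_ids for value in model["C"][i]]  # concatenate embeddings
    hidden = [math.tanh(a) for a in matvec(model["H"], x)]
    return softmax(matvec(model["U"], hidden))


def perplexity(model, token_ids, context):
    """Exponentiated average negative log-probability of each predicted token."""
    log_prob, count = 0.0, 0
    for t in range(context, len(token_ids)):
        probs = next_word_probs(model, token_ids[t - context:t])
        log_prob += math.log(probs[token_ids[t]])
        count += 1
    return math.exp(-log_prob / count)


vocab = ["the", "cat", "sat", "on", "mat"]
model = init_model(len(vocab), embed_dim=4, context=2, hidden_dim=8)
probs = next_word_probs(model, [vocab.index("the"), vocab.index("cat")])
sentence = [vocab.index(w) for w in ["the", "cat", "sat", "on", "the", "mat"]]
ppl = perplexity(model, sentence, context=2)
```

Because the weights are untrained, the predicted distribution is close to uniform and the perplexity sits near the vocabulary size. Training would adjust the embedding table and the two weight matrices together to push perplexity down.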
What the paper introduced, almost as a side effect of the architecture, was a learned distributed representation of words: each word in the vocabulary was mapped to a dense vector, a point in a continuous high-dimensional space. Two words that appeared in similar contexts would, after training, have similar vector representations, so the geometry of the space encoded semantic relationships. Words for animals tended to cluster together, as did words for countries. Later work on embeddings trained at much larger scale made this geometry famous through analogy arithmetic, in which the vector for “king” minus the vector for “man” plus the vector for “woman” lands close to the vector for “queen.”
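The idea that proximity in vector space tracks relatedness can be illustrated with a toy example. The vectors below are hand-made for illustration, not learned embeddings, and the meaning assigned to each dimension is an assumption. Real embeddings have hundreds of dimensions with no such clean interpretation, and their analogies hold only approximately.

```python
import math

# Hand-made toy vectors, NOT learned embeddings. The four dimensions are
# assumed to mean (royalty, male, female, fruit) purely for illustration.
EMBEDDINGS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the vocabulary word whose vector is most similar to `vector`."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, EMBEDDINGS[w]))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    return nearest(target, exclude=(a, b, c))


if __name__ == "__main__":
    print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
    print(cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"]))
    print(analogy("king", "man", "woman"))
```

Excluding the input words from the search is a common convention in analogy evaluation. Without it, the nearest vector is often one of the inputs themselves.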
Info
This concept, word embeddings, became one of the most important technical ideas in natural language processing. Word2Vec (Mikolov et al., 2013) popularized the approach with an efficient training algorithm, and GloVe (Pennington et al., 2014) provided a theoretically motivated alternative. But the core idea, that words could be learned as points in a continuous geometric space where proximity corresponded to semantic relatedness, was demonstrated in a working neural language model in Bengio’s 2003 paper, building on earlier connectionist ideas about distributed representations. The token embeddings used in Transformer-based language models, including GPT-4 and Claude, descend from this idea.
The 2003 paper was highly cited within the neural network community but had limited immediate impact on the broader NLP field. Statistical NLP, with its phrase-based models and explicit feature engineering, remained dominant through the 2000s. The paper was one of several Bengio contributions that preceded their own impact by roughly a decade.
MILA and the Montreal Ecosystem
In 1993, Bengio founded what would eventually become MILA — the Montreal Institute for Learning Algorithms. It began as a small research group at the Université de Montréal, focused on machine learning theory and applications. It grew, slowly at first and then rapidly after the 2012 ImageNet revolution (covered in ImageNet and the Deep Learning Revolution), into the world’s largest academic AI research institute.
By 2025, MILA had more than 900 researchers (faculty, postdocs, and graduate students) working across AI research from theoretical foundations to applied systems. Its output, measured by publications, citations, and graduates employed in industry and academia, was comparable to or greater than that of the AI groups at MIT, Stanford, and Berkeley.
The institute’s scale made Montreal a genuinely different kind of AI hub than San Francisco or London. Unlike Silicon Valley’s concentration of commercial labs, Montreal’s density was academic: researchers oriented toward publication, collaboration, and long-term questions rather than product cycles. This attracted a particular kind of talent — people who wanted to do fundamental research in a city with low cost of living, a strong French-Canadian cultural identity, and proximity to a large academic community.
Bengio actively shaped the Canadian government’s investment in this ecosystem. He was a central figure in the Pan-Canadian AI Strategy, announced in 2017, which invested CAD $125 million in three national AI institutes: MILA in Montreal, the Vector Institute in Toronto, and Amii in Edmonton. The strategy was explicitly modeled on the insight that geographic concentration of research talent had been decisive in Silicon Valley and London, and that Canada could compete by building similar concentrations in specific cities.
Element AI, co-founded in 2016 by Bengio, the entrepreneur Jean-François Gagné, and others, became one of the most prominent AI startups in Canada before ServiceNow announced its acquisition in 2020. Major technology companies, including Google, Microsoft, DeepMind, Meta, and Samsung, opened Montreal research offices to access the talent MILA produced. The city that had been peripheral to AI research became, in a period of about ten years, one of the four or five most important AI research locations in the world.
Attention Mechanisms: The Ingredient the Transformer Needed
Bengio’s lab also made the contribution that most directly connects his work to contemporary AI systems. In 2014, Dzmitry Bahdanau, a graduate student at MILA, introduced an attention mechanism for neural machine translation in a paper co-authored with Kyunghyun Cho and Bengio: “Neural Machine Translation by Jointly Learning to Align and Translate.”
The problem the paper addressed was specific. Sequence-to-sequence models for machine translation at the time compressed the entire source sentence into a fixed-size vector before generating the translation. This was a bottleneck: long sentences produced poor translations because the fixed vector couldn’t capture everything relevant. Bahdanau’s solution was to allow the decoder to attend to different parts of the source sentence at each step of generating the translation — computing a weighted average over the encoder’s output, where the weights expressed which source positions were most relevant to the current translation step.
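The weighted-average step can be sketched in a few lines. This is an illustration under stated assumptions: the scoring function uses the additive form score_j = v · tanh(W s + U h_j), where s is the decoder state and h_j an encoder output, but the parameter names `W`, `U`, `v`, the dimensions, and the random untrained values are hypothetical. The sketch leaves out the recurrent encoder and decoder entirely.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, encoder_states, W, U, v):
    """Return (attention weights, context vector) for one decoding step."""
    projected_state = matvec(W, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(U, h)
        # One relevance score per source position.
        scores.append(
            sum(vk * math.tanh(a + b) for vk, a, b in zip(v, projected_state, projected_h))
        )
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted average of the encoder outputs.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context


# Hypothetical sizes and random (untrained) parameters, for illustration only.
rng = random.Random(0)
ENC_DIM, DEC_DIM, ATT_DIM, SOURCE_LEN = 3, 2, 4, 5


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


encoder_states = rand_matrix(SOURCE_LEN, ENC_DIM)  # one vector per source word
decoder_state = [rng.uniform(-1, 1) for _ in range(DEC_DIM)]
W = rand_matrix(ATT_DIM, DEC_DIM)
U = rand_matrix(ATT_DIM, ENC_DIM)
v = [rng.uniform(-1, 1) for _ in range(ATT_DIM)]

weights, context = additive_attention(decoder_state, encoder_states, W, U, v)
```

The decoder would recompute `weights` at every output step, so each target word gets its own context vector instead of sharing one fixed summary of the source sentence.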
This worked dramatically better, especially for long sentences. More importantly, the attention weights could be visualized, showing which source words the model was “looking at” when producing each target word. The model was doing something interpretable.
Info
The Bahdanau attention mechanism was the direct precursor to the 2017 “Attention Is All You Need” paper from Google (Vaswani, Shazeer, Parmar et al.), which introduced the Transformer architecture by removing the recurrent structure entirely and building the entire network around attention. Modern large language models — every GPT model, Claude, Gemini, LLaMA — use Transformer architectures built on the attention mechanism that Bengio’s lab first demonstrated. The intellectual chain from Bengio’s 2003 word embeddings through the 2014 attention paper to the 2017 Transformer is direct.
Three Godfathers and the Turing Award
By 2015, the narrative of deep learning’s history had consolidated around three names: Hinton, LeCun, and Bengio. They were called the “Godfathers of Deep Learning” — a label that captured the genuine intellectual debt the field owed to researchers who had maintained their research program through years when it was neither well-funded nor well-respected.
The three were not a team in the conventional sense. They had different research styles and institutional affiliations, and they increasingly worked in different places: Hinton in Toronto (and at Google), LeCun in New York (and at Facebook), Bengio in Montreal. Their intellectual contributions were complementary rather than overlapping: Hinton drove work on backpropagation and deep belief networks, LeCun built convolutional architectures for vision, and Bengio contributed language modeling, attention, and theoretical foundations.
The ACM named all three recipients of the 2018 Turing Award, widely regarded as the highest recognition in computer science, for “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.” The award arrived at the peak of the deep learning boom, when the techniques the three had developed were transforming every major technology company’s product strategy.
The Safety Turn
The arrival of ChatGPT in November 2022, and the subsequent deployment of increasingly capable AI systems, produced a significant change in Bengio’s public role. He had been an active researcher with limited interest in policy; he became, within months, one of the most prominent AI safety advocates in the scientific community.
He was among the first signatories of the “Pause Giant AI Experiments” open letter in March 2023, which called for a six-month moratorium on training AI systems more powerful than GPT-4. He testified before the US Senate and the Canadian Parliament. He attended and helped shape the AI Safety Summit at Bletchley Park in November 2023, one of the first major international gatherings focused on AI governance. He became a regular participant in United Nations discussions of AI risk.
His safety concerns were specific and distinguished from the existential risk framings more common in the effective altruist community. Bengio worried less about autonomous AI pursuing misaligned goals and more about near-term misuse: the use of AI systems for disinformation at scale, for accelerating biological weapons development, for concentrating economic and political power in corporations without adequate democratic oversight. He also worried about AI systems being trained to be persuasive or deceptive in ways that were not immediately apparent.
Warning
Bengio’s position represents a particular kind of scientist’s dilemma: he helped build the tools that created the risks he now warns about, and he does not think the solution is to stop building. He has argued that the right response is engagement — working with governments and international bodies to establish governance frameworks, pushing for interpretability research to understand what AI systems are doing, and supporting the development of safety techniques alongside capability development. This position is neither pure optimism nor pure alarm, and it has been criticized from both directions: by those who think the risks are overstated and by those who think continued development is irresponsible regardless of governance.
Unlike Hinton, who left Google to speak freely, Bengio remained at MILA and the Université de Montréal. His argument was that the right response to AI risk was not disengagement from research but increased engagement with the governance and safety questions that commercial AI development was largely bypassing. The Montreal school, in this framing, was not just a technical research institution but an institution with obligations to the public whose lives would be shaped by the technology it had helped create.
Whether this obligation can be meaningfully discharged while continuing to train and publish AI systems is one of the unresolved tensions of his current position. Bengio has not resolved it. He has said, publicly, that he is uncertain about the right course of action — which is, perhaps, the honest answer.
📚 Sources
- Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C.: A Neural Probabilistic Language Model, JMLR 3, 2003
- Bengio, Y., Simard, P. and Frasconi, P.: Learning Long-Term Dependencies with Gradient Descent is Difficult, IEEE Transactions on Neural Networks 5(2), 1994
- Bahdanau, D., Cho, K. and Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015
- Bengio, Y., Courville, A. and Vincent, P.: Representation Learning: A Review and New Perspectives, IEEE TPAMI 35(8), 2013
- ACM Turing Award 2018 — LeCun, Hinton, Bengio citation
- MILA — Quebec Artificial Intelligence Institute
- Bengio, Y.: How Rogue AIs May Arise, Yoshua Bengio personal blog, May 2023