Florence, Italy. October 2012. The International Conference on Computer Vision — ICCV — is one of the three most prestigious venues in the field, the place where the best computer vision researchers come to present their work and learn from each other. In the afternoons between sessions, researchers gather in hotel lobbies and conference corridors and talk about results.

This year they are talking about one result in particular. The ImageNet Large Scale Visual Recognition Challenge — the ILSVRC — has just released its 2012 results. The challenge, which has been running since 2010, measures how accurately computer vision algorithms can classify images from the ImageNet dataset into one of a thousand categories. The best systems in 2010 and 2011 had achieved top-5 error rates of around 26%. In top-5 accuracy, an algorithm is considered correct if the right answer is anywhere in its top five predictions.

The best 2012 system, submitted by a team from the University of Toronto called SuperVision, had achieved a top-5 error rate of 15.3%.

In a field where annual improvements were typically measured in fractions of a percentage point, a 10-point improvement in a single year is — there is no other word for it — impossible. And yet the results are there, independently verified, on the official ILSVRC leaderboard.

The researchers in the Florence hotel lobbies are not celebrating. They are trying to understand what they are seeing. Some of them are defensive. They have spent careers developing the hand-crafted feature representations — SIFT descriptors, Histogram of Oriented Gradients — that have defined computer vision for the past decade. That work is being made obsolete in real time, and they are watching it happen.

Others are genuinely excited, recognising that what they are seeing is a genuine scientific breakthrough — something that will change everything — even if they are not entirely sure yet what it means or where it leads.

What they are seeing is AlexNet. And it is the beginning of the modern AI era.


The Team: Three People and Two GPUs

The SuperVision team that submitted AlexNet to the 2012 ILSVRC consisted of three people: Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.

Krizhevsky was a PhD student in Hinton’s group at the University of Toronto. He had come to Toronto as an undergraduate and had worked with Hinton since his bachelor’s degree. His specific genius was systems implementation — an extraordinary ability to turn theoretical ideas into working computational systems that ran efficiently on available hardware. He was the person who would take the convolutional network architecture and make it run fast enough, on the hardware available, to train on the full The ImageNet Project dataset.

Sutskever was also a PhD student in Hinton’s group, though at a more advanced stage. He had co-authored with Hinton on several papers about restricted Boltzmann machines and deep belief networks. His specific strengths were in the theory of learning algorithms and in the development of training techniques that made deep networks easier to train. He would contribute to the specific training choices — the learning rate schedule, the momentum parameters, the dropout regularisation — that made AlexNet’s training stable and effective.

Hinton was the supervisor — the person who had spent thirty years working toward this moment, who understood better than either of his students what the system was doing and what it meant, who had the intellectual authority to make the case for the approach to a sceptical field, and who had the social capital that would eventually allow the breakthrough to be recognised for what it was.

The three of them worked in close physical proximity through the spring and summer of 2012, in the computational facilities at the University of Toronto and at Hinton’s home, training a network that required weeks of continuous GPU computation.


The Hardware: Two Gaming GPUs Change History

One of the most striking facts about AlexNet’s development is the hardware it ran on. The system that would change the history of AI was trained on two NVIDIA GTX 580 GPUs — the same graphics processing units that gamers were buying to play video games in their living rooms.

NVIDIA’s GTX 580 had been designed to render the complex 3D graphics of video games. It had 512 processing cores, each capable of performing mathematical operations in parallel. For game graphics, these cores performed matrix multiplications on polygon coordinates and texture maps. For neural network training, they performed essentially the same operations on weight matrices and activation values.

The insight that GPU hardware was well-suited to neural network computation had been noted by several researchers in the years before AlexNet. GPUs were designed for exactly the kind of massively parallel floating-point computation that neural network training required. NVIDIA had released CUDA in 2007 — a programming interface that allowed general-purpose computing on their GPU hardware — and researchers had been exploring its use for neural network training since then.

Krizhevsky took this possibility further than anyone had previously. He wrote cuda-convnet — a highly optimised CUDA implementation of convolutional neural network training that was dramatically faster than CPU implementations and faster than previous GPU implementations. The optimisation work was itself a substantial engineering achievement, requiring deep understanding of GPU architecture, memory access patterns, and the specific computational structure of convolutional operations.

The two GTX 580 GPUs that Krizhevsky used represented a specific limitation he had to work around. Each GPU had 3 gigabytes of video memory — enough to store a significant portion of the network but not the entire network. He divided the network across two GPUs, with specific communication between them at certain layers. This two-GPU implementation was both a constraint — it required careful management of which computations happened on which GPU — and an advantage, since the two GPUs together had more memory and computing power than a single GPU would have.

The total cost of the hardware — the two GTX 580 GPUs — was roughly $2,000 at 2012 prices. The most powerful computer vision system in the world was trained on consumer hardware that cost less than most people’s laptops.


The Architecture: Eight Layers That Learned to See

AlexNet’s architecture — the specific arrangement of layers, the connection patterns, the non-linear activation functions — was not invented from scratch. It built on LeCun’s convolutional network work, incorporated specific innovations that Hinton’s group had been developing, and added specific components that Krizhevsky had found to work well in preliminary experiments.

The network had eight learned layers: five convolutional layers followed by three fully connected layers. This was deep by the standards of the time — most successful visual recognition systems had been much shallower — but not as deep as the ResNets and other architectures that would follow in subsequent years.

The convolutional layers were the heart of the architecture. Each convolutional layer learned a set of filters — patterns of pixel relationships that the layer was trained to detect. At the first layer, these filters learned to detect basic visual features: oriented edges, blobs of colour, high-contrast boundaries. At subsequent layers, the filters combined these basic features into more complex patterns: corners, curves, textures, parts of objects. At the final convolutional layer, the filters responded to abstract visual patterns — wheel-like structures, face-like arrangements of features, stripe patterns — that were combined by the fully connected layers into object identities.

The learning was the remarkable thing. These filters were not designed by human engineers or computer vision researchers. They were learned from data — adjusted by Backpropagation Goes Mainstream to minimise the error rate on ImageNet classification. The specific filters that emerged from training were determined entirely by the statistics of the training data and the architectural constraints of the network.

When researchers visualised the filters that AlexNet had learned at the first layer, they were immediately recognisable as similar to the features that Hubel and Wiesel had found in the primary visual cortex of cats and monkeys: oriented edge detectors at different angles, colour-selective cells, blobs. The network had independently discovered, through learning on photographic data, the same basic visual features that evolution had built into biological visual systems. The convergence between learned and biological visual features was striking.

ReLU activation. AlexNet used rectified linear unit (ReLU) activations — the function f(x) = max(0, x) — rather than the sigmoid or tanh activations that had previously been standard in neural networks. ReLUs trained dramatically faster than sigmoid activations because they did not saturate: sigmoid activations produced gradients near zero for strongly positive or negative inputs (the vanishing gradient problem in a different form), while ReLUs produced gradients of either zero (for negative inputs) or one (for positive inputs). The use of ReLUs was one of the specific innovations that made training a deep network on ImageNet feasible in weeks rather than months.

Dropout regularisation. One of the most important innovations in AlexNet was the use of dropout regularisation. Dropout, developed by Hinton and his students (including Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov), worked by randomly setting half of the hidden unit activations to zero during training. This prevented individual neurons from co-adapting — from developing complementary specialisations that worked together on the training data but failed to generalise to new data. Dropout forced each neuron to develop representations that were useful in a wide range of network configurations, improving generalisation.

Dropout was a specific response to the overfitting problem that had limited deep network training. With millions of parameters and a training dataset of only one million images, AlexNet was at risk of memorising the training data rather than learning generalisable representations. Dropout provided a practical and effective regularisation strategy that significantly improved generalisation performance.

Data augmentation. AlexNet was trained with extensive data augmentation — the generation of modified training examples through transformations like cropping, flipping, and colour jittering. Each training image was presented to the network in dozens of different forms, dramatically increasing the effective size of the training dataset. Data augmentation is now a standard technique in deep learning, and AlexNet was one of its earliest large-scale applications.

Local response normalisation. AlexNet used local response normalisation — a form of lateral inhibition inspired by biological neural systems — between certain layers. This technique, which reduced the activation of neurons in regions where many neurons were strongly activated, provided a specific form of regularisation that improved generalisation on the ImageNet task.


The Training: Five Days That Changed Computer Vision

Training AlexNet on ImageNet took approximately five to six days on the two GTX 580 GPUs. This was fast by the standards of the time — a CPU implementation would have taken months — but it was still a significant period of continuous computation that required careful monitoring.

The training procedure involved exposing the network to the approximately 1.2 million training images from the ImageNet ILSVRC dataset repeatedly — each pass through the full dataset was one “epoch,” and AlexNet was trained for approximately 90 epochs. At each training step, a batch of images was presented to the network, the forward pass computed the network’s predictions, the error was computed against the true labels, and the backward pass computed the gradients and updated the weights.

The learning rate — the step size of the weight updates — was one of the most critical hyperparameters. Too large a learning rate caused the training to diverge; too small a learning rate made training unnecessarily slow. Krizhevsky and Sutskever used a specific learning rate schedule — starting with a larger rate and gradually reducing it as training progressed — that had been found to work well in their preliminary experiments.

The training was not without difficulties. The two-GPU implementation required careful coordination — certain layers ran on one GPU, others on both, and the communication between GPUs had to be managed efficiently. The memory constraints of the GPUs required careful management of batch sizes and network dimensions. And the sheer duration of the training meant that the researchers had to monitor the process over days, watching the training loss decrease and the validation accuracy increase, ready to intervene if something went wrong.

The monitoring was itself an activity with specific content. Watching a deep network learn is not like watching a traditional program run. The validation accuracy — the fraction of validation images that the network correctly classified — improved gradually and non-monotonically: it generally increased over the course of training, but with fluctuations from batch to batch and epoch to epoch. Krizhevsky, Sutskever, and Hinton watched these curves, checking that the training was proceeding in the expected direction, adjusting hyperparameters if the results suggested problems, and gradually developing confidence that the system they were building would be competitive.

They did not know, during training, how competitive it would be. They had good reason to expect a strong result — the combination of deep architecture, GPU computing, and ImageNet data was everything they had been arguing would work. But the specific magnitude of the improvement — 10 percentage points better than the best previous system — was not something they could have predicted with confidence. Even the most optimistic expectations were exceeded.


The Submission and the Results

The 2012 ILSVRC accepted submissions through September 30, 2012. The SuperVision team submitted their system before the deadline, without knowing exactly how it compared to the other submissions.

The evaluation was conducted by Fei-Fei Li‘s group on a held-out test set that no team had access to during development. Each submitted system was evaluated on 100,000 test images, and the top-5 error rate was computed. The results were announced in October 2012.

When the results came in, the reaction was immediate. The second-place entry, from a team using the standard SIFT-based approach that had been the field’s most effective method, had achieved an error rate of approximately 26% — consistent with the previous year’s best results. SuperVision had achieved 15.3%.

The magnitude of the difference was not immediately comprehensible. In a field where annual improvements were measured in fractions of a percentage point, a 10-point improvement in a single year looked like an error. Researchers who saw the results initially checked whether they had misread the numbers. They had not.

The specific structure of the top-5 error metric made the result even more dramatic. Top-5 accuracy means the system is evaluated on whether the correct label is among its five highest-probability predictions. Achieving 15.3% top-5 error means that 84.7% of the time, the system’s top five guesses included the correct category. For a 1000-category classification problem, this was a level of accuracy that had seemed years away.

The reaction at the Florence ICCV conference a few weeks later was the beginning of a conversation that has not stopped since. Computer vision researchers who had spent their careers developing hand-crafted feature representations — SIFT, HOG, and their combinations — understood immediately that their approaches had been superseded. The question was not whether to shift to deep learning but how fast to make the transition.


The Immediate Aftermath: A Field Reconverts

The speed with which the computer vision community shifted toward deep learning after AlexNet was striking. Within the year following the 2012 ILSVRC results, the majority of competitive computer vision research groups were working on deep learning approaches.

The shift was driven by the clarity of the evidence. AlexNet’s improvement over the previous year’s best system was large enough that it could not be explained away as a consequence of overfitting to the specific test set, or as a result of unusual data augmentation, or as a fluke of the specific evaluation procedure. The improvement was genuine, it was large, and it came from an approach that was straightforwardly extendable to harder problems.

The specific impact on different subgroups within computer vision was interesting. Researchers who had been working on the theoretical foundations of SIFT and HOG — the gradient-based feature descriptors that had dominated the field — faced the most direct obsolescence. The specific features they had carefully designed were being replaced by learned features that performed better on the standard benchmarks.

Researchers who had been working on classification and recognition more broadly had a different experience: the approach that had superseded their specific methods was one they could adopt. The transition was painful but not existential. You learned to train convolutional networks, you applied the dropout and data augmentation techniques that AlexNet had demonstrated, and you competed on the same benchmarks with the new tools.

Researchers who had been working on related but distinct problems — object detection, semantic segmentation, image captioning — saw AlexNet as an enabling technology rather than a replacement. Deep convolutional network features could be applied to these problems too, and the years following AlexNet saw rapid advances in all of them as researchers adapted the AlexNet approach to their specific tasks.


The Technology Giants Respond: The Great AI Hiring

The AlexNet result reached the major technology companies almost simultaneously with the research community, and the response was rapid.

Google, Facebook, Microsoft, Baidu, and Twitter all recognised the significance of the result within weeks of the Florence announcement. Each company was already involved in AI research to varying degrees, but the AlexNet result made the strategic importance of deep learning concrete in a way that previous results had not.

The specific strategic significance was different for different companies.

For Google, whose search and advertising business depended heavily on understanding images and text, the ability to dramatically improve image recognition was directly commercially relevant. Google’s existing image recognition capabilities were limited, and AlexNet’s result suggested that deep learning could close the gap between machine and human visual performance faster than anyone had expected.

For Facebook, which was building social features around shared photos, deep learning for image recognition was directly relevant. The ability to automatically identify people, places, and objects in user-uploaded photos had obvious product applications.

For Microsoft, which was competing with Google in search and with Apple in mobile computing, deep learning represented a competitive imperative — a capability that competitors were going to develop and that Microsoft could not afford to lag behind in.

The response from these companies was primarily a hiring response. Within months of the AlexNet result, every major technology company was aggressively recruiting the neural network researchers who had been working in the margins of machine learning for the past two decades.

Geoffrey Hinton, Alex Krizhevsky, and Ilya Sutskever formed a company — DNNresearch — specifically to be acquired. They ran an auction among the interested companies, and Google won, paying approximately forty-four million dollars — an extraordinary sum for three researchers with no product and no revenue. Krizhevsky eventually left to pursue other opportunities. Sutskever would go on to co-found OpenAI. Hinton joined Google’s research division, where he worked for a decade before his resignation in 2023.

Yann LeCun was recruited by Facebook to lead its AI research division, leaving his long-standing position at NYU. Yoshua Bengio remained at Montreal but entered into close industrial partnerships that gave Mila access to computing resources and collaborative relationships with industry that transformed the scale of its research.

The hiring wave also swept up the next generation — the graduate students and postdoctoral researchers who had been working on deep learning approaches and who were suddenly in enormous demand. Students who had been uncertain about their career prospects, given that they had trained in an unfashionable research direction, found themselves with multiple attractive offers from the world’s most prominent technology companies.


The AlexNet Paper: Why It Was Written the Way It Was

The AlexNet paper — “ImageNet Classification with Deep Convolutional Neural Networks” by Krizhevsky, Sutskever, and Hinton — was published in the proceedings of the Neural Information Processing Systems (NIPS) conference in 2012. The paper is worth examining carefully, because the way it was written reflects both the confidence and the modesty with which the Toronto group understood what they had achieved.

The paper is concise — eight pages, plus references. It describes the architecture precisely and describes the specific design decisions — ReLU activations, dropout, local response normalisation, data augmentation, the two-GPU implementation — with clear explanations of why each decision was made and what effect it had on performance.

What the paper is careful about is the claims it makes. It does not claim to have solved computer vision. It does not claim that deep learning will definitively replace other approaches in all settings. It presents the specific result — a system that achieves 15.3% top-5 error rate on the ILSVRC 2012 test set — and provides a careful description of how that result was achieved.

The restraint is characteristic of Hinton’s approach to scientific communication, developed through decades of working in a marginal research tradition where overclaiming had damaged the field’s credibility. The AlexNet paper is confident — it knows the result is important — but it does not oversell. It presents the result and the methodology and allows the numbers to make the case.

The paper also includes something unusual for a machine learning paper of its era: visualisations of the features learned by the network’s first layer. The visualisations show the 96 filters learned by the first convolutional layer, which are recognisably similar to the oriented edge detectors and frequency-selective filters found in biological visual systems. This visualisation was not technically necessary for reporting the result — it could have been omitted without affecting the paper’s main claims. But it was included because it illustrated something important about what the network was learning and why the approach was connected to the biological understanding of visual processing.


The 2013–2016 ILSVRC: A Revolution in Real Time

The subsequent years of the ILSVRC competition provide a remarkable record of how rapidly deep learning capabilities advanced once the AlexNet breakthrough had established the paradigm.

2013. ZFNet, from Matthew Zeiler and Rob Fergus, improved on AlexNet by analysing what the network’s intermediate layers were learning — identifying which features triggered specific neurons and using this understanding to improve the architecture. ZFNet achieved 11.7% top-5 error, compared to AlexNet’s 15.3%.

2014. GoogLeNet (from Google) and VGGNet (from Oxford) both achieved top-5 error rates below 7%. GoogLeNet used an “inception module” that combined filters of different sizes in a single layer, allowing the network to capture visual patterns at multiple scales simultaneously. VGGNet demonstrated that a simple architecture of small convolutional filters, stacked to a depth of 16-19 layers, could achieve excellent performance, establishing the importance of depth as a design principle.

2015. ResNet, from a Microsoft Research team led by Kaiming He, achieved a top-5 error rate of 3.6% — lower than human performance on the specific task (approximately 5%). ResNet used residual connections — “skip connections” that allowed the gradient signal to bypass certain layers during backpropagation — to enable the training of networks with more than 100 layers. The residual connection was a direct solution to the vanishing gradient problem: by creating a path through which the gradient could flow without being multiplied by many layer weights, residual connections made deep networks trainable.

The progression from AlexNet’s 15.3% in 2012 to ResNet’s 3.6% in 2015 — in just three years — represented the fastest sustained improvement in performance on a standard benchmark in the history of AI. The improvement was driven by the combination of architectural innovation, the availability of the E13 dataset, and the GPU computing that made training large networks practical.

By 2016, the classification problem that ILSVRC measured was considered largely solved — the best systems had exceeded human performance, and the competition was discontinued in its traditional form. The research community’s attention shifted to harder problems: object detection, instance segmentation, image captioning, visual question answering — problems where the progress made on classification could be leveraged but where substantial challenges remained.


The Transfer Learning Revolution: AlexNet Features Everywhere

One of the most practically significant consequences of AlexNet was the discovery — made by several groups in the months following the 2012 result — that features learned by deep networks trained on ImageNet could be successfully transferred to other visual recognition tasks.

The transfer learning paradigm was straightforward: take a network trained on ImageNet, remove the final classification layer, and use the remaining layers as a feature extractor for a new task. Train a simple classifier — a linear classifier or a support vector machine — on the features produced by the ImageNet-trained network, using a much smaller labelled dataset for the new task.

The results were remarkable. ImageNet features transferred successfully to tasks involving objects that were not in ImageNet — medical images, satellite imagery, artistic images — suggesting that the representations learned on ImageNet captured visual features that were general and reusable, not specific to the specific objects in ImageNet.

This discovery changed the economics of computer vision AI. Previously, training a competitive visual recognition system for a new task required a large labelled dataset for that specific task — thousands or millions of labelled examples. With ImageNet pre-training, a competitive system could sometimes be built with hundreds or thousands of labelled examples, by fine-tuning the pre-trained features on the new task.

The medical imaging applications that had been promised by the The Rise of Expert Systems era in the 1980s but never delivered became feasible in the post-AlexNet era. Radiologists had always been sceptical that AI could read medical images at clinician level; the combination of ImageNet pre-training and fine-tuning on medical image datasets produced systems that demonstrated that scepticism might be wrong. Deep learning systems achieved clinician-level performance at identifying diabetic retinopathy from retinal photographs within a few years of AlexNet.


The Broader Significance: What AlexNet Actually Proved

The significance of AlexNet extends beyond the specific result on the specific benchmark. It was a demonstration of several principles that would shape AI development for the subsequent decade.

Scale matters. AlexNet demonstrated that scaling — using a larger network trained on a larger dataset with more computing power — produced dramatically better results than careful tuning of smaller networks on smaller datasets. This lesson — that scale was a productive research direction — was absorbed throughout the field and contributed to the scaling philosophy that has dominated large language model research.

Architecture matters, but not as much as training. AlexNet was not a fundamentally novel architecture — convolutional networks had been around since LeCun’s work in the late 1980s. What was new was the combination of specific training choices — ReLU activations, dropout, data augmentation — that made training a deep network on a large dataset stable and effective. The lesson was that architectural innovation and training methodology were both important, but training methodology had been underemphasised.

Learned features outperform hand-crafted features at scale. The decades-long tradition of developing hand-crafted visual features — SIFT, HOG, SURF — was demonstrated to be inferior to features learned from data at the scale of ImageNet. The lesson — that learning from data was more productive than manual feature engineering — had been argued for decades by neural network researchers and was finally demonstrated conclusively.

GPU computing transforms AI research. The demonstration that consumer-grade GPU hardware could train state-of-the-art AI systems changed the economics of AI research and AI development. Academic researchers could now run experiments that would previously have required access to supercomputing facilities. The democratisation of high-performance AI computing — imperfect and ongoing, but beginning — was initiated by AlexNet’s GPU-based training.


The People Remember

There is something specific about watershed moments in science — moments that are recognised as watershed moments while they are happening — that creates unusually clear memories. The people who were present remember where they were, what they were doing, who they were with.

The computer vision researchers who were at ICCV 2012 in Florence when the AlexNet results were announced remember the experience clearly. The conversations in hotel corridors. The initial disbelief at the magnitude of the result. The gradual recognition that something fundamental had changed. The mixed emotions — excitement for the field, anxiety about careers built on now-superseded approaches.

The graduate students at major computer vision groups who read the AlexNet paper in October 2012 remember the experience of reading it and understanding what it meant. Some of them changed their research directions based on that paper, pivoting from the approaches they had been working on to deep learning. Some of those pivots turned out to be the right career choices and defined the next decade of their research.

The deep learning researchers — the neural network underground who had been working toward something like this result for years — remember the validation. Not celebration exactly, but something more serious: the recognition that the thing they had believed was right was right, that the decades of marginalisation had been a consequence of being ahead of the world rather than behind it.

Geoffrey Hinton, in interviews conducted in the years after AlexNet, speaks about the result with the measured satisfaction of a person who had expected it to happen and is not surprised that it happened — but who had not been certain it would happen on this specific timeline. The result confirmed what he had believed. It also exceeded what he had specifically predicted. Both of these things were true.

Alex Krizhevsky, who did the core implementation work, is characteristically quiet in public about what the experience meant to him. He made the system that changed the world, and then he moved on to other things.


After AlexNet: The World It Made

The world that AlexNet made — the world in which deep learning is the dominant paradigm for AI research and for AI applications — looks very different from the world it came from.

The machine learning research community has been transformed. The venues that once published primarily statistical learning theory and Bayesian inference now publish primarily deep learning research. The researchers who once led the field through SVMs and graphical models have either retrained in deep learning or found themselves in a new minority position, making the same arguments for theoretical rigour that the neural network community once made from the other side.

The technology industry has been transformed. The major technology companies have invested billions of dollars in deep learning research and infrastructure. The systems that handle image recognition, speech recognition, machine translation, and content recommendation at global scale are built on deep learning. The AI capabilities that were theoretical possibilities in 2011 are deployed products in 2023.

The broader AI research agenda has been transformed. The specific problems that AI research pursues — and the specific approaches used to pursue them — have shifted dramatically. The progress on language understanding that has produced Transformer Revolution, the progress on protein structure prediction that has produced AlphaFold, the progress on game playing that has produced AlphaStar and AlphaZero — all of these descend from the paradigm shift that AlexNet initiated.

The single most important catalyst for all of this was a 15.3% top-5 error rate on an image recognition benchmark, achieved in the fall of 2012 by three researchers at the University of Toronto, using two gaming GPUs that cost $2,000, training for five days.


The Starting Gun

AlexNet was not the origin of deep learning. That origin goes back to the perceptron, to backpropagation, to the PDP volumes, to the work of the underground period. It goes back to LeCun’s LeNet and to Hinton’s decades of sustained conviction that this approach was right.

AlexNet was the starting gun — the moment when it became undeniable that deep learning had arrived, that the approach that the underground had been developing for decades was not just promising but transformative, that the future of AI would be written by the people who understood how to train large neural networks on large datasets using GPU computing.

The starting gun did not begin the race. It announced that the race had begun. The runners were already at the starting line. Some of them had been there for thirty years.

When the gun went off in October 2012, they ran.


Further Reading

  • “ImageNet Classification with Deep Convolutional Neural Networks” by Krizhevsky, Sutskever, and Hinton (2012) — The AlexNet paper. Read it carefully. It is one of the most important scientific papers of the century.
  • “Visualizing and Understanding Convolutional Networks” by Zeiler and Fergus (2014) — The ZFNet paper, which provides the most illuminating visual account of what deep convolutional networks actually learn. Essential for understanding the AlexNet architecture.
  • “Very Deep Convolutional Networks for Large-Scale Image Recognition” by Simonyan and Zisserman (2015) — The VGGNet paper, which established the importance of depth as a design principle and whose architecture remained influential for years.
  • “Deep Residual Learning for Image Recognition” by He, Zhang, Ren, and Sun (2016) — The ResNet paper, which solved the degradation problem in very deep networks and achieved superhuman performance on ImageNet.
  • “How ImageNet Sparked an AI Revolution” — Various retrospective accounts — Multiple journalists and researchers have written retrospective accounts of the 2012 ILSVRC and the AlexNet result. MIT Technology Review’s account and Fei-Fei Li’s memoir provide complementary perspectives.

Next in the Events series: E15 — The Attention Mechanism, 2017: The Transformers Change Everything — The full story of the “Attention Is All You Need” paper — how a Google Brain team developed a new architecture for sequence modelling, why it worked better than everything that came before, and how it became the foundation for every large language model in existence. The paper that made GPT possible.


Minds & Machines: The Story of AI is published weekly. If the AlexNet story — the breakthrough that nobody saw coming, the five days of GPU training that changed the world, the starting gun that began the modern AI era — clarifies something about how scientific revolutions actually happen, share it with someone who would find that clarity worth having.