TL;DR · Key Takeaways
- AI did not arrive in 2022. It walked through four stages: hand-written rules, then machine learning, then deep learning, then the transformer.
- The big shift was who writes the rules. Early systems were told what to do. Modern ones figure it out from examples.
- The 2017 transformer was the unlock: it let models read in parallel and scale, so more data and bigger hardware actually paid off.
- ChatGPT felt sudden because three slow trends (data, compute, and architecture) finally lined up at the same moment.
To a lot of people, generative AI looked like it appeared overnight in late 2022. One week nobody was talking about it, the next everybody had asked a chatbot to write a poem. The truth is far less sudden. The road to ChatGPT runs back more than sixty years, through several false starts and one quiet 2017 breakthrough that almost nobody outside research noticed at the time. Understanding that road is the best way to see why today’s systems behave the way they do, and why they finally broke through when they did. The whole story turns on one question that kept getting a better answer: who writes the rules?
The age of hand-written rules
The first few decades of AI ran on instructions a human typed out by hand. A programmer sat down and wrote, in effect, a long list of if-this-then-that rules. If the patient has these symptoms, suggest this diagnosis. If the chess board looks like this, prefer that move. These were called expert systems, and inside their narrow lanes they could be impressive. MYCIN, a 1970s medical system, recommended antibiotics about as well as some specialists. In the same hand-crafted spirit, IBM’s Deep Blue beat world chess champion Garry Kasparov in 1997 using hand-tuned evaluation rules and brute-force search.
The catch was that every scrap of knowledge had to be put in by a person, one rule at a time. That works for chess, where the rules are fixed and the board is small. It falls apart the moment the world gets messy. Nobody can write down every rule for understanding a sentence, because language is full of exceptions, context, and things left unsaid. An early chatbot called ELIZA from 1966 could mimic a therapist with a few clever text tricks, but it understood nothing. These systems were brittle: step one inch outside what the author imagined and they broke. The bottleneck was painfully clear. As long as humans had to hand-write the rules, AI could only ever be as broad as someone’s patience for typing them.
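To make the brittleness concrete, here is a minimal sketch of a hand-written rule system in Python. The rules, the fallback reply, and the respond function are all hypothetical, written only for illustration in the spirit of early pattern-matching chatbots. They are not taken from ELIZA or any real system.

```python
import re

# Hypothetical hand-written rules: each is a (pattern, reply template) pair
# that a person had to think of and type out in advance.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]

FALLBACK = "Please go on."


def respond(message: str) -> str:
    """Return the reply of the first rule whose pattern matches, else a stock fallback."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Reuse the captured words, minus any trailing punctuation.
            return template.format(match.group(1).rstrip(".!?"))
    return FALLBACK
```

Calling respond("I feel tired.") returns "Why do you feel tired?", which looks clever. Calling respond("I'm exhausted") falls through to "Please go on.", because nobody wrote a rule for that phrasing. Every gap like this has to be patched by hand, one rule at a time.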
Letting the machine learn from examples
The next idea flipped the job around. Instead of writing the rules, what if you showed the computer thousands of examples and let it work out the rules itself? That is machine learning, and it became practical through the 1990s and 2000s. You do not tell a spam filter what spam looks like. You feed it a pile of emails already marked spam or not spam, and it learns the patterns that separate the two. Show it enough examples and it generalises to messages it has never seen.
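As a rough illustration of learning from labelled examples, the sketch below trains a tiny spam filter in plain Python. It assumes one particular technique, naive Bayes with add-one smoothing, chosen because it is short, not because the filters of that era all worked this way. The six training messages and the train and classify names are made up for the example.

```python
import math
from collections import Counter

# Tiny made-up training set of (message, label) pairs.
# Real filters learn from vastly more examples.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch on monday with the team", "ham"),
    ("notes from the team meeting", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the higher log-probability (naive Bayes, add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
```

Neither "free prize inside" nor "team meeting on monday" appears in the training data, yet classify labels the first as spam and the second as ham. No one wrote a rule saying that "free" is suspicious. The counts did that work.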
This was a real shift in power. The human role moved from writing the rules to gathering good examples and choosing what to learn from. Recommendation engines, fraud detection, and search ranking all grew out of this era. But classic machine learning still leaned on people to hand-pick which features of the data mattered, and it struggled with raw, unstructured things like images and audio. To go further, the rule-finding itself needed to get much deeper.
Deep learning and the great surge
Deep learning is machine learning with neural networks that are many layers deep, an idea borrowed loosely from how brains stack simple units into complex behaviour. The theory had been around for decades but kept stalling, because deep networks need enormous data and enormous computing power, and neither was available at scale. That changed around 2012, when a network called AlexNet crushed the field in an image-recognition contest by running on graphics chips, the same GPUs built for video games. The result was a jolt. Deep learning could now teach itself to recognise images, transcribe speech, and translate text, and it kept getting better as you fed it more.
For language, though, there was a stubborn problem. The networks of the day read text one word at a time, in order, like a person reading with a finger on the page. That made them slow to train and forgetful over long passages. You could not simply make them huge, because the sequential reading would not parallelise across all those GPUs. The data was there and the hardware was there, but the design for using them on language was still missing. That missing piece arrived in 2017.
2017: the transformer changes everything
In 2017 a team of Google researchers published a paper with a bold title: “Attention Is All You Need.” It introduced the transformer, the architecture behind virtually every major language model in use today. The trick was a mechanism called attention, which lets the model look at every word in a passage at once and weigh how much each word matters to every other. Instead of crawling left to right, it takes in the whole sentence in one sweep and figures out the connections directly.
That one change had two huge consequences. It handled long-range meaning far better, since a word at the end of a paragraph could attend directly to a word at the start. And because it processed everything in parallel rather than in sequence, it finally used all those GPUs at full tilt. Suddenly the data and the hardware that had been waiting in the wings had a design that could soak them up. Models got bigger, training got faster, and quality climbed in step. The transformer is the engine under ChatGPT, and we devote a whole later part to how attention actually works.
The ChatGPT moment, and why it landed when it did
After 2017 the transformer scaled fast. OpenAI’s GPT line grew from a research curiosity in 2018 to GPT-3 in 2020, a model with 175 billion parameters that could write strikingly fluent text. But GPT-3 lived behind a developer interface, so the public barely felt it. The thing that changed the world was a packaging decision. In November 2022 OpenAI wrapped a tuned model in a simple chat box, made it free to try, and called it ChatGPT. Within two months it had reached an estimated 100 million users, one of the fastest consumer adoptions ever recorded. The technology had been brewing for years. The chat box is what let everyone finally touch it.
This is the real answer to “why now.” Generative AI needed three things to mature together. It needed data, and decades of the internet had produced a near-endless supply of text to learn from. It needed compute, and GPUs had become powerful and affordable enough to train on that text. And it needed the right architecture, which the transformer finally supplied in 2017. Any one of them alone goes nowhere. Hand-written rules had no data engine. Early deep learning had data and compute but the wrong design for language. Only when all three arrived at once did the curve bend sharply upward, and ChatGPT was the moment that bend became visible to everyone.
Go Deeper (optional, for technical readers)
What did “Attention Is All You Need” actually change? Before it, the leading sequence models were recurrent networks (RNNs and LSTMs) that processed tokens strictly in order, carrying a hidden state forward one step at a time. That sequential dependency had two costs: training could not be parallelised across the length of a sequence, and information from far back in the text tended to wash out before it was needed.
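Both costs show up even in a toy version. The sketch below is a one-unit recurrent network in plain Python. The run_rnn name and the fixed weights 0.5 and 1.0 are assumptions picked for illustration. A real RNN or LSTM learns its weights and uses vectors, not single numbers.

```python
import math


def run_rnn(inputs, w_hidden=0.5, w_input=1.0):
    """Toy one-unit recurrent network with fixed, hypothetical weights."""
    hidden = 0.0
    states = []
    for x in inputs:
        # Strictly sequential: step t cannot start until step t-1 has
        # produced its hidden state, so the loop cannot be parallelised.
        hidden = math.tanh(w_hidden * hidden + w_input * x)
        states.append(hidden)
    return states


# A signal at the first position, followed by 19 empty steps.
states = run_rnn([1.0] + [0.0] * 19)
```

In this setup the first state is about 0.76, and each empty step at least halves it, so by the twentieth step the trace of the first input is smaller than 0.0001. That is the wash-out in miniature.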
The transformer dropped recurrence entirely. Its core is self-attention, where every token computes a weighted blend of all the tokens in the sequence, with the weights computed on the fly from the tokens themselves using parameters learned from the data. Because those comparisons are just large matrix multiplications with no step-by-step ordering, the whole sequence is processed at once, which is exactly what GPUs are built to do fast. The title was a deliberate jab: earlier designs had bolted attention onto a recurrent backbone, and the paper’s claim was that you could throw the recurrence away and keep only the attention. The trade-off is that self-attention’s cost grows with the square of the sequence length, which is why context length, the subject of a later part, becomes such an expensive resource. Positional encodings were added back in so the model still knows word order, since attention by itself is order-blind.
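For readers who want to see the mechanics, here is a minimal sketch of scaled dot-product self-attention in plain Python. It makes one loud simplification: queries, keys, and values are all the raw token vectors, whereas a real transformer first passes the tokens through learned projection matrices and runs several attention heads side by side. The three token vectors and the function names are made up for the example.

```python
import math


def softmax(row):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens.

    Assumption: the learned projection matrices of a real transformer
    are left out to keep the sketch short.
    """
    dim = len(tokens[0])
    # n x n score table: every token is compared with every token,
    # which is where the quadratic cost comes from.
    scores = [[sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
               for key in tokens] for query in tokens]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted blend of all the token vectors.
    outputs = [[sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
               for row in weights]
    return weights, outputs


# Three made-up 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)

# Order-blindness: shuffling the input only shuffles the output the same way.
shuffled = [tokens[2], tokens[0], tokens[1]]
_, shuffled_outputs = self_attention(shuffled)
```

Three tokens produce a 3 by 3 table of weights, so doubling the sequence length quadruples the table. The shuffled run returns the same output vectors in the shuffled order, which shows why positional information has to be supplied separately.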
This is Part 3 of a 30-part walk from zero to the infrastructure behind production AI. If a term here felt new, the Generative AI Complete Guide shows which later part covers it. Want the vocabulary first? See Part 2 on the words everyone uses, or start at Part 1, what generative AI actually is.
The Bottom Line
Generative AI is the latest chapter in a long story about who supplies the rules. We started by writing them by hand, which was precise but brittle. We moved to machine learning, which let systems discover rules from examples. Deep learning made that discovery far more powerful once data and GPUs showed up around 2012. And the 2017 transformer gave language models a design that could finally use all of it, which is why ChatGPT could land in 2022 and feel like magic. Nothing here was sudden, and that is the useful lesson: the next leap will also be built from pieces already visible today. Next, we zoom in on the thing at the centre of it all and ask what a “model” really is. Which stage of this history surprised you most?
References
- Attention Is All You Need (Vaswani et al., 2017, the transformer paper)
- What is deep learning? (IBM)
- Expert systems: the rule-based era (overview)