TL;DR · Key Takeaways
- A neural network is a stack of tiny, simple decision-makers called neurons, wired together in layers.
- Each connection has a weight, a dial that says how much one neuron’s output matters to the next. Those dials are the parameters from Part 4.
- Learning is just tuning the dials: the network guesses, sees how wrong it was, and nudges every dial a little to be less wrong next time.
- Do that across millions of examples and the dials settle into a setting that captures real patterns. No rules were written by hand; they emerged from the tuning.
We have called a model a giant function with billions of dials, and said that training sets those dials. That is true, but it skips the interesting part: how does turning dials ever produce something that can write or translate? Phase 2 of this series opens the hood, gently, and the first thing to understand is the neural network, the structure that nearly all modern AI is built from. The reassuring news is that the core idea is genuinely simple. There is no advanced math you need to follow it, only a willingness to picture a lot of very small, very dumb parts cooperating. By the end of this part the phrase “the network learned” will mean something concrete to you.
A neuron is just a tiny decision-maker
Forget biology for a moment. An artificial neuron is a small unit that takes in a few numbers, combines them, and passes one number on. That is the whole job. The combining step is where the cleverness hides. Each incoming number is multiplied by a weight, a value that says how much this particular input should count. A strong positive weight means “pay close attention to this one,” a weight near zero means “ignore it,” and a negative weight means “this one pushes the other way.” The neuron adds up all those weighted inputs and then applies a simple rule that decides how strongly to fire, sending the result onward.
Picture a tiny loan officer who only ever sees three numbers: income, debts, and age. She has learned roughly how much to trust each one, so she scales each number by its weight, adds the results into a single score, and passes along a yes-leaning or no-leaning signal. On her own she is hopelessly crude. But she is not on her own. She is one of thousands, each watching different signals, and the output of one becomes the input of the next. Stacked deep enough, these trivial decisions combine into behaviour that looks nothing like simple addition.
It helps to see the arithmetic once, with tiny numbers. Say our loan officer sees income, debts, and age scaled to simple values of 8, 2, and 5. Her learned weights are 0.6 for income, −0.9 for debts, and 0.1 for age, because she has come to trust income, distrust debts, and barely weigh age. She multiplies and adds: 8 × 0.6 gives 4.8, 2 × −0.9 gives −1.8, and 5 × 0.1 gives 0.5. The total is 4.8 − 1.8 + 0.5 = 3.5, a positive score, so she fires a yes-leaning signal to the next layer. Change a single weight and that verdict can flip. Together with the simple firing rule, that is the entire computation inside a neuron, and a real network simply runs billions of these at once. Nothing harder than multiply-and-add, followed by that rule, is happening anywhere in it.
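The loan officer's arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than how real libraries are written: the names `weighted_sum` and `fire` are made up for this example, and because the text only says the neuron applies a simple rule, the yes-or-no threshold at zero is an assumption.

```python
# Inputs and weights taken from the loan-officer example in the text.
inputs = {"income": 8, "debts": 2, "age": 5}
weights = {"income": 0.6, "debts": -0.9, "age": 0.1}

def weighted_sum(inputs, weights):
    """Multiply each input by its weight and add up the results."""
    return sum(inputs[name] * weights[name] for name in inputs)

def fire(score):
    """Assumed firing rule: 1 (yes-leaning) if the score is positive, else 0."""
    return 1 if score > 0 else 0

score = weighted_sum(inputs, weights)
print(round(score, 2))  # about 3.5, as in the worked example
print(fire(score))      # 1, a yes-leaning signal

# Change a single weight: debts now count much more heavily against her.
harsher = dict(weights, debts=-3.0)
print(fire(weighted_sum(inputs, harsher)))  # 0, the verdict flips
```

With the debts weight moved from −0.9 to −3.0, the score becomes 4.8 − 6.0 + 0.5 = −0.7, which is enough to turn the yes-leaning signal into a no-leaning one.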
Layers turn simple steps into complex skill
Neurons are organised into layers. The first layer, the input layer, receives the raw data, say the pixels of an image or the tokens of a sentence. The last layer, the output layer, produces the answer, such as “this is a cat” or the next token in a reply. In between sit the hidden layers, where the real work happens. Each layer takes the previous layer’s outputs, applies its own weights, and hands its results forward. Information flows in one direction, from input through the hidden layers to output, which is why this basic design is called a feed-forward network.
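To make the one-way flow concrete, here is a rough sketch of a feed-forward pass in plain Python. The network shape, the weights, and the function names (`layer`, `feed_forward`) are invented for illustration. The firing rule used here (ReLU, which passes positive scores through and turns negative ones into zero) is one common choice, not something the text specifies.

```python
def relu(score):
    """Assumed firing rule: keep positive scores, silence negative ones."""
    return max(0.0, score)

def layer(inputs, weight_rows):
    """Run one layer. Each row of weights belongs to one neuron in that layer."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in weight_rows]

def feed_forward(inputs, network):
    """Pass the data through every layer in order, from input to output."""
    signals = inputs
    for weight_rows in network:
        signals = layer(signals, weight_rows)
    return signals

# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
network = [
    [[1.0, 1.0], [1.0, -1.0]],  # hidden layer: two neurons, two weights each
    [[2.0, 5.0]],               # output layer: one neuron, two weights
]

output = feed_forward([1.0, 2.0], network)
print(output)  # [6.0]
```

The hidden neurons produce 3.0 and 0.0 (the second one scores −1.0 and is silenced), and the output neuron turns those into 2.0 × 3.0 + 5.0 × 0.0 = 6.0. Each layer only ever sees the outputs of the layer before it.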
The magic of depth is that each layer can build on the abstractions of the one before. In an image network, the earliest layer might only notice edges and patches of colour. The next combines edges into corners and curves. A later one assembles those into eyes, ears, and whiskers, and the final layers put it together as “cat.” Nobody told the network what an edge or an ear was. Those concepts emerged because the structure lets simple features in early layers compose into richer ones deeper in. “Deep learning,” the term from Part 3, simply means a network with many such layers stacked up.
Learning is just tuning the dials
Here is the question that matters: where do the weights come from? When a network is born, every weight is essentially random, so its first guesses are nonsense. Learning is the process of fixing that, and it works by trial and feedback rather than by anyone writing rules. You show the network an example for which you know the right answer. It runs the example through its layers and makes a prediction. You compare that prediction to the truth and measure how far off it was. Then comes the key step: the network adjusts its weights slightly, in the direction that would have made the answer a little less wrong.
One nudge on one example changes almost nothing. The power comes from repetition at scale. Show the network another example, measure the error, nudge again. Do this across millions of examples, many times over, and the weights drift away from randomness toward settings that get the answers right not just on the examples it saw but on new ones it did not. This is the entire loop behind every model in this series: guess, check, nudge, repeat. It is less like programming and more like tuning an instrument by ear, tightening each string a hair at a time until the whole thing rings true.
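The guess, check, nudge, repeat loop fits in a dozen lines if the "network" is shrunk to a single dial. The sketch below is a toy under stated assumptions: the task (learn that the answer is three times the input), the nudge size, and the number of passes are all made up for illustration, and the nudge rule is the simplest one that moves the dial towards less error.

```python
import random

# Toy task: the right answer is always 3 times the input.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

rng = random.Random(0)               # fixed seed so the run is repeatable
weight = rng.uniform(-1.0, 1.0)      # the dial starts out random
learning_rate = 0.01                 # assumed size of each nudge

for _ in range(100):                 # many passes over the same examples
    for x, target in examples:
        guess = weight * x                    # guess
        error = guess - target                # check: how far off was it?
        weight -= learning_rate * error * x   # nudge towards less error

print(round(weight, 4))  # very close to 3.0
```

No single nudge does much, but a few hundred of them pull the dial from its random starting point to the value that fits every example.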
There is a classic mistake lurking here, and it is worth naming because the whole field guards against it. A network can tune its dials so tightly to the examples it studied that it essentially memorises them, acing the practice questions while failing anything new. This is called overfitting, and it is the machine-learning version of a student who learns the answer key instead of the subject. The defence is simple in principle: you hold back a slice of data the network never trains on, then test the network on that unseen slice. If it does well there, it has learned the general pattern rather than the specific examples. Getting a model to generalise, not memorise, is most of the real craft of training, and it is why “how much good, varied data do you have” often turns out to matter even more than “how big is your network,” a theme we return to in a later part on data quality.
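The hold-back test can be shown with a deliberately extreme toy. In the sketch below, the data, the 8-to-2 split, and both "models" are hypothetical: `memoriser` is a lookup table standing in for an overfitted network, and `generaliser` fits a single dial to the pattern. Neither is a real neural network. The point is only how the held-out slice tells them apart.

```python
import random

# Toy data following one simple pattern: the answer is twice the input.
data = [(x, 2 * x) for x in range(10)]

rng = random.Random(0)
rng.shuffle(data)
train, held_out = data[:8], data[8:]   # hold back a slice never used for training

# Model 1: memorises the training answers, guesses 0 for anything unseen.
memory = dict(train)
def memoriser(x):
    return memory.get(x, 0)

# Model 2: learns one dial (a slope) that best fits the training examples.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
def generaliser(x):
    return slope * x

def mean_abs_error(model, examples):
    """Average distance between the model's answers and the right ones."""
    return sum(abs(model(x) - y) for x, y in examples) / len(examples)

print("memoriser   train:", mean_abs_error(memoriser, train))
print("memoriser   unseen:", mean_abs_error(memoriser, held_out))
print("generaliser unseen:", mean_abs_error(generaliser, held_out))
```

The memoriser is perfect on the examples it studied and wrong on the unseen slice, while the generaliser stays accurate on both. Judged on training data alone, the two would look equally good.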
Go Deeper (optional, for technical readers)
Two pieces of machinery make “measure the error” and “nudge the weights” precise. The first is the loss function. It is a formula that turns the gap between the prediction and the correct answer into a single number, the loss, where bigger means worse. Training has exactly one goal stated mathematically: make the average loss across the data as small as possible. Different tasks use different loss functions (cross-entropy for predicting a token, for instance), but they all serve the same role of scoring how wrong the network currently is.
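As a rough illustration of how a prediction and a correct answer become a single loss number, here are two common loss functions written out by hand. The function names and the example numbers are made up for this sketch. Real frameworks ship their own, more numerically careful versions.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared gaps between predictions and correct answers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, correct_index):
    """Loss for picking one token: small when the correct token got high probability."""
    return -math.log(probabilities[correct_index])

# Predicting numbers: one guess is exact, the other is off by 2.
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # 2.0

# Predicting a token out of three candidates, where candidate 0 is correct.
confident_and_right = cross_entropy([0.7, 0.2, 0.1], correct_index=0)
confident_and_wrong = cross_entropy([0.1, 0.2, 0.7], correct_index=0)
print(confident_and_right < confident_and_wrong)   # True: bigger means worse
```

In both cases the output is one number that grows as the prediction gets worse, which is all that training needs in order to know which direction counts as better.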
The second is gradient descent, the method for choosing which way to nudge. Imagine the loss as a hilly landscape where your position is set by all the weights, and height is the error. You want the lowest valley. The gradient is simply the direction of steepest uphill at your current spot, so you step the opposite way, downhill, by a small amount controlled by the learning rate. Backpropagation is the efficient bookkeeping that computes, for every weight at once, how much it contributed to the error, so each can be adjusted in proportion. Repeat these small downhill steps over many batches of data and the weights descend toward low loss. That is the real engine under “guess, check, nudge,” and the part on training versus inference picks up the cost of running it at scale.
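The downhill walk can be sketched on a landscape small enough to reason about by hand. Everything here is an assumption chosen for illustration: a two-dial loss shaped like a bowl with its lowest point at 3 and −1, a gradient worked out with pencil and paper, and a learning rate of 0.1. In a real network, backpropagation is what computes the gradient for every weight. This sketch skips that and only shows the stepping.

```python
def loss(w1, w2):
    """Hypothetical bowl-shaped landscape, lowest at w1 = 3 and w2 = -1."""
    return (w1 - 3.0) ** 2 + (w2 + 1.0) ** 2

def gradient(w1, w2):
    """Direction of steepest uphill for this loss, derived by hand."""
    return 2.0 * (w1 - 3.0), 2.0 * (w2 + 1.0)

w1, w2 = 0.0, 0.0        # arbitrary starting position on the landscape
learning_rate = 0.1      # how far to step each time

for _ in range(100):
    g1, g2 = gradient(w1, w2)
    w1 -= learning_rate * g1   # step opposite to the gradient, i.e. downhill
    w2 -= learning_rate * g2

print(round(w1, 3), round(w2, 3))   # close to 3.0 and -1.0
print(loss(w1, w2) < 1e-9)          # True: the walk has reached the valley floor
```

Each step shrinks the distance to the valley by a fixed fraction, so a learning rate that is too small makes progress slow. On this particular bowl, a learning rate above 1.0 would overshoot further on every step and never settle.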
This is Part 6, the start of Phase 2, in a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. If “weights” or “parameters” felt fuzzy, revisit Part 4 on what a model really is, or start the whole series at Part 1.
The Bottom Line
A neural network sounds intimidating until you see its parts. It is a stack of layers, each made of simple neurons, with a weighted connection on every wire. A neuron just scales its inputs, adds them up, and passes a signal on. Depth lets early layers find simple features and later layers combine them into rich ones. And learning is nothing more exotic than guessing, measuring the error, and nudging the weights a little in the better direction, over and over, until the patterns stick. Every billion-parameter model is this same idea scaled up beyond intuition. Next we look at the very first thing that has to happen before a language network can learn at all: turning words into numbers, through tokens and embeddings. Now that you can picture the dials, what would you want a network to learn to predict?








