Dr. Pranay Jha

Training vs Inference: Why Using AI Is the Real Cost (GenAI Series, Part 9)

Training builds a model once in three stages; inference runs it on every request, forever. Why the recurring inference bill, not the headline training cost, decides AI economics.

Read time: 9 minutes

Generative AI Series · Part 9 of 30

TL;DR · Key Takeaways

  • Training is building the model. It happens in three stages (pretraining, then fine-tuning, then alignment), and it is a huge one-time cost.
  • Inference is using the finished model to answer a single request. It is cheap per request, but it happens on every query, forever.
  • The headline training figure gets the attention, but for almost anyone deploying AI, the recurring inference bill is the number that decides the economics.
  • This split is why the whole back half of this series is about serving models efficiently, not training them.

One number gets all the headlines: the cost of training a frontier model, reported anywhere from tens of millions to well over a hundred million dollars. It is a staggering figure, and it makes for a great headline. It is also, for almost everyone actually building with AI, the wrong number to obsess over.

Here is the thing that surprised me when I first internalised it. That gigantic training cost is paid once. The cost that never stops, the one that quietly determines whether your AI product makes money or loses it, is paid every single time someone presses enter. To see why, you have to separate the two completely different activities that the phrase “AI cost” smears together: building the model, and using it.

[Figure: Building a model in three stages. 1. Pretraining: read the internet, learn language itself. 2. Fine-tuning: teach it to follow instructions. 3. Alignment: human feedback for helpful, safe replies. Costs shrink at each stage; pretraining is by far the heaviest lift.]
A usable chat model is built in three passes, not one, and most of the bill is the first.

Building the model, in three passes

Training is not one event. It is three, and they do very different jobs. The first and largest is pretraining. The model is shown a colossal slice of text, much of the public internet plus books and code, and made to play the next-token game from Part 4 billions upon billions of times. This is where it learns grammar, facts, reasoning patterns, and the general shape of language. It is also where nearly all the money and time go: thousands of GPUs running for weeks or months. When you read about a model costing a fortune to train, this stage is almost all of it.

A freshly pretrained model is oddly hard to use. It is a brilliant autocomplete that will happily continue your question with ten more questions, because all it has learned to do is extend text plausibly. So next comes fine-tuning: a much shorter round of training on curated examples of the behaviour you want, typically instructions paired with good responses. This is what teaches the model to treat your input as a request to answer rather than a passage to continue. It is far cheaper than pretraining because it is a light touch on an already-capable model, not a rebuild.

The third pass is alignment, and it is the one that turned capable models into ones people actually like talking to. Here humans weigh in on which of the model’s answers are more helpful, honest, and harmless, and that judgment is used to nudge the model toward the preferred style and away from the rest. This is the stage that makes a model decline a harmful request or admit uncertainty instead of bluffing. The technical name you will hear is RLHF, and it gets the Go Deeper treatment below.

[Figure: Build once, then serve forever. Training: one-time, intense, then it is done. Inference (serving): request after request; the right-hand box never closes for as long as the product is live.]
Training is a project with an end date. Inference is an open tab.

The asymmetry that runs the economics

Now put the two side by side. Training is enormous but finite. You spend the money, you get a model, the meter stops. Inference is tiny per request but unbounded: every question a user asks runs the model again, consumes GPU time again, and costs a fraction of a cent to a few cents again. One conversation is trivial. The problem is that successful products do not have one conversation. They have millions a day.

Multiply a tiny number by a vast one and you get a big number. A model used at scale can, over its serving life, rack up inference costs that dwarf what it took to train. This is the asymmetry that catches teams off guard. They budget for the dramatic, one-time training spend, or they pay a vendor by the token assuming it will stay cheap, and then usage grows and the recurring bill becomes the entire business problem. It is also why a smaller, cheaper-to-run model that is “good enough” often beats a larger, smarter one in production: you pay for the extra size once at training, but you keep paying for it on every request thereafter.

Put rough numbers on it and the point stops being abstract. Imagine a model where a single answer costs you half a cent to generate. A weekend side project handling a thousand requests a day spends five dollars a day on inference, which is a rounding error you will never notice. Drop that exact same model into a product doing five million requests a day, the kind of traffic a modestly popular app reaches, and you are now spending twenty-five thousand dollars a day. That is roughly nine million dollars a year, every year, just to keep answering. The model never got more expensive to run. You simply ran it more, and at that volume the cumulative bill passes ten million dollars, the low end of the reported frontier training range, in a little over a year and keeps climbing. The numbers are illustrative, but the shape of them is exactly what teams discover the hard way.
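The arithmetic in that paragraph is simple enough to sketch in a few lines. The half-cent cost and the two traffic levels come from the text; the function names and the ten-million-dollar comparison figure are my own placeholders, chosen only to stand in for the low end of the "tens of millions" range.

```python
# Illustrative figures from the paragraph above; none are measured prices.
COST_PER_REQUEST_USD = 0.005  # half a cent per answer


def daily_cost(requests_per_day: int,
               cost_per_request: float = COST_PER_REQUEST_USD) -> float:
    """Inference spend for one day at a flat per-request cost."""
    return requests_per_day * cost_per_request


def days_to_reach(budget_usd: float, requests_per_day: int) -> float:
    """Days of serving until cumulative inference spend equals budget_usd."""
    return budget_usd / daily_cost(requests_per_day)


if __name__ == "__main__":
    side_project = daily_cost(1_000)
    product = daily_cost(5_000_000)
    print(f"Side project: ${side_project:,.2f} per day")
    print(f"Product:      ${product:,.2f} per day")
    print(f"Product:      ${product * 365:,.0f} per year")
    # ASSUMPTION: 10 million dollars is a placeholder for the low end of
    # the reported training range, not a quoted figure for any model.
    days = days_to_reach(10_000_000, 5_000_000)
    print(f"Days until inference spend reaches $10M: {days:.0f}")
```

A flat per-request cost is itself a simplification. Real costs vary with prompt and reply length, which is why the per-token view matters once traffic grows.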

[Figure: Why inference wins the cost race. Horizontal axis: time. Training is paid once; each request is small, but requests never stop, and they add up.]
The tall black bar is what makes headlines. The red bars are what empties the bank account.

Why this shapes everything that follows

If you only train models, you care about pretraining. Almost nobody only trains models. The vast majority of people working with AI take a model someone else built and run it, which means their entire cost and performance story is an inference story. How fast does a reply come back? How many requests can one GPU serve at once? How much memory does each conversation eat? Those questions, not training, are where real budgets are won and lost.
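To make those three questions concrete, here is a rough capacity sketch. It leans on Little's law (requests in flight equals arrival rate times time per request), which is standard queueing arithmetic. The function names, the per-GPU concurrency figure, and the hourly rate are hypothetical inputs you would replace with your own measurements.

```python
import math


def gpus_needed(peak_requests_per_second: float,
                seconds_per_request: float,
                concurrent_requests_per_gpu: int) -> int:
    """Estimate GPU count from traffic, latency, and per-GPU concurrency."""
    # Little's law: average requests in flight = arrival rate * latency.
    in_flight = peak_requests_per_second * seconds_per_request
    return math.ceil(in_flight / concurrent_requests_per_gpu)


def daily_gpu_cost(gpu_count: int, hourly_rate_usd: float) -> float:
    """Cost of keeping gpu_count GPUs running for 24 hours."""
    return gpu_count * hourly_rate_usd * 24


if __name__ == "__main__":
    # ASSUMPTION: all four inputs below are made-up example values.
    count = gpus_needed(peak_requests_per_second=100,
                        seconds_per_request=2.0,
                        concurrent_requests_per_gpu=16)
    print(f"GPUs needed at peak: {count}")
    print(f"Daily GPU cost: ${daily_gpu_cost(count, hourly_rate_usd=2.0):,.2f}")
```

Every lever pulls on this one formula: a faster reply shrinks `seconds_per_request`, and better batching or a smaller memory footprint raises `concurrent_requests_per_gpu`.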

This is the honest reason the back half of this series leans the way it does. Quantization, batching, the choice of inference engine, the memory wall, on-prem versus cloud: every one of those topics exists to make inference cheaper or faster, because inference is the part you pay for again and again. Training is the headline. Inference is the business.

Reality check: when a vendor quotes you a low price per thousand tokens, do the multiplication before you celebrate. A few cents per call sounds free until you model real traffic, long prompts, and a chatty product. I have seen far more AI projects undone by inference costs creeping up under growth than by anyone struggling to afford training.
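Doing that multiplication can look something like the sketch below. It assumes a single blended price per thousand tokens and assumes the full conversation history is resent with every turn, which is how stateless chat APIs are commonly used. The price, token counts, and traffic are hypothetical, and real vendors often price input and output tokens differently.

```python
def conversation_tokens(system_tokens: int, user_tokens: int,
                        reply_tokens: int, turns: int) -> int:
    """Total billed tokens for one conversation.

    ASSUMPTION: each turn resends the whole history as the prompt, so the
    prompt grows with every exchange.
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        prompt = history + user_tokens
        total += prompt + reply_tokens     # billed for prompt plus reply
        history = prompt + reply_tokens    # next turn carries it all forward
    return total


def monthly_bill(price_per_1k_tokens: float, tokens_per_conversation: int,
                 conversations_per_day: int, days: int = 30) -> float:
    """Monthly spend at a flat blended price per thousand tokens."""
    per_conversation = price_per_1k_tokens * tokens_per_conversation / 1000
    return per_conversation * conversations_per_day * days


if __name__ == "__main__":
    # ASSUMPTION: every number here is an illustrative placeholder.
    for turns in (1, 5, 10):
        tokens = conversation_tokens(200, 50, 150, turns)
        bill = monthly_bill(0.002, tokens, conversations_per_day=100_000)
        print(f"{turns:>2} turns: {tokens:>6} tokens, ${bill:,.0f} per month")
```

Under these assumptions the token count grows faster than the number of turns, because every new turn pays again for everything said before it. That is the "chatty product" effect in miniature.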

Go Deeper (optional, for technical readers)

The alignment stage usually runs as RLHF, reinforcement learning from human feedback, and it optimizes something subtler than “predict the next token.” It works in two moves. First, you collect human preferences: people are shown two model answers to the same prompt and pick the better one, thousands of times. Those comparisons train a separate reward model whose only job is to look at a response and output a score estimating how much a human would like it. The reward model is, in effect, a learned stand-in for human taste.
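One common way to turn those pairwise comparisons into a training signal is a Bradley-Terry style loss: the reward model is penalized whenever it scores the rejected answer close to, or above, the preferred one. The sketch below shows only that loss on two plain numbers. The function names are mine, and a real reward model would produce the scores from a neural network instead of taking them as arguments.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_preferred - score_rejected)).

    Small when the preferred answer scores well above the rejected one,
    large when the reward model ranks the pair the wrong way round.
    """
    return softplus(score_rejected - score_preferred)


if __name__ == "__main__":
    # ASSUMPTION: the scores are made-up outputs of a hypothetical reward model.
    print(f"Agrees with the human:    {preference_loss(2.0, 0.0):.4f}")
    print(f"No opinion either way:    {preference_loss(0.0, 0.0):.4f}")
    print(f"Disagrees with the human: {preference_loss(0.0, 2.0):.4f}")
```

Only the difference between the two scores matters, which fits the data: humans said which answer was better, not how good either one was in absolute terms.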

Second, the language model is tuned to produce answers that the reward model scores highly, typically with a reinforcement-learning algorithm such as PPO, while a penalty keeps it from drifting too far from its original behaviour and gaming the score with gibberish. So RLHF does not teach the model new facts; it shifts which of the things it could already say it chooses to say, toward what people preferred. A lighter-weight alternative called DPO (direct preference optimization) skips the separate reward model and tunes directly on the preference pairs, which is cheaper and increasingly popular. Both share the same goal: turn raw next-token competence into a helpful, well-mannered assistant. And both are far cheaper than pretraining, which is why labs can refresh a model’s behaviour without rebuilding it from scratch.
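Both ideas in that paragraph reduce to short formulas, sketched here on plain numbers. The first function is the drift penalty in its usual per-sample form: the reward-model score minus a weight times how much more likely the tuned model finds the answer than the original did. The second is the DPO loss as commonly written. The function names, the default `beta`, and all example values are assumptions, and a real implementation would work on batches of token log-probabilities.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def penalized_reward(reward_score: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the original model."""
    return reward_score - beta * (logp_policy - logp_reference)


def dpo_loss(policy_logp_preferred: float, policy_logp_rejected: float,
             ref_logp_preferred: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair; no separate reward model involved."""
    # How much more the tuned model favours each answer than the reference did.
    gain_preferred = policy_logp_preferred - ref_logp_preferred
    gain_rejected = policy_logp_rejected - ref_logp_rejected
    # -log(sigmoid(beta * (gain_preferred - gain_rejected)))
    return softplus(-beta * (gain_preferred - gain_rejected))


if __name__ == "__main__":
    # ASSUMPTION: log-probabilities below are invented for illustration.
    print(f"Shaped reward, no drift:   {penalized_reward(1.0, -5.0, -5.0):.3f}")
    print(f"Shaped reward, with drift: {penalized_reward(1.0, -3.0, -5.0):.3f}")
    print(f"DPO loss, untuned model:   {dpo_loss(-5.0, -5.0, -5.0, -5.0):.4f}")
    print(f"DPO loss, after a nudge:   {dpo_loss(-4.0, -6.0, -5.0, -5.0):.4f}")
```

In both cases the original model acts as an anchor, which is the mechanical reason these methods reshape behaviour without rebuilding what the model already knows.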

This is Part 9 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. New here? Part 4 explains what a model really is, and Part 1 starts the whole series.

Training vs inference at a glance

| Aspect | Training | Inference |
| --- | --- | --- |
| What it is | Building the model | Using the finished model |
| When | Once, then done | Every request, forever |
| Cost pattern | Huge, one-time | Small per call, unbounded total |
| Who mainly pays | Foundation labs | Everyone who deploys |
| What to optimise | Data and compute | Throughput and utilization |

Training is a project with an end date; inference is an open tab.

The Bottom Line

Training and inference are two different economies wearing the same word. Training builds the model in three stages of descending cost (pretraining for raw ability, fine-tuning for instruction-following, alignment for good manners), and it is a heavy cost you pay once. Inference is what happens every time the finished model answers, cheap on its own and relentless at scale.

My own rule of thumb is blunt: if you are choosing or designing an AI system, decide it on the inference numbers, because that is the bill that arrives every month. The eye-catching training figure belongs to the handful of labs building foundation models. For everyone downstream, the cost that matters is the one that never stops. That is exactly why the next few parts dig into the things that drive it, starting with the context window and why longer prompts cost more than they look.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
