TL;DR · Key Takeaways
- Almost every dollar in generative AI traces back to one thing: GPU time. Tokens are just the meter on the GPU.
- Your bill is driven by model size, the number of tokens in and out, and the context length, with output tokens usually the most expensive.
- The big decision is build vs buy: pay a provider per token, or run your own GPUs. Buying is cheap to start; building is cheaper at scale.
- The hidden lever is utilization. Through batching, a busy GPU serves many requests at once, which is what makes the per-token price low.
The demo cost about forty cents to run. The first real month in production cost eleven thousand dollars, and nobody on the team could quite explain the jump. This is one of the most common shocks in applied AI, and it happens because generative AI bills are counted in a unit, the token, that feels abstract right up until it is multiplied by real traffic. To budget for this technology you have to follow the money down to where it actually goes, and the trail ends, every time, at a graphics card running flat out. Part 9 told you that inference, not training, is the recurring cost. This part opens up that recurring cost and asks what you are really paying for.
It all comes back to GPU time
Whatever the invoice says, the underlying cost of running a model is the time it spends on expensive accelerators. When a provider charges you per token, that price is a convenient wrapper around GPU-seconds: generating a token takes a certain amount of computation, that computation occupies a slice of a costly chip, and you are billed for the slice. The token is the meter; the GPU is the engine. This is why everything in Phase 4 (smaller models, quantization, better data) was really about cost: each one reduces how much GPU time a given quality of answer requires.
Three things move that GPU time, and therefore your bill. The first is model size: a bigger model does more computation per token, so it costs more for every word it reads or writes. The second is the number of tokens, both the ones you send in and the ones it generates out. The third is context length, which, thanks to the quadratic attention cost from Part 10, makes long prompts disproportionately expensive. Get a feel for these three dials and you can predict a cost before you ever see an invoice, which is most of what good AI budgeting is.
Why output tokens cost more than input
A detail that catches people out: providers usually charge more for output tokens than input tokens, often several times more. The reason is mechanical. Input tokens are processed all at once, in a single parallel pass over your prompt, which GPUs do efficiently. Output tokens are generated one at a time, each requiring its own pass through the entire model before the next can begin. That sequential generation is the slow, expensive part of inference, so the tokens the model writes cost more than the tokens it reads. The practical upshot is that a chatty model that pads every answer with preamble is not just annoying; it is literally more expensive, and asking for concise output is a real cost control.
Context length compounds this. Every token the model generates has to attend back over the entire context, so a long prompt makes each output token more expensive, not just the input more expensive. Stuff a huge document into the context on every call and you pay for it twice: once to process the input, and again on every output token that has to look back across all of it. This is the cost shadow of the “bigger window” temptation from Part 10, and it is why disciplined retrieval, sending only the relevant passages, is a budget decision as much as a quality one.
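To make the "pay for it twice" point concrete, here is a rough sketch that counts how many context positions the model looks back over while writing an answer. It is a proxy for attention work during generation, not a price, and the function name and token counts are illustrative assumptions.

```python
def attended_positions(prompt_tokens: int, output_tokens: int) -> int:
    """Total context positions looked back over while generating the output.

    Output token i (0-based) attends over the prompt plus the i tokens
    already generated, so the total is a simple arithmetic series.
    """
    return sum(prompt_tokens + i for i in range(output_tokens))


# Same 500-token answer, two different prompt sizes (illustrative numbers).
short = attended_positions(prompt_tokens=2_000, output_tokens=500)
long = attended_positions(prompt_tokens=20_000, output_tokens=500)

print(short, long, round(long / short, 1))  # prints: 1124750 10124750 9.0
```

The answer is the same length in both cases, yet the longer prompt makes generating it roughly nine times heavier by this measure. That is the second payment: the prompt is charged once as input and then felt again on every output token.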
Put rough numbers on the opening shock and it stops being mysterious. Say each support conversation sends a 2,000-token prompt (system rules, retrieved context, history) and gets back a 500-token answer. The demo, a handful of test chats, touches a few thousand tokens total: pennies. Now serve 50,000 conversations a day. That is 100 million input tokens and 25 million output tokens daily, and because output is priced several times higher, the answers can cost as much as the far larger pile of input: at a four-to-one price ratio, the two come out exactly equal. Multiply by thirty days and a “cheap” per-token rate has quietly become a five-figure monthly line item. Nothing went wrong; the meter simply met real traffic. The fix is not panic but design: trim the prompt, cap the output, retrieve less, and the same product runs at a fraction of the bill.
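The arithmetic above fits in a few lines of Python. This is a minimal sketch: the two per-million-token prices are hypothetical placeholders chosen only to give a four-to-one output-to-input ratio, and real rates vary by provider and model.

```python
# Hypothetical prices in dollars per million tokens (assumptions, not real rates).
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def monthly_cost(conversations_per_day, prompt_tokens, answer_tokens, days=30):
    """Return (input_cost, output_cost, total) in dollars for one month."""
    input_tokens = conversations_per_day * prompt_tokens * days
    output_tokens = conversations_per_day * answer_tokens * days
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost, output_cost, input_cost + output_cost


# The scenario from the text: 50,000 conversations a day, 2,000 tokens in, 500 out.
baseline = monthly_cost(50_000, 2_000, 500)

# A leaner design: shorter prompt and a capped answer (illustrative numbers).
trimmed = monthly_cost(50_000, 1_200, 300)

print(baseline)  # (7500.0, 7500.0, 15000.0)
print(trimmed)   # (4500.0, 4500.0, 9000.0)
```

At these made-up rates the baseline lands at $15,000 a month, split evenly between input and output, and the trimmed version at $9,000. The exact figures matter less than the shape: every term is a multiplication, so trimming any one of them cuts the bill in proportion.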
Build vs buy: the decision underneath the bill
Once you know the cost is GPU time, the big strategic question comes into focus: do you rent that GPU time from a provider by the token, or own the GPUs and run models yourself? Buying through an API is wonderfully cheap to start. There is no hardware, no setup, and you pay only for what you use, which is exactly right for early products, spiky traffic, and experiments. Its weakness is that the per-token price includes the provider’s margin and never goes away; at high, steady volume you are renting forever.
Running your own GPUs flips the shape of the cost. You take on a large fixed expense, the hardware and the people to operate it, in exchange for a much lower marginal cost per token once it is running. Below some level of usage that fixed cost is dead weight and the API wins easily. Above it, your owned hardware is cheaper per token and the gap widens with every request, so heavy, predictable workloads eventually favour owning. There is a crossover point, and finding it honestly, including the cost of the team and the risk, is the heart of the build-versus-buy call. This is the teaser; Part 27 settles the on-prem versus cloud question in full.
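The crossover point can be sketched as a simple break-even calculation. Everything here is an assumption for illustration: the fixed monthly cost stands in for hardware plus the team, and both per-token prices are invented. A real comparison would also have to price in risk and how full you can keep the hardware.

```python
def break_even_tokens_per_month(fixed_monthly_cost, api_price_per_m,
                                self_hosted_marginal_per_m):
    """Monthly token volume above which owning is cheaper than the API.

    Prices are in dollars per million tokens; the fixed cost is in dollars
    per month and should include people as well as hardware.
    """
    saving_per_m = api_price_per_m - self_hosted_marginal_per_m
    if saving_per_m <= 0:
        raise ValueError("owning never breaks even unless its marginal cost is lower")
    return fixed_monthly_cost / saving_per_m * 1_000_000


# Hypothetical inputs: $40,000 a month fixed, $5 per million via the API,
# $1 per million marginal cost on owned hardware.
crossover = break_even_tokens_per_month(
    fixed_monthly_cost=40_000,
    api_price_per_m=5.00,
    self_hosted_marginal_per_m=1.00,
)

print(f"{crossover:,.0f} tokens per month")  # 10,000,000,000 tokens per month
```

Below that volume the fixed cost is dead weight and the API wins; above it, each extra token widens the gap in favour of owning. The useful exercise is to plug in your own honest numbers and see which side of the line your traffic sits on.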
Go Deeper (optional, for technical readers)
The single biggest reason API per-token prices can be so low is batching, and it turns on a fact we will dwell on in the next phase: generating a token is usually memory-bandwidth bound, not compute bound. To produce one token, the GPU must read the model’s entire set of weights from memory. That read is the bottleneck, and crucially, if many requests are processed together in a batch, that same single read of the weights serves all of them at once. So serving 1 request and serving 50 requests can consume almost the same amount of the expensive resource, memory bandwidth, which means the cost per request falls dramatically as the batch fills up.
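A toy model shows the effect. It assumes one generation step takes whichever is longer: a fixed weight read shared by the whole batch, or the compute that grows with each request. The millisecond figures are invented for illustration and do not describe any particular GPU or model.

```python
def step_time_ms(batch_size, weight_read_ms=20.0, compute_ms_per_request=0.1):
    """Time for one generation step across the whole batch.

    The weight read is paid once per step regardless of batch size;
    compute grows with the batch and only matters once it exceeds the read.
    """
    return max(weight_read_ms, batch_size * compute_ms_per_request)


def gpu_ms_per_request(batch_size):
    """GPU time attributable to each request for one generated token."""
    return step_time_ms(batch_size) / batch_size


for batch in (1, 50, 400):
    print(batch, step_time_ms(batch), round(gpu_ms_per_request(batch), 3))
```

With these numbers a batch of 1 and a batch of 50 both take 20 ms per step, so the per-request share drops from 20 ms to 0.4 ms. At a batch of 400 the step itself stretches to 40 ms because compute has become the limit, and the per-request share flattens at 0.1 ms. Batching is close to free only while the weight read dominates.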
This is why utilization is the hidden economic variable. A provider serving millions of requests keeps its GPUs packed into large batches, driving the effective cost per token far below what you could achieve running one model for one application on your own underused card. It is also why a self-hosted GPU sitting at 10% utilization is a terrible deal: you are paying for the whole chip and the whole weight-read while serving almost no one. The lesson cuts both ways. If you buy, you benefit from someone else’s scale and batching. If you build, your entire cost case depends on keeping those GPUs busy, because an idle accelerator bills you exactly the same as a saturated one. Continuous batching and the serving techniques of the next phase exist precisely to push utilization, and therefore cost-per-token, in the right direction.
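The utilization argument reduces to one division. In this sketch the hourly GPU cost and peak throughput are hypothetical, and utilization is taken to mean the fraction of that peak throughput you actually serve.

```python
def cost_per_million_tokens(gpu_cost_per_hour, peak_tokens_per_second, utilization):
    """Effective dollars per million tokens on a GPU you pay for by the hour.

    The hourly cost is fixed whether the card is busy or idle, so the
    per-token cost scales inversely with utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical card: $3.60 an hour, 2,000 tokens per second when fully batched.
for utilization in (1.0, 0.5, 0.1):
    price = cost_per_million_tokens(3.60, 2_000, utilization)
    print(f"{utilization:.0%} busy: ${price:.2f} per million tokens")
```

At full utilization this imaginary card works out to $0.50 per million tokens; at 10% the same card costs $5.00 per million, ten times more for identical hardware. That multiplier is the whole cost case for self-hosting.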
This is Part 22, the close of Phase 4, in a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It builds directly on inference cost (Part 9) and the context window (Part 10).
The Bottom Line
The economics of generative AI are less mysterious than the invoice makes them look. Strip away the token-counting and you are paying for GPU time, driven by how big your model is, how many tokens flow in and out, and how long your context is, with the model’s own output the priciest part. The strategic fork is build versus buy, renting cheap-to-start API tokens or owning GPUs that pay off only at steady scale, and the hinge underneath both is utilization, because a busy GPU shared across a full batch is what makes any token cheap.
My blunt advice: budget on realistic traffic rather than a headline price, keep prompts and answers lean as a habit, and do not own hardware you cannot keep busy. Understand these levers and AI cost stops being a scary surprise and becomes something you can design around. We have now seen that GPU time is the whole game. Phase 5 goes inside the GPU to explain why, starting with the reason these models run on graphics chips at all, and the memory wall that limits everything they do.