Dr. Pranay Jha


Why Data, Not Model Size, Usually Decides Quality (GenAI Series, Part 19)

A smaller model trained on more, cleaner data often beats a bigger one. Why parameter count is overrated, what the Chinchilla result showed, and how data curation decides quality.

Read time: 9 minutes

Generative AI Series · Part 19 of 30

TL;DR · Key Takeaways

  • Bigger is not automatically better. A smaller model trained on more and cleaner data often beats a larger model trained on less.
  • The 2022 Chinchilla result showed many famous models were badly undertrained: for their size, they had not been fed nearly enough data.
  • Data quality is decisive. Deduplication, filtering out junk, and removing contamination matter as much as raw volume.
  • Model size grabs the headlines because it is one easy number. Data is harder to talk about, which is exactly why it is underrated.

In 2022, researchers at DeepMind published a result that quietly embarrassed half the industry. They trained a model called Chinchilla with 70 billion parameters and showed it outperformed Gopher, a 280-billion-parameter model four times its size, on roughly the same training compute. The trick was not a cleverer architecture. Chinchilla was simply fed far more data for its size: about 1.4 trillion training tokens against Gopher’s roughly 300 billion. The lesson landed hard: many of the famous giant models had been built top-heavy, with lots of parameters starved of the data needed to use them. Size had been treated as the lever that mattered, and size was the wrong obsession.

[Figure: Quality comes before quantity of parameters. Pipeline: raw data (messy, huge) → clean & curate (dedup, filter) → tokenize (text → tokens) → train the model. The model is only ever as good as what survives the first three boxes.]
Training is the last step. Most of the quality is decided before a single parameter is touched.

Why size became the wrong obsession

Parameter count became the headline number for a simple reason: it is one clean figure that sounds impressive and is easy to compare. “175 billion parameters” makes a better headline than “trained on a carefully deduplicated 1.4-trillion-token corpus.” But a model’s parameters are only its capacity to learn. They are empty shelves. What fills the shelves is the data, and a vast model trained on too little or too poor data is like a huge library stocked with a few mediocre books: impressive frame, thin contents.

The Chinchilla work made this precise. For a given compute budget, there is a balance between how big the model should be and how much data it should see, and the industry had been spending too much on size and too little on data. Correcting that balance with smaller models and far more tokens produced better results for the same cost. This reframed the whole game. The frontier is not a contest to build the most enormous model; it is a contest to feed a well-sized model the best possible data. Once you see that, a lot of confusing claims about model quality start to make sense.

Garbage in, garbage out, at scale

If data is the lever, its quality is where the real work hides. The internet, the raw material for most models, is a mix of brilliance, boilerplate, spam, duplicated pages, and outright garbage. Train on it unfiltered and the model faithfully learns the garbage along with the gold, because it has no way to know which is which. So a huge share of the effort in building a good model goes into curation: throwing away the bad data before the model ever sees it. This is unglamorous, enormously consequential, and almost never makes the press release.

The curation pipeline does several jobs. It deduplicates, because the same text repeated thousands of times warps the model and wastes training on nothing new. It filters for quality, dropping spam, gibberish, and low-value pages while keeping clean, informative text. It removes toxic and harmful content so the model is less likely to reproduce it. And it screens for contamination, the accidental inclusion of test questions the model will later be evaluated on, which would make benchmark scores a lie. Each of these steps quietly raises the ceiling on how good the final model can be, and skimping on any of them shows up later as a model that is somehow worse than its size suggests.
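The first two jobs above can be sketched in a few lines. This is a toy illustration, not how any particular lab does it: the function names, the thresholds, and the "one word dominates the text" heuristic are all assumptions made up for the example, and the toxicity and contamination steps are left out entirely.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def looks_low_quality(text: str, min_words: int = 5,
                      max_repeat_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: too short, or dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio


def curate(docs):
    """Drop exact duplicates (after normalization), then low-quality docs."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same text already seen: nothing new to learn from it
        seen.add(digest)
        if looks_low_quality(doc):
            continue  # spam-like or too short to be informative
        kept.append(doc)
    return kept


raw = [
    "Gradient descent updates parameters in the direction that reduces loss.",
    "gradient descent  updates parameters in the direction that reduces loss.",
    "buy buy buy buy buy now",
    "click here",
    "Tokenizers split text into pieces that the model treats as units.",
]
clean = curate(raw)
print(len(raw), "->", len(clean))
```

Of the five toy documents, only the first and the last survive: one is a duplicate that differs only in case and spacing, one is repetitive spam, and one is too short. Real pipelines use far richer quality signals, but the shape is the same: most of the input never reaches training.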

The most striking demonstration of quality over quantity came from a line of small models built on deliberately excellent data. In a 2023 project pointedly titled “Textbooks Are All You Need,” researchers at Microsoft trained tiny models, on the order of a billion parameters, on a curated diet of textbook-quality and carefully generated examples rather than the raw firehose of the web. Those small models matched or beat much larger ones on reasoning and coding tasks. The headline was deliberately provocative, but the point is sober: a billion or so parameters fed genuinely good material can outperform tens of billions fed average internet text. It is one of the cleanest demonstrations that the bottleneck was never just scale. Feed a modest model a curriculum instead of a junk drawer and it punches far above its weight, which is why “what did it train on” has quietly become the most important question about any model.

[Figure: The data-quality funnel. Raw web data (vast and messy) → deduplicate → filter for quality → remove toxic / harmful → screen out contamination → clean data. Most of the web is discarded.]
What reaches the model is a small, hard-won fraction of what was collected.

Where tokenization quietly matters

There is one more data decision that punches above its weight: how the text is turned into tokens, the process from Part 7. The tokenizer is trained on data too, and the vocabulary it learns shapes how efficiently the model can represent everything afterward. A tokenizer that splits a language into too many tiny pieces makes every sentence in that language longer in tokens, which means the model needs more context and more compute to handle the same content, and effectively sees less of that language for the same budget. This is why models trained mostly on English are clumsier and pricier on other languages: not only did they see less of those languages, they encode them less efficiently too.
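A toy example makes the efficiency gap concrete. The sketch below is not a real tokenizer: it is a greedy longest-match tokenizer with a tiny hand-written vocabulary (an assumption standing in for "a vocabulary learned mostly from English") that falls back to one token per UTF-8 byte for anything it does not know. The vocabulary, the function name, and both sample sentences are made up for illustration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with a UTF-8 byte fallback."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary piece that fits at position i.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            # Unknown character: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabulary that only knows English pieces.
ENGLISH_HEAVY_VOCAB = {"The", " model", " learn", "s", " from", " data"}

english = "The model learns from data"
hindi = "मॉडल डेटा से सीखता है"  # roughly the same sentence in Hindi

en_tokens = tokenize(english, ENGLISH_HEAVY_VOCAB)
hi_tokens = tokenize(hindi, ENGLISH_HEAVY_VOCAB)
print("English tokens:", len(en_tokens))
print("Hindi tokens:  ", len(hi_tokens))
```

The English sentence comes out as six tokens, while the Hindi sentence costs one token per byte because nothing in the vocabulary covers it. Real tokenizers are much better than this worst case, but the direction of the effect is the same: a language the vocabulary was not built for takes more tokens, and therefore more context and compute, to say the same thing.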

The broader point is that “data” is not just a pile of text you scrape and dump in. It is collected, deduplicated, filtered, balanced across domains and languages, and encoded through a tokenizer that was itself a design choice. Every one of those steps is a lever on final quality, and none of them is captured by the parameter count on the spec sheet. When two models of the same size behave very differently, the explanation almost always lives in this invisible data work rather than in the architecture.

Reality check: be skeptical of “our model is bigger so it is better.” Size is one input among several, and on its own it is a weak predictor of quality. I would ask instead how much data the model saw, how clean it was, and how it was curated. Those are the questions vendors rarely lead with, precisely because they are harder to reduce to a single shiny number.

Go Deeper (optional, for technical readers)

The Chinchilla paper formalised this as scaling laws. For a fixed compute budget, model performance is a predictable function of two things you can trade against each other: the number of parameters and the number of training tokens. The headline finding was a rough rule that parameters and tokens should scale together in roughly equal proportion, implying that compute-optimal models are smaller and trained on far more data than the 2020-era giants. Those earlier models had poured compute into parameters while leaving them data-starved, which is why a right-sized model on more tokens could overtake them.
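The trade-off can be made tangible with back-of-envelope arithmetic. The sketch below rests on two commonly quoted approximations, both assumptions here rather than exact results: training compute is about 6 FLOPs per parameter per token, and the compute-optimal point sits near 20 training tokens per parameter. The function names are hypothetical, and the 70-billion-parameter, 1.4-trillion-token figures are the commonly reported Chinchilla numbers.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: compute C is roughly 6 * N * D
TOKENS_PER_PARAM = 20       # assumption: rule of thumb D is roughly 20 * N


def training_flops(n_params, n_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flops_budget):
    """Solve C = 6 * N * (20 * N) for N, then derive D = 20 * N."""
    n_params = math.sqrt(
        flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    return n_params, TOKENS_PER_PARAM * n_params


def tokens_affordable(flops_budget, n_params):
    """How many tokens a model of a given size can see within the budget."""
    return flops_budget / (FLOPS_PER_PARAM_TOKEN * n_params)


# Budget implied by a 70B-parameter model trained on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)

n_opt, d_opt = compute_optimal(budget)
big_model_tokens = tokens_affordable(budget, 280e9)

print(f"budget: {budget:.2e} FLOPs")
print(f"compute-optimal: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.1f}T tokens")
print(f"280B model, same budget: {big_model_tokens / 1e9:.0f}B tokens")
```

Under these assumptions the same budget that trains a 70-billion-parameter model on 1.4 trillion tokens lets a 280-billion-parameter model see only about 350 billion tokens, a little over one token per parameter. That is the "data-starved giant" in numbers.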

Two data hazards are worth naming precisely. Deduplication is not just tidiness: near-duplicate documents, common in web scrapes, cause the model to over-memorise repeated passages and can degrade generalisation, so good pipelines do fuzzy duplicate detection, not just exact matching. Contamination is the more insidious one: if test-set examples leak into the training data, the model can score brilliantly on a benchmark simply by having memorised the answers, making the benchmark meaningless. Detecting and removing this overlap is a constant, imperfect battle, and it is a major reason to distrust eye-popping benchmark numbers without knowing how contamination was controlled.

A practical caveat to the scaling laws themselves: they assume a fixed compute budget and say nothing about inference cost, which is why, as Part 9 argued, teams deploying models often deliberately over-train smaller models past the compute-optimal point to get something cheaper to serve.
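Both data hazards named above, near-duplicates and benchmark contamination, come down to measuring overlap between texts. A minimal sketch, assuming word n-gram "shingles" and Jaccard similarity as the overlap measure: the function names, the 0.6 threshold, and the 8-word window are illustrative choices, not values taken from any specific pipeline. Large-scale systems typically approximate this with techniques such as MinHash rather than comparing documents pair by pair.

```python
import re


def shingles(text, n=3):
    """Set of overlapping word n-grams, lowercased, punctuation ignored."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Shared shingles divided by all shingles; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_near_duplicate(doc_a, doc_b, threshold=0.6):
    """Fuzzy duplicate check: high shingle overlap, not exact equality."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


def is_contaminated(train_doc, benchmark_items, n=8):
    """Flag a training doc that shares any long n-gram with a test item."""
    train_grams = shingles(train_doc, n)
    return any(train_grams & shingles(item, n) for item in benchmark_items)


a = "The quick brown fox jumps over the lazy dog near the river bank today"
b = "The quick brown fox jumps over the lazy dog near the river bank yesterday"
c = "Tokenizers split text into pieces that the model treats as units"

benchmark = ["What is the capital of France and which river runs through it?"]
leaky = ("Forum post: what is the capital of France and which river runs "
         "through it? Answer: Paris, the Seine.")

print(is_near_duplicate(a, b), is_near_duplicate(a, c))
print(is_contaminated(leaky, benchmark), is_contaminated(a, benchmark))
```

Documents `a` and `b` differ by a single word, so an exact-match hash would treat them as distinct while the shingle overlap flags them as near-duplicates. The contamination check is deliberately strict about length: a long shared word sequence is unlikely to be a coincidence, whereas short overlaps are common in ordinary text.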

This is Part 19, the start of Phase 4, in a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It builds on tokenization (Part 7) and the cost framing in Part 9.

The Bottom Line

Parameters are capacity; data is what fills it, and an empty-headed giant tends to lose to a well-read smaller model. The Chinchilla result made that concrete by showing how badly the industry had over-invested in size and under-invested in data, and the years since have been a scramble to feed models more and cleaner tokens rather than simply more parameters. Deduplication, filtering, contamination control, and even the tokenizer are the real, invisible levers on quality.

The takeaway I would hold onto: when you hear that a model is good, ask what it ate, not just how big it is. Data is the part nobody can reduce to a single number, which is exactly why it is both underrated and decisive. That said, size still has real costs at serving time, and the next part tackles a clever way to cut them, shrinking a model after training so it runs on hardware it has no business running on. Quantization is where we go next.

About the Author

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.
