What Generative AI Can and Cannot Do (GenAI Series, Part 5)

An honest look at generative AI: what it is genuinely good at, where it quietly fails, and why hallucination comes from the same machinery that writes its best answers.

By Dr. Pranay Jha · June 18, 2026 · 9-minute read

Generative AI Series · Part 5 of 30

TL;DR · Key Takeaways

Generative AI is excellent at tasks of language and pattern: drafting, summarising, translating, rephrasing, brainstorming, and turning rough input into structured output.
It is unreliable at tasks that need guaranteed facts, precise math, or current events, because it predicts plausible text rather than retrieving truth.
Hallucination is not a rare bug. It is the same machinery that writes good answers, producing a wrong one with the same confident tone.
The honest rule of thumb: use it where the goal is a fluent draft that a human will check, and avoid it where being wrong is expensive and nobody verifies.

Now that we know what a model is (a fixed function that guesses the next token), the obvious question is what that buys you in practice. This is where a lot of people get burned, in both directions. Some treat generative AI as an oracle and trust it with things it has no business deciding. Others write it off after one wrong answer and miss where it genuinely shines. The reality sits in between, and it follows directly from how these models work. This part is deliberately opinionated, because “it depends” is not useful advice. The goal is a clear sense of when to reach for GenAI and when to keep your hands off.

The split is not random. It tracks one line: language and pattern on the left, ground truth on the right.

Where it genuinely shines

Generative AI is at its best when the task is fundamentally about language and pattern, and when a knowledgeable human will look over the output. Drafting falls squarely here. Asking a model for a first version of an email, a job description, a function, or a blog outline turns a blank page into something to react to, and reacting is far easier than starting. It is similarly strong at transformation: take this long report and summarise it, take these bullet points and make them a paragraph, take this English and render it in Spanish, take this jargon and explain it to a beginner. None of these need the model to know a new fact. They ask it to reshape text it has already been given, which is exactly what next-token prediction does well.

It also excels at breadth on demand. A model has seen an enormous range of writing, so it is a tireless brainstorming partner that never runs short of angles, and it can extract structure from mess, pulling names, dates, and amounts out of a wall of unformatted text into a tidy table. The common thread in every one of these wins is the same: the cost of a mistake is low because a person is in the loop, and the value comes from speed and fluency rather than from guaranteed correctness. That is the sweet spot.

Where it quietly fails

The failures come from the same root. Because a model produces plausible text rather than retrieving verified records, it is shaky wherever the answer must be exactly right. Ask it for a specific statistic, a legal citation, or a quote, and it may hand you something that looks perfect and is entirely invented. It is weak at precise arithmetic for the same reason: it is predicting what a calculation tends to look like, not actually computing it, so long multiplications and careful counting can quietly go wrong. It has no knowledge of events after its training cut-off, so it cannot tell you today’s news or this morning’s stock price unless it is connected to a live tool. And it knows nothing about your private documents, your customers, or your internal systems unless you supply that context in the moment.
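The arithmetic point suggests a simple habit: recompute any number a model hands you instead of trusting how it looks. Below is a minimal sketch of that habit. The function name `check_product_claim` and the claim strings are hypothetical stand-ins for model output, and no real model is called.

```python
import re

def check_product_claim(claim: str) -> bool:
    """Return True if a claim of the form 'A x B = C' is arithmetically correct.

    `claim` stands in for text a model might produce. The examples below
    are made up for illustration; they are not real model output.
    """
    match = re.fullmatch(r"\s*(\d+)\s*[x*]\s*(\d+)\s*=\s*(\d+)\s*", claim)
    if match is None:
        raise ValueError(f"unrecognised claim format: {claim!r}")
    a, b, claimed = (int(group) for group in match.groups())
    # The interpreter actually computes the product; it does not guess.
    return a * b == claimed

# Two answers that look equally confident on the page.
looks_right_and_is = "48217 x 9634 = 464522578"
looks_right_but_is_not = "48217 x 9634 = 464522758"  # two digits swapped

for claim in (looks_right_and_is, looks_right_but_is_not):
    verdict = "verified" if check_product_claim(claim) else "WRONG"
    print(f"{claim} -> {verdict}")
```

Both claims have the same shape and length, which is the point: only recomputation tells them apart. This is also roughly what connecting a model to a calculator tool achieves, although the details of real tool integrations vary and are not shown here.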

The dangerous part is not that these failures happen. It is that they arrive wearing the same confident voice as the correct answers. A model has no separate tone for “I am sure” versus “I am guessing,” because internally it is always doing the same thing. That is why the worst place to use raw GenAI is any high-stakes decision that nobody double-checks: medical, legal, or financial calls made on the model’s say-so, or automated actions with no human review. In those settings the occasional confident error is not a quirk, it is a liability.

Two questions, cost of error and presence of a checker, decide most cases.

The first honest look at hallucination

We met the word hallucination in Part 2; here is why it sits at the centre of what GenAI cannot promise. A hallucination is the model stating something false as if it were plainly true. It is tempting to imagine this as a glitch, a wire crossed somewhere that could be patched. It is not. The same process that produces a correct, useful sentence produces the hallucination. The model is always assembling the most plausible next token given everything so far, and most of the time plausible and true line up. When they do not, you get a fluent, well-formed, completely wrong answer, generated with the identical mechanism that gave you the right one a moment earlier.
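To make "the identical mechanism" concrete, here is a toy sketch of sampling a next token. The prompt, the three candidate tokens, and their probabilities are all invented for illustration. A real model scores a vocabulary of many thousands of tokens and conditions on the full context, but the shape of the step is the same.

```python
import random

# Hypothetical next-token probabilities for the prompt
# "The capital of Australia is". The numbers are invented.
NEXT_TOKEN_PROBS = {
    "Canberra": 0.70,   # true
    "Sydney": 0.25,     # plausible but false
    "Melbourne": 0.05,  # plausible but false
}

def sample_next_token(probs: dict[str, float], rng: random.Random) -> str:
    """Pick one token in proportion to its probability.

    Nothing here consults a store of facts: the true continuation and
    the false ones are produced by the same call.
    """
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
completions = [sample_next_token(NEXT_TOKEN_PROBS, rng) for _ in range(1000)]
wrong = sum(token != "Canberra" for token in completions)
print(f"{wrong} of {len(completions)} completions are fluent and false")
```

There is no branch in `sample_next_token` that could be labelled "hallucinate". The wrong answers come out of the same line of code as the right ones, which is why they cannot be patched out as a separate bug.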

This reframes the whole question of trust. You cannot eliminate hallucination by asking the model to be more careful, because it has no separate “careful mode” and no internal fact-checker to consult. What you can do is design around it: keep a human in the loop, give the model real source documents to ground its answers (the retrieval approach we cover later), and never let a single unverified output drive an important action. The models are improving and the rate is dropping, but the honest framing for now is that fluency is guaranteed and accuracy is not. Treat confidence as style, never as evidence.

Both lines leave the model the same way. The text gives you no signal about which one to trust.

Reality check: the right question is rarely “can GenAI do this task?” It usually can produce something. The right question is “what does a wrong answer cost here, and who catches it?” Answer those two and the decision to use it almost makes itself.
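The two questions can be written down as a tiny decision function. This is only a sketch: the function name `genai_fit` and the example tasks are hypothetical. The two clear-cut corners come from the text above, while the label for the mixed cases is an assumption added here.

```python
def genai_fit(wrong_answer_is_costly: bool, someone_checks_output: bool) -> str:
    """Classify a task by cost of error and presence of a checker."""
    if someone_checks_output and not wrong_answer_is_costly:
        return "good fit"
    if wrong_answer_is_costly and not someone_checks_output:
        return "avoid"
    # Assumption for this sketch: mixed cases are left to human judgment.
    return "judgment call"

# Hypothetical tasks: (wrong_answer_is_costly, someone_checks_output)
tasks = {
    "first draft of a blog outline, edited by the author": (False, True),
    "automated financial decision with no review": (True, False),
    "legal summary reviewed by a lawyer": (True, True),
}

for task, (costly, checked) in tasks.items():
    print(f"{task}: {genai_fit(costly, checked)}")
```

Notice that the function never asks whether the model is capable of producing output for the task. Only the cost of an error and the presence of a checker go in.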

Go Deeper (optional, for technical readers)

Whether models “reason” is genuinely contested, and the disagreement is partly about words. Skeptics argue that a next-token predictor is doing sophisticated pattern completion, reproducing the shape of reasoning seen in training rather than reasoning from principles, and they point to brittle failures on lightly altered puzzles as evidence. Others counter that if a system reliably solves novel multi-step problems, insisting it is “only” pattern matching is a distinction without a measurable difference. The newer “reasoning” models, which are trained to produce long intermediate working before answering, clearly help on math and logic, but they do not abolish the underlying behaviour.

This is also why benchmark claims deserve a careful eye. A headline like “scores 90% on exam X” can mislead for several reasons: the test questions, or close paraphrases, may have leaked into training data (contamination); a single percentage hides which kinds of problems it fails; and a model tuned to a benchmark’s format can post a high number without matching real-world usefulness. When you read a benchmark, ask what exactly was measured, whether the eval set could have been seen during training, and whether the metric reflects the task you actually care about. We devote a full later part to evaluating GenAI output properly, because “it scored well” and “it works for me” are not the same claim.
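One of those points, that a single percentage hides which kinds of problems fail, is easy to see with a small breakdown. The sketch below uses invented results for a made-up 100-question exam. The category names and counts are assumptions chosen for illustration, not data from any real benchmark.

```python
from collections import defaultdict

# Invented results: (category, answered_correctly) for 100 questions.
results = (
    [("reading comprehension", True)] * 48
    + [("reading comprehension", False)] * 2
    + [("translation", True)] * 38
    + [("translation", False)] * 2
    + [("multi-digit arithmetic", True)] * 4
    + [("multi-digit arithmetic", False)] * 6
)

def accuracy_by_category(rows):
    """Return {category: fraction correct} for (category, correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in rows:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}

overall = sum(is_correct for _, is_correct in results) / len(results)
per_category = accuracy_by_category(results)

print(f"overall: {overall:.0%}")
for category, accuracy in per_category.items():
    print(f"{category}: {accuracy:.0%}")
```

With these made-up numbers the headline figure is 90%, yet the arithmetic category sits at 40%. A reader who only saw the headline would have no way to know that.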

This closes Phase 1 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. If you have just arrived, the natural start is Part 1, what generative AI actually is, then Part 4 on what a model really is.

The Bottom Line

Generative AI is a remarkable tool with a precise shape. It is brilliant at language and pattern work, drafting, summarising, translating, reshaping, and brainstorming, especially when a person reviews what it produces. It is unreliable wherever the answer must be exactly true, freshly current, or drawn from data it never saw, and its hallucinations come dressed in the same confident voice as its best work. The whole of using it well reduces to one habit: match the task to the tool, weigh what a wrong answer costs, and make sure someone or something catches the errors. That clear-eyed view is the foundation for everything ahead. With Phase 1 done, we go under the hood next and start with the gentle question of how neural networks actually learn, no heavy math required. Which task on your own plate fits the “good fit” column, and which one have you been forcing?



About The Author

Dr. Pranay Jha

Dr. Pranay Jha is a Cloud and AI Consultant with 18+ years of experience in hybrid cloud, virtualization, and enterprise infrastructure transformation. He specializes in VMware technologies, multi-cloud strategy, and Generative AI solutions. He holds a PhD in Computer Applications with research focused on Cloud and AI, has published multiple research papers, and has been a VMware vExpert since 2016 and a VMUG Community Leader.