TL;DR · Key Takeaways
- Multimodal models handle more than text: images, audio, and video, often several at once, in a single model.
- The trick is unification. Every input is converted into the same kind of thing, vectors in a shared space, so one transformer can process them all.
- An image becomes a grid of patches, each turned into an embedding and fed in alongside word tokens, so “a picture” and “a sentence” look similar to the model.
- This unlocks reading documents, answering questions about photos, describing charts, and transcribing speech, all with the same core machinery you already understand.
Point your phone at the inside of your fridge and ask what you can cook. Photograph a parking sign in a foreign city and ask whether you are allowed to park. Paste a screenshot of a confusing chart and ask what it means. A few years ago each of those was a separate, specialised research problem. Today one model handles all of them, in the same conversation, because the newest models do not only read and write; they also see and hear. The surprising part is how little new machinery this required. Multimodal AI is mostly the ideas you already have from this series, pointed at pixels and sound.
What “multimodal” means
A “modality” is just a type of data: text is one, images another, audio another, video another still. A model is multimodal when it can take in or produce more than one of these. The most common kind today is the vision-language model, which accepts both text and images and replies in text, the thing that lets you upload a photo and ask about it. Others handle speech in and out, and some can generate images or audio as output. The combinations vary, but the family resemblance is that a single model crosses the boundary between senses that used to require separate systems bolted together.
That matters because the old way was clumsy. To answer a question about an image, you once needed one system to caption the picture, another to read the text, and a third to reason over the results, each a separate model with its own failure points. A single multimodal model collapses that pipeline. It can look at a diagram and the question about it at the same time, letting the image and the words inform each other directly, the way you do when you glance at a chart while reading its caption. The integration is the whole advantage.
How one model spans different senses
The unlock is an idea you already met in Part 7: turn everything into vectors in a shared meaning-space. A language model works because words become embeddings, points in a space where similar meanings sit close. Multimodal models extend the same move to other senses. An image is converted into embeddings too, and crucially into the same space as the words, so a picture of a dog and the word “dog” land near each other. Once both live as vectors in one space, the transformer from Part 8 does not much care whether a given vector came from a sentence or a photograph. It just attends over them all together.
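To make "land near each other" concrete, here is a minimal sketch of how closeness in a shared space is commonly measured, using cosine similarity. The three vectors are hand-picked toy values, not the output of any real model, and the names `word_dog`, `image_dog`, and `word_car` are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors (hypothetical, not produced by a real encoder).
word_dog = [0.9, 0.1, 0.0]   # embedding of the word "dog"
image_dog = [0.8, 0.2, 0.1]  # embedding of a photo of a dog
word_car = [0.0, 0.2, 0.9]   # embedding of the word "car"

sim_match = cosine_similarity(image_dog, word_dog)
sim_mismatch = cosine_similarity(image_dog, word_car)

# The photo sits much closer to its matching word than to an unrelated one.
print(sim_match > sim_mismatch)  # prints: True
```

Real embeddings have hundreds or thousands of dimensions rather than three, but the comparison works the same way.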
For images specifically, the model chops the picture into a grid of small patches, like tiles, and turns each patch into an embedding, much as text is chopped into tokens. Those patch-embeddings are then fed into the model right alongside the word tokens, as if the image were a little paragraph written in visual vocabulary. Audio gets a similar treatment: the sound is sliced into short segments and encoded into vectors. The deep lesson is that the transformer is not really a “language” machine at all. It is a sequence-of-vectors machine, and language was simply the first sequence we taught it. Feed it vectors that happen to come from pixels and it works on those too.
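The chopping step can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: the "image" is a tiny 4×4 grayscale grid, the helper name `split_into_patches` is made up for this example, and the final list mixes raw patches with raw words only to show the ordering. In a real model both would be converted to embeddings before being joined.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid of pixel values into flattened square patches, row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch's pixels row by row into one flat list.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A toy 4x4 grayscale "image" whose pixel values are simply 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)

# The image patches go into the same sequence as the text tokens.
text_tokens = ["what", "is", "this", "?"]
sequence = [("image", p) for p in patches] + [("text", t) for t in text_tokens]

print(len(patches), len(sequence))  # prints: 4 8
```

The top-left patch here is `[0, 1, 4, 5]`: two pixels from the first row and two from the second, flattened into one list.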
What it unlocks in practice
The practical range is wide and growing. Vision-language models read documents and screenshots, turning a photographed form, a receipt, or a slide into structured data, which quietly automates a huge amount of dull office work. They answer questions about images, from “what is wrong with this rash” to “which wire connects where,” and they describe visuals for accessibility, giving blind users a sentence about what is on screen. They read charts and tables that would baffle a text-only model. On the audio side, speech recognition has become near-effortless, and models that take voice in and give voice out are making spoken conversation with AI feel natural.
It is worth being clear-eyed, though, because the seams still show. Multimodal models are noticeably less reliable on images than on text: they miscount objects, misread cluttered scenes, fumble fine detail in dense documents, and can hallucinate about a picture as confidently as they hallucinate about facts. The capability is real and genuinely useful, but it is younger and rougher than the text abilities we have spent this series on. Treat a model’s reading of an image the way you would treat its reading of a fact: impressive, helpful, and in need of a check when it matters.
Go Deeper (optional, for technical readers)
How does an image actually enter the same space as text? The dominant approach uses a Vision Transformer (ViT). The image is split into fixed-size patches, say 16×16 pixels, and each patch is flattened and linearly projected into a vector, exactly analogous to how a token becomes an embedding. Position information is added so the model knows where each patch sat in the grid, and that sequence of patch-vectors is processed by transformer layers just like a sequence of word-vectors. The result is that “an image” arrives at the language model as a short sequence of embeddings it can attend to.
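As a rough sketch of the projection step, the snippet below multiplies each flattened patch by a weight matrix and adds a per-position vector. It makes several assumptions: the weights and position vectors are random stand-ins for values a real model would learn, the bias term and any extra special tokens are left out, and the function names `patch_count` and `embed_patches` are invented for this example.

```python
import random

def patch_count(image_size, patch_size):
    """Number of patches when a square image is tiled by square patches."""
    per_side = image_size // patch_size
    return per_side * per_side

def embed_patches(patches, weights, position_embeddings):
    """Linearly project each flattened patch, then add its position vector.

    weights holds one row per output dimension; each row has one
    entry per pixel value in a flattened patch.
    """
    embedded = []
    for index, patch in enumerate(patches):
        projected = [sum(p * w for p, w in zip(patch, row)) for row in weights]
        embedded.append([v + pos
                         for v, pos in zip(projected, position_embeddings[index])])
    return embedded

rng = random.Random(0)
patch_dim, embed_dim, num_patches = 4, 3, 4

# Random stand-ins: in a trained model these values are learned.
patches = [[rng.random() for _ in range(patch_dim)] for _ in range(num_patches)]
weights = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(embed_dim)]
positions = [[rng.gauss(0, 1) for _ in range(embed_dim)] for _ in range(num_patches)]

tokens = embed_patches(patches, weights, positions)
print(len(tokens), len(tokens[0]))  # prints: 4 3

# A 224x224 image cut into 16x16 patches gives a 14x14 grid.
print(patch_count(224, 16))  # prints: 196
```

The second print shows why the sequence stays manageable: at these sizes a whole image becomes 196 vectors, comparable to a few paragraphs of text.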
The bridge between vision and language is built by training. Models in the CLIP lineage are trained on huge sets of image-and-caption pairs with a contrastive objective: pull the embedding of an image and the embedding of its matching caption together, and push mismatched pairs apart. After enough of this, the picture of a beach and the words “a sandy beach” land near each other in the shared space, which is what lets a downstream model relate the two. Full vision-language models then connect such a visual encoder to a language model through a small projection layer and fine-tune the combined system on visual question-answering data, in some designs keeping the visual encoder itself frozen. The elegance is that almost none of this is new: it is the embedding idea, the transformer, and contrastive training, recombined so that pixels and words speak the same vector language.
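The contrastive objective can be written down compactly. The sketch below is a simplified, dependency-free version of a symmetric contrastive loss in the style described for CLIP, not that model's actual training code. The tiny two-dimensional embeddings are made up, and the `temperature` default of 0.07 is an assumption chosen for illustration (in practice it is typically a learned value).

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should match caption i, and caption i image i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in texts] for img in images]
    n = len(images)
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy batch of two pairs (made-up vectors).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # caption i matches image i
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # captions paired with the wrong images

aligned_loss = contrastive_loss(image_embs, aligned_texts)
swapped_loss = contrastive_loss(image_embs, swapped_texts)
print(aligned_loss < swapped_loss)  # prints: True
```

Training nudges the encoders in whatever direction lowers this number, which is what gradually moves matching images and captions toward each other in the shared space.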
This is Part 17 of a 30-part walk from zero to the infrastructure behind production AI. The full map is on the Generative AI Complete Guide. It reuses the embeddings idea from Part 7 and the transformer from Part 8.
The Bottom Line
Multimodal AI sounds like a leap into new technology, but it is really the same foundations pointed at new kinds of data. Turn an image into patches, turn patches into vectors in the same space as words, and the transformer you already understand can reason over pictures and sound as comfortably as over sentences. That unification is why one model can read your documents, answer questions about a photo, and transcribe your voice without three separate systems.
The honest caveat is that seeing still lags reading: the vision and audio skills are real and useful but rougher, so the same verify-what-matters discipline applies. With multimodality covered, Phase 3 has shown how to use these models well. The series now turns to a question that decides whether any of it is trustworthy in practice: how do you actually measure whether a model’s output is good, beyond it merely looking good? That is where we go next.
References
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2020)
- Learning Transferable Visual Models From Natural Language Supervision (CLIP) (Radford et al., 2021)
- What is multimodal AI? (IBM)