Track 01 · Foundations · Beginner-friendly · ~9 min

What is multimodal AI?

You already use more than one sense to understand the world — you read a sign, you hear a voice, you glance at a picture, and your mind ties them together. Multimodal AI is a single model that can do something similar: take in more than one kind of input — words, images, sound, video — and relate them to each other. This lesson explains what a "modality" is, why one model handling several is so useful, and, in plain terms, how it works — right here on the page.

01 · What "modality" means, and why combine them

Start with one plain word. A modality is just a kind of input: a type of data the model receives. Text is one modality. An image is another. So is audio. So is video. A model that only reads text has a single modality; we call it unimodal. A model that can take in several kinds of input at once is multimodal.

Why bother? Because the kinds of input often belong together. A photo and the question "what's in this picture?" make sense as a pair. A chart and "what's the trend here?" do too. When one model can hold several modalities at the same time, it can relate them, letting what it sees inform what it reads, and the other way around.

Walkthrough: One modality vs. several

Unimodal: the model takes in just one kind of input — here, text. It can be excellent at language, but it cannot see a picture or hear a sound.

A modality is a type of input — text, image, audio, or video are the common ones.
A unimodal model handles one kind; a multimodal model handles several at once.
The power of multimodal AI is that it can relate one modality to another — using a picture to answer a question, for example.

02 · How it works: one shared space for everything

Here is the key idea, kept simple. A computer can't compare a picture and a sentence directly, because they're completely different kinds of data: one is a grid of pixels, the other a string of characters. So a multimodal model first turns each input into the same thing: a list of numbers called an embedding (think of it as a set of coordinates that captures meaning).

Crucially, every modality is mapped into one shared space. In that space, items that mean similar things land close together, no matter which modality they came from. So a photo of a dog and the words "a dog" end up near each other, while a photo of a dog and the words "a bicycle" sit far apart. Once everything lives in the same space, the model can finally compare and relate inputs.

Explore: Inputs to a multimodal model

Many inputs: text · image · audio
Encoders: turn each into numbers
Shared space: one place to compare
Relate & respond: connect, then answer

The starting point: Many inputs

A multimodal model can receive more than one kind of input, for example a sentence of text alongside a photo, or a spoken question with what a camera sees. Each kind is a different modality; the next steps turn them into something the model can compare.

Each input is turned into an embedding — a list of numbers that represents its meaning.
Every modality is mapped into one shared representation space, so different kinds of input can be compared.
In that space, related items land close together and unrelated ones land far apart — regardless of modality.
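
To make "close together" concrete, here is a minimal sketch that measures closeness with cosine similarity, one common way to compare embeddings. The three-number vectors are invented for illustration (real encoders produce far longer lists), and the names `cosine_similarity` and `embeddings` are hypothetical, not part of any particular library.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked for this example only.
# A real model would compute these with an image encoder and a text encoder.
embeddings = {
    "photo of a dog": [0.9, 0.1, 0.2],
    "text: a dog": [0.8, 0.2, 0.1],
    "text: a bicycle": [0.1, 0.9, 0.3],
}

photo = embeddings["photo of a dog"]
sim_dog = cosine_similarity(photo, embeddings["text: a dog"])
sim_bike = cosine_similarity(photo, embeddings["text: a bicycle"])

print(f"photo vs 'a dog':     {sim_dog:.2f}")
print(f"photo vs 'a bicycle': {sim_bike:.2f}")
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are unrelated. With these toy numbers the photo scores much higher against "a dog" than against "a bicycle", which is the behaviour the shared space is meant to produce.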

03 · Aligning images and text: the shared-space explorer

This is the heart of multimodal AI. CLIP-style models learn to align images and text: during training, a matching image and caption are pulled close together in the shared space, while mismatched pairs are pushed apart. After enough examples, the model can relate pictures and words it never saw during training, without anyone hand-labeling every object.

Picture an image input and several text inputs all living in one shared space, each landing closest to whatever it matches best. The picture of a cat sits near the words "a cat," and far from "a bicycle." The closeness numbers in this lesson's explorer are illustrative: chosen to show the idea, not produced by a trained model.

Alignment means matching pairs (an image and its caption) are placed close together in the shared space.
Closeness is how the model relates a picture to words — the closest text is the best-matching description.
This is learned from large collections of image–text pairs, not from hand-labeling every object.
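
The "pull matches together, push mismatches apart" idea can be written as a loss: a number that is small when each image is closest to its own caption and large when it is not. The sketch below is a simplified, plain-Python version of a symmetric contrastive loss in the spirit of the CLIP paper listed under Sources. The `temperature` value, the function names, and the two-number vectors are assumptions made for illustration; a real model learns its embeddings over very large batches.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target_index):
    """Negative log of the softmax probability assigned to the target."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Symmetric contrastive loss for a batch where image i matches text i."""
    n = len(image_vecs)
    # Similarity of every image with every text, scaled by the temperature.
    sims = [[dot(img, txt) / temperature for txt in text_vecs]
            for img in image_vecs]
    # Each image should pick out its own caption (rows of the table) ...
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    # ... and each caption should pick out its own image (columns).
    text_to_image = sum(
        cross_entropy([sims[r][j] for r in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Hypothetical unit-length embeddings for a batch of two image-caption pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]     # caption i points the same way as image i
misaligned_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions swapped

aligned_loss = contrastive_loss(images, aligned_texts)
misaligned_loss = contrastive_loss(images, misaligned_texts)
print(f"aligned: {aligned_loss:.4f}  misaligned: {misaligned_loss:.4f}")
```

Training nudges the encoders so this loss goes down, which is exactly what moves matching pairs closer and mismatched pairs further apart.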

04 · Vision-language & native-multimodal models

Aligning images and text is the foundation; modern systems build on it in different ways. These are the kinds you'll meet most:

Image–text alignment — the foundation

A CLIP-style model learns one shared space for images and text by pulling matching pairs together and pushing mismatches apart. On its own this is powerful: it can match a photo to the best caption, or find images from a text search — relating the two modalities without generating new text.

good for: image search, matching photos to captions, zero-shot labeling

idea: put images and words in one space, measure closeness
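
As a rough sketch of zero-shot labeling, the code below scores one image embedding against a few candidate text prompts and turns the scores into probabilities with a softmax. Everything here is assumed for illustration: the unit-length vectors are hand-picked, and `zero_shot_label`, `softmax`, and the `temperature` value are hypothetical names and settings rather than a real library's API.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_label(image_vec, label_vecs, temperature=0.1):
    """Pick the text label whose embedding is closest to the image embedding."""
    labels = list(label_vecs)
    # For unit-length vectors the dot product equals the cosine similarity.
    sims = [
        sum(x * y for x, y in zip(image_vec, label_vecs[label])) / temperature
        for label in labels
    ]
    probs = dict(zip(labels, softmax(sims)))
    best = max(labels, key=lambda label: probs[label])
    return best, probs

# Hypothetical unit-length embeddings; a trained model would compute these.
image_vec = [0.6, 0.8]  # pretend this came from a photo of a cat
label_vecs = {
    "a photo of a cat": [0.8, 0.6],
    "a photo of a dog": [1.0, 0.0],
    "a photo of a bicycle": [-0.6, 0.8],
}

best_label, probabilities = zero_shot_label(image_vec, label_vecs)
print(best_label)
for label, p in probabilities.items():
    print(f"{label}: {p:.1%}")
```

No classifier was trained on "cat", "dog", or "bicycle" here: the labels are just text placed in the same space as the image, and the closest one wins. Image search works the same way in reverse, ranking many image embeddings against one text query.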

Vision-language models (VLMs)

A vision-language model takes both images and text as input and can reason about them together — describing a photo, or answering a question about a chart. It pairs a way of "seeing" the image with a language model that can talk about what it sees.

good for: image captioning, visual question answering, reading diagrams

idea: see the image, then answer in words
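
One common way to pair "seeing" with a language model (an assumption here, since this lesson does not commit to a specific design) is to cut the image into patches, encode each patch, project those features to the same width as the language model's word embeddings, and place them in front of the text. The sketch below shows only that wiring with toy numbers; `project`, `build_sequence`, and all the values are hypothetical.

```python
def project(vec, weights):
    """Multiply a vector by a matrix stored as a list of rows."""
    out_width = len(weights[0])
    return [
        sum(vec[i] * weights[i][j] for i in range(len(vec)))
        for j in range(out_width)
    ]

def build_sequence(patch_features, projection, text_token_embeddings):
    """Project image patches to the text width, then prepend them to the text."""
    image_tokens = [project(patch, projection) for patch in patch_features]
    return image_tokens + text_token_embeddings

# Toy numbers, invented for illustration.
patch_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]       # 3 image patches, width 2
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]              # maps width 2 to width 3
text_token_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # 2 text tokens, width 3

sequence = build_sequence(patch_features, projection, text_token_embeddings)
print(f"{len(sequence)} positions, each of width {len(sequence[0])}")
```

After this step the language model receives one sequence in which image positions and word positions look alike, so it can attend to both when writing its answer. In a real system the projection weights are learned rather than hand-written.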

Native-multimodal models

A native-multimodal model is built from the start to handle several input types — and sometimes several output types — within a single system, rather than separate models stitched together. One model might accept text, images, and audio, and respond in text or speech.

good for: voice + vision assistants, mixed-input tools

idea: one model, many kinds of input and output

Why multimodal matters in practice

Because so much of the real world is mixed. A receipt is an image and text. A help request might be a screenshot plus a sentence. A voice assistant that can also see your camera can answer "what is this?" out loud. Handling modalities together lets one model meet people where they already are, instead of forcing everything into plain text first.

strengths: mixed inputs · richer context · natural interaction

result: assistants that read, see, and listen

05 · Real uses and real limits

Multimodal AI already does useful, everyday things: describing what's in a photo for someone who can't see it, answering a question about a chart or a document, and powering voice-and-vision assistants that respond to what they hear and see. But the same systems have real limits worth keeping in mind. They can hallucinate across modalities — confidently describing something that isn't actually in the image. They can carry bias from the data they learned on. And handling rich inputs like images and video can be costly to run. Knowing both sides is part of using these tools well.

Walkthrough: uses and limits

Describe an image

Captioning. Given a photo, the model writes a sentence describing what's in it — useful for accessibility and for organizing large image collections.

Answer about a chart

Visual question answering. Show the model a chart or document and ask a question; it reads the visual and replies in words — "the value peaks in the third column."

Voice + vision

Assistants that see and hear. Combine spoken input with what a camera sees, so you can ask out loud "what is this?" and get a spoken answer.

Watch for hallucination

A real limit. The model can hallucinate across modalities — describing details that aren't actually in the image. Treat confident descriptions as claims to verify, not facts.

Bias & cost

Two more limits. Outputs can reflect bias in the training data, and running models on large images or video can be costly. Good use means weighing these trade-offs.

Real uses: describing images, answering questions about charts and documents, and voice-and-vision assistants.
Hallucination across modalities means the model can describe things that aren't really in the input — verify before trusting.
Bias and cost are practical limits to weigh whenever you deploy these models.
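
To get a feel for why images and video can be costly, here is a back-of-the-envelope sketch. It assumes the model cuts each image into fixed-size square patches and processes one unit per patch, which is a common approach but not something this lesson specifies; real systems often resize, tile, or compress inputs differently. The function names and the sizes used are illustrative.

```python
def patch_count(width_px, height_px, patch_px):
    """Count the non-overlapping square patches that fit in an image.

    Any leftover pixels at the right or bottom edge are ignored.
    """
    return (width_px // patch_px) * (height_px // patch_px)

def video_patch_count(width_px, height_px, patch_px, frames):
    """Count patches across every frame of a clip, with no sharing between frames."""
    return patch_count(width_px, height_px, patch_px) * frames

small_image = patch_count(224, 224, 16)     # 14 x 14 patches
large_image = patch_count(1024, 1024, 16)   # 64 x 64 patches
short_clip = video_patch_count(224, 224, 16, frames=30)

print(f"small image: {small_image} patches")
print(f"large image: {large_image} patches")
print(f"30-frame clip: {short_clip} patches")
```

Under these assumptions a 1024 x 1024 image needs roughly twenty times as many patches as a 224 x 224 one, and even a short clip multiplies the count by its number of frames. That growth is the practical reason cost belongs on the list of limits.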

06 · Check your understanding

TJS Quiz

07 · Take it with you & go deeper

"Multimodal AI in 5 minutes" — one-page summary

The whole module distilled to a printable cheat-sheet.

Coming next: deeper progression (specced & grounded)

Inside a vision-language model (coming soon · in the pipeline)

How a VLM connects an image encoder to a language model so it can answer questions about what it sees.

Multimodal limits, in depth (coming soon · in the pipeline)

A closer look at cross-modal hallucination, bias, and cost, and practical ways to evaluate and reduce them.

Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; figures shown in the interactives are illustrative and labeled as such.

Vision-language models — Transformers documentation — Hugging Face
A Survey on Multimodal Large Language Models — Yin et al. (2023)
Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al. (2021)
Flamingo: a Visual Language Model for Few-Shot Learning — Alayrac et al. (2022)

What is multimodal AI — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What a modality is

A modality is a kind of input — text, image, audio, or video. A model that takes one kind is unimodal; a model that takes several at once is multimodal. The value of multimodal AI is that it can relate one modality to another.

One shared space

Each input is turned into an embedding — a list of numbers capturing its meaning. Every modality is mapped into one shared space, where related items land close together regardless of which modality they came from. That is what lets a model compare a picture and a sentence.

Image–text alignment (CLIP)

CLIP-style training aligns images and text: matching image–caption pairs are pulled close, mismatches pushed apart. The closest text to an image is its best-matching description — learned from paired data, not hand-labeled objects.

VLMs & native multimodal

A vision-language model reasons about images and text together (captioning, visual question answering). A native-multimodal model is built from the start to handle several input and output types in one system — e.g. voice + vision assistants.

Real uses & limits

Uses: describing images, answering questions about charts, voice-and-vision assistants. Limits: cross-modal hallucination (describing things not in the input), bias from training data, and cost on large inputs. Verify confident descriptions.
