Track 01 · Language & Generation · Novice · start here · ~8 min

How do machines hear and speak?

When you talk to a voice assistant, two quiet conversions happen. First it has to hear you: turn the sound of your voice into text. Then it has to speak: turn text back into a voice. Those two moves, plus a thinking step in the middle, are the core of a voice assistant. Here's the whole idea, on one page.

01 Two directions: machines that hear and machines that speak

Think about a phone call with a friend. You listen to what they say, and you speak back. A voice assistant has to do both of those jobs, and each one is its own kind of problem. Turning your spoken voice into written words is called automatic speech recognition, or ASR: sound goes in, text comes out. Turning written words back into a spoken voice is called text-to-speech, or TTS: text goes in, sound comes out. They are mirror images of each other (one listens, one talks), and together they form the two ends of the speech pipeline.

ASR (automatic speech recognition) is the hearing direction: sound → text.
TTS (text-to-speech) is the speaking direction: text → sound.
A talking assistant is really these two systems, with a thinking step wired in between them.

Signature interactive: Step through the pipeline

Illustrative sample — not a real recording or model output

02 ASR: how a machine turns sound into text

Your voice reaches a microphone as a sound wave — a wiggling line of air pressure over time. A computer can't read a wiggle directly, so the first move is to redraw it as a spectrogram: a picture that shows which pitches are present and how they change moment to moment. From that picture, an acoustic model works out which speech sounds are being made, and those sounds are assembled into text. Older systems did this in many hand-built stages; modern end-to-end models (the Whisper-class systems) learn to go more directly from audio to text, trained on huge amounts of recorded speech paired with its transcript. Use the Hear it (ASR) path in the stepper above to watch wave → spectrogram → text.

A spectrogram is just a picture of the sound: which pitches appear, and when.
The acoustic model maps those sound patterns toward the units of language that become words.
End-to-end models (Whisper-class) fold the steps into one system learned from data, not hand-written rules.
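To make the wave → spectrogram step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any particular recognizer is implemented: the helper names (`make_tone`, `spectrogram`, `loudest_pitch_hz`), the 8000 Hz sample rate, and the two test tones are all made up for this example, and real systems use much faster transforms and perceptually scaled frequency axes.

```python
import cmath
import math

def make_tone(freq_hz, sample_rate, n_samples):
    """Return a pure sine tone as a list of float samples."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def spectrogram(samples, frame_size=64, hop=32):
    """Slice the wave into overlapping frames and measure each frame's frequencies.

    Returns a list of frames; each frame is a list of magnitudes, one per
    frequency bin from 0 Hz up to half the sample rate.
    """
    # A Hann window fades each frame in and out so its cut edges add less smear.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Naive discrete Fourier transform for bin k (slow but easy to read).
            total = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames

def loudest_pitch_hz(frame, sample_rate, frame_size=64):
    """Return the centre frequency of the strongest bin in one frame."""
    strongest_bin = max(range(len(frame)), key=lambda k: frame[k])
    return strongest_bin * sample_rate / frame_size

SAMPLE_RATE = 8000  # samples per second (assumed for this toy example)
# Toy "recording": a low tone followed by a higher tone.
wave = make_tone(500, SAMPLE_RATE, 512) + make_tone(1500, SAMPLE_RATE, 512)
spec = spectrogram(wave)

first_pitch = loudest_pitch_hz(spec[0], SAMPLE_RATE)
last_pitch = loudest_pitch_hz(spec[-1], SAMPLE_RATE)
print(len(spec), "frames;", first_pitch, "Hz at the start,", last_pitch, "Hz at the end")
```

Each row of `spec` is one moment in time and each column is one pitch, so the strongest column moving from 500 Hz to 1500 Hz is exactly the "which pitches appear, and when" picture described above.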

Why it can mishear: a spectrogram of a noisy room has the speech smeared together with everything else, and a model recognizes best the kinds of voices it heard most during training. That's why background noise and unfamiliar accents are the two classic reasons a transcript comes out wrong.

Quick breather. So far: one direction. A microphone hears a sound wave, the wave is redrawn as a spectrogram, and a model reads that picture to produce text. That's the whole hearing side. Next we run the pipeline the other way and turn text back into a voice.

03 TTS: how a machine turns text into a voice

Now run the pipeline backwards. You hand the system some text, and it has to produce audio that sounds like a person reading it aloud. First it figures out how the words should sound: it breaks the text into phonemes, the basic building-block sounds of speech, and predicts other features like rhythm and emphasis (in systems such as Tacotron 2, these features take the form of a predicted spectrogram). Then a part called a neural vocoder takes those sound features and generates the actual audio waveform you hear, building the sound itself rather than gluing together pre-recorded clips. This text → phonemes/features → vocoder → audio flow is the shape behind modern, natural-sounding voices. Try the Speak it (TTS) path in the stepper.

A phoneme is a basic unit of speech sound — the building block of how a word is pronounced.
The neural vocoder generates the final waveform you hear, rather than replaying stored recordings.
The path is text → phonemes/features → neural vocoder → natural audio: the mirror image of ASR, run in the opposite direction.
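The three TTS stages can be sketched as three small functions. This is a toy under loud assumptions: the two-word `LEXICON`, the made-up pitch and duration values, and the function names are all hypothetical, and `toy_vocoder` is a plain sine oscillator standing in for a trained neural network. It only shows the shape of the data as it moves through the pipeline.

```python
import math

# Hypothetical mini lexicon (ARPAbet-style labels); real systems use large
# pronunciation dictionaries or learned models.
LEXICON = {
    "hi": ["HH", "AY"],
    "there": ["DH", "EH", "R"],
}

def text_to_phonemes(text):
    """Step 1: look up each word's phonemes."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes

def phonemes_to_features(phonemes, start_pitch_hz=130.0, duration_s=0.08):
    """Step 2: attach made-up pitch and duration values to every phoneme.

    Pitch falls by 5 Hz per phoneme as a crude stand-in for intonation.
    """
    return [(p, start_pitch_hz - 5.0 * i, duration_s)
            for i, p in enumerate(phonemes)]

def toy_vocoder(features, sample_rate=16000):
    """Step 3: generate waveform samples from the features.

    Stand-in only: a real neural vocoder is a trained network, not a sine
    oscillator. The point is that audio is generated, not replayed.
    """
    samples = []
    for _, pitch_hz, duration_s in features:
        n_samples = round(duration_s * sample_rate)
        samples.extend(math.sin(2 * math.pi * pitch_hz * i / sample_rate)
                       for i in range(n_samples))
    return samples

phonemes = text_to_phonemes("Hi there")
features = phonemes_to_features(phonemes)
audio = toy_vocoder(features)
print(phonemes, len(audio), "samples")
```

The output of each step is the input of the next: a string becomes a list of phoneme labels, the labels gain timing and pitch, and only the last step produces numbers a speaker could play.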

04 The voice agent: hear, think, speak, fast enough to feel live

Put both directions together and add a brain in the middle, and you get a voice agent — the thing that holds a spoken conversation. The loop is: your speech goes in, ASR turns it into text, a language model (LLM) reads that text and decides what to say, TTS turns the reply back into a voice, and the speech comes out. The hard part isn't any one step — it's doing all of them fast enough that the back-and-forth feels natural. Every stage adds a little delay, and too much total delay (latency) makes the conversation feel laggy. One common trick is streaming: starting to process your words before you've even finished the sentence, so the reply can begin sooner. Switch the stepper to Round trip to walk the full loop.

The loop is speech in → ASR → LLM → TTS → speech out — hear, think, speak.
The LLM is the thinking step: it reads the transcript and writes the reply text.
Latency is the central challenge; streaming overlaps the stages to cut the perceived delay.
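A little arithmetic shows why streaming helps. The sketch below uses invented numbers and a deliberately simplified model: each stage has a time to finish the whole utterance and a shorter time to produce its first partial output. The `Stage` class, the millisecond values, and both function names are assumptions made for illustration, not measurements of any real system.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    full_ms: int         # time to process the whole utterance
    first_chunk_ms: int  # time until the first partial output is ready

# Hypothetical delays, chosen only to make the arithmetic easy to follow.
PIPELINE = [
    Stage("ASR", full_ms=300, first_chunk_ms=100),
    Stage("LLM", full_ms=500, first_chunk_ms=150),
    Stage("TTS", full_ms=250, first_chunk_ms=80),
]

def batch_delay_ms(stages):
    """Each stage waits for the previous one to finish completely."""
    return sum(stage.full_ms for stage in stages)

def streaming_delay_ms(stages):
    """Each stage starts as soon as the previous one emits its first chunk."""
    return sum(stage.first_chunk_ms for stage in stages)

batch = batch_delay_ms(PIPELINE)
streamed = streaming_delay_ms(PIPELINE)
print(f"wait before the reply starts: {batch} ms batch, {streamed} ms streaming")
```

The total work is the same in both cases. Streaming only changes when the listener first hears something, which is the delay that makes a conversation feel live or laggy.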

Worth knowing: the language model in the middle is the same kind of next-token-predicting system used in chatbots — so a voice agent inherits the same caution. It can sound confident and still be wrong. Treat what it says as a helpful draft, and verify anything important.

05 What it's good for, and what to watch out for

Speech AI quietly powers a lot already: live captions, dictation, voice assistants, audiobooks, call-center help, and accessibility tools for people who can't easily type or read. But the same systems carry real limits. Accuracy drops with background noise, and because models learn best from the speech they were trained on, they can serve some accents and dialects worse than others — a fairness problem, not a quirk. And the speaking side has a sharper edge: voice cloning can copy how a specific person sounds, which can be misused to impersonate someone without their consent — a "deepfake voice." That's why consent, disclosure, and verifying who you're really talking to matter more as these voices get more convincing.

Great for: captions, dictation, assistants, audiobooks, and accessibility.
Watch out for: noise and accent bias on the hearing side — performance often tracks the training data.
Handle with care: synthetic voices can impersonate people, so consent and disclosure are essential.

06 Check your understanding

TJS Quiz

07 Take it with you & go deeper

"Speech & Voice AI in 5 minutes" — one-page summary

The whole module distilled to a printable cheat-sheet.

▸ Look up a term — AI glossary

Automatic speech recognition (ASR): the one-line definition of the hearing direction (sound to text), plus the terms around it.
Text-to-speech (TTS): what the speaking direction is, and how phonemes and a vocoder turn text into a voice.

▸ Coming next — deeper progression

Spectrograms & audio features (coming soon): how a sound wave becomes the time-frequency picture a recognition model actually reads.
Building a voice agent (coming soon): wiring ASR, an LLM, and TTS into a low-latency loop that feels like a live conversation.

Sources & review

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; the stepper sample is illustrative and labelled as such.

Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) — Radford et al. (2022)
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (Tacotron 2) — Shen et al. (2017)
WaveNet: A Generative Model for Raw Audio — van den Oord et al. (2016)
Hugging Face Audio Course — Hugging Face

Speech & Voice AI — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Two directions

Speech AI has a hearing side and a speaking side. ASR (automatic speech recognition) turns sound into text. TTS (text-to-speech) turns text into sound. They are mirror images.

ASR — sound to text

A microphone captures a sound wave. It is redrawn as a spectrogram — a picture of which pitches appear over time. An acoustic model maps that toward language units, which become text. Modern end-to-end models (Whisper-class) learn this mapping directly from data.

TTS — text to sound

Text is turned into phonemes (basic speech sounds) and features, then a neural vocoder generates the actual audio waveform — building the sound rather than replaying clips. Flow: text → phonemes/features → vocoder → audio.

The voice agent loop

A talking assistant chains both directions with a brain between them: speech in → ASR → LLM → TTS → speech out. The LLM decides what to say. The hard part is latency — doing it fast enough to feel live; streaming overlaps the stages to help.

Limits & responsible use

Accuracy drops with noise and can be biased toward accents seen most in training. Synthetic voices can be misused to impersonate people (deepfake voice), so consent and disclosure matter. As with any LLM, verify anything important.
