Voice AI just got a product taxonomy.
OpenAI didn’t ship a single “realtime” upgrade. It shipped three distinct models, each optimized for a different job: live reasoning, live translation, and live transcription. For developers who’ve been treating the Realtime API as one thing, that distinction matters immediately for architecture decisions.
OpenAI’s announcement confirms all three models are available in the API as of May 7, 2026. Here’s what each one actually does.
GPT-Realtime-2
GPT-Realtime-2 brings GPT-5-class reasoning to live voice interactions. That’s OpenAI’s own characterization; no independent benchmark exists yet for its voice-specific performance. What it means practically: if you’re building a voice agent that needs to handle multi-step reasoning mid-conversation, this is the model tier to evaluate. It’s not a transcription layer. It’s the reasoning engine.
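If you’re evaluating this tier, a minimal connection sketch frames the decision. It assumes the new model is selected the way existing Realtime models are, via the model query parameter on the WebSocket URL; the gpt-realtime-2 identifier string and the session fields are assumptions to verify against OpenAI’s current docs.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

# Assumed model identifier; OpenAI's published string may differ.
MODEL = "gpt-realtime-2"

async def main() -> None:
    url = f"wss://api.openai.com/v1/realtime?model={MODEL}"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session for a reasoning-heavy voice agent; the
        # payload shape follows the existing API's session.update event.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "Reason through multi-step requests aloud.",
                "modalities": ["audio", "text"],
            },
        }))
        # Print server event types as they stream in.
        async for message in ws:
            print(json.loads(message)["type"])

asyncio.run(main())
```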
GPT-Realtime-Translate
GPT-Realtime-Translate handles live speech translation from 70+ input languages into 13 output languages. The asymmetry here matters. Seventy-plus languages in, thirteen out. Enterprise teams building multilingual voice tools need to check their target output languages against that list before scoping any project. OpenAI says it “keeps pace” with live speech, with sub-100ms latency for voice-to-voice interactions, though that figure reflects the company’s own specifications and hasn’t been independently benchmarked.
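That scoping check is worth automating before any code ships. Here’s a sketch of a pre-flight guard; the 13 language codes below are placeholders, since the supported set isn’t enumerated here, so substitute OpenAI’s published list.

```python
# Placeholder set: substitute the 13 output languages OpenAI publishes.
SUPPORTED_OUTPUT_LANGS = {
    "en", "es", "fr", "de", "it", "pt", "nl",
    "ja", "ko", "zh", "ar", "hi", "ru",
}

def unsupported_outputs(required: set[str]) -> set[str]:
    """Return the target output languages the translate tier can't produce."""
    return required - SUPPORTED_OUTPUT_LANGS

# Example: a product scoped for Polish output fails the pre-flight check.
missing = unsupported_outputs({"en", "de", "pl"})
if missing:
    raise SystemExit(f"Output languages not supported: {sorted(missing)}")
```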
GPT-Realtime-Whisper
GPT-Realtime-Whisper is a streaming transcription model. It transcribes in real time rather than processing audio in chunks. For call center analytics, accessibility tooling, or any workflow that needs a live text stream from audio, this is the dedicated path.
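For teams wiring up that live text stream, here’s a sketch of the consuming side. It assumes the event flow mirrors the existing Realtime transcription interface (input_audio_buffer.append going in, a transcription-completed event coming out); the model string and event names are assumptions, not confirmed identifiers.

```python
import asyncio
import base64
import json
import os

import websockets

async def stream_transcripts(pcm_chunks) -> None:
    # Assumed identifier for the streaming transcription tier.
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(url, additional_headers=headers) as ws:

        async def send_audio() -> None:
            for chunk in pcm_chunks:  # raw 16-bit PCM frames
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        asyncio.create_task(send_audio())
        async for message in ws:
            event = json.loads(message)
            # Assumed to mirror the existing API's
            # conversation.item.input_audio_transcription.completed event.
            if event["type"].endswith("transcription.completed"):
                print(event.get("transcript", ""))
```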
The part nobody mentions: until now, developers evaluating the Realtime API were making architecture decisions based on a single model’s tradeoffs. Now you’re choosing between three distinct capability profiles, and getting the wrong one means rebuilding. The clearest version of that risk is translation: GPT-Realtime-Translate handles input from over 70 languages, but if your users need output in more than 13, you’ll hit a ceiling with no published expansion timeline.
OpenAI hasn’t disclosed per-model API pricing for the new suite. Cost per token or per minute of audio, at production scale, isn’t in the announcement. If cost is a deciding factor for your deployment, and at volume it usually is, you’ll need to run API calls and measure before committing to a tier.
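That measurement pass is simple to instrument. Here’s a sketch of a usage tally, assuming the new models report usage on response.done events the way the existing Realtime API does; the field names are assumptions to verify.

```python
def accumulate_usage(events: list[dict]) -> dict:
    """Tally token usage from decoded Realtime server events.

    Assumes usage arrives on response.done events, as in the existing API.
    Multiply the totals by per-token rates once OpenAI publishes them.
    """
    totals = {"input_tokens": 0, "output_tokens": 0}
    for event in events:
        if event.get("type") == "response.done":
            usage = event.get("response", {}).get("usage", {})
            totals["input_tokens"] += usage.get("input_tokens", 0)
            totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals
```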
Unanswered Questions
- Does the 13-output-language ceiling expand, and on what timeline?
- What is per-model API pricing at production call volume?
- How does GPT-Realtime-2 latency hold under concurrent session load, not just single-session conditions?
What to Watch
This release extends a pattern worth tracking. OpenAI has been segmenting its model family by use case rather than offering one generalist API. GPT-5.5 Pro for reasoning. GPT-5.5 Instant for high-volume tasks. Now three realtime voice models for three distinct voice AI workflows. The implication isn’t just product differentiation; it’s that OpenAI is betting enterprises will pay for specialization rather than generalization. That’s a meaningful shift from “one model, many uses” to “pick the right model for the job.”
For teams building voice agents today: map your use case to the right model tier before touching the API. Reasoning, translation, and transcription aren’t interchangeable, and the 70-in/13-out translation constraint is the sharpest practical edge case to resolve in requirements before any architecture decision.
Wait for independent latency benchmarks before making sub-100ms guarantees to stakeholders. OpenAI’s own specs are a starting point, not a production SLA.
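Until those benchmarks exist, measure for yourself. Here’s a sketch of the round-trip timing that matters, end of user audio to first audio byte back, computed over events captured from a session; the event names mirror the existing Realtime API and should be verified against current docs.

```python
def voice_roundtrip_seconds(timed_events: list[tuple[float, dict]]) -> float | None:
    """Time from audio commit to first audio delta in a captured session.

    `timed_events` pairs a monotonic timestamp with each decoded server
    event; event names mirror the existing Realtime API's types.
    """
    committed_at = None
    for ts, event in timed_events:
        if event["type"] == "input_audio_buffer.committed":
            committed_at = ts
        elif event["type"] == "response.audio.delta" and committed_at is not None:
            return ts - committed_at
    return None
```

Run it across concurrent sessions, not a single call; as the unanswered questions above note, single-session latency tells you little about behavior under load.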