Flash News

OpenAI Launches Three New Voice Models in Realtime API

OpenAI has released three new voice models in the Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, focusing on dialogue, real-time translation, and streaming transcription, respectively.

GPT-Realtime-2 offers GPT-5-level reasoning: its Big Bench Audio score rises from 81.4% to 96.6%, and its Audio MultiChallenge multi-turn instruction-following score from 34.7% to 48.5%. The context window has been expanded to 128K tokens, and reasoning intensity is adjustable across five levels.
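As a rough illustration of what selecting a reasoning level might look like, here is a minimal sketch that builds a session-configuration payload. The model id "gpt-realtime-2", the "reasoning_effort" field, and the five level names are assumptions drawn from the announcement, not confirmed API parameters:

```python
# Hypothetical sketch of a Realtime API session.update payload.
# The model id, "reasoning_effort" field, and level names below are
# assumptions based on the announcement, not documented parameters.
import json

REASONING_LEVELS = ("minimal", "low", "medium", "high", "maximum")  # assumed 5 levels

def build_session_update(effort: str = "medium") -> dict:
    """Build a session.update event payload selecting a reasoning level."""
    if effort not in REASONING_LEVELS:
        raise ValueError(f"effort must be one of {REASONING_LEVELS}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",       # hypothetical model id
            "modalities": ["audio", "text"],
            "reasoning_effort": effort,      # hypothetical parameter
        },
    }

print(json.dumps(build_session_update("high"), indent=2))
```

In practice such a payload would be sent as a JSON event over the Realtime API's WebSocket connection after the session is established.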

The new models include interaction optimizations such as sentence pre-padding, transparent reporting of tool invocations, and proactive error explanations. GPT-Realtime-Translate supports more than 70 input languages and 13 output languages and has been tested by Deutsche Telekom; GPT-Realtime-Whisper is the streaming transcription variant.

Source: Public Information

ABAB AI Insight

OpenAI previously launched Realtime API v1 and iterated on GPT-4o-realtime. GPT-Realtime-2 now targets GPT-5-level capability, carrying the rapid advances of the o1/o3 reasoning series into real-time voice and continuing OpenAI's pattern of embedding cutting-edge reasoning behind low-latency interfaces.

On the capital side, OpenAI matches high compute to complex tasks through tiered pricing ($32/M input, $64/M output) and adjustable reasoning intensity, raising enterprise developers' willingness to pay. Spending is shifting from general model training toward real-time multimodal infrastructure, with the aim of capturing the voice-agent and cross-border customer-service markets, while the Playground gathers user feedback quickly to accelerate iteration.
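Under the quoted tiered pricing, per-session cost scales linearly with token volume. A back-of-the-envelope sketch, assuming the "/M" rates mean dollars per million input/output tokens (the brief does not spell this out):

```python
# Rough cost estimate under the quoted rates, assuming "$32/M" and
# "$64/M" mean dollars per million input/output tokens respectively.
INPUT_RATE = 32.0 / 1_000_000   # $ per input token (assumed)
OUTPUT_RATE = 64.0 / 1_000_000  # $ per output token (assumed)

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one realtime session."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a session consuming 50k input and 10k output tokens:
print(f"${session_cost(50_000, 10_000):.2f}")  # → $2.24
```

The asymmetric input/output rates mean output-heavy workloads (long spoken responses) dominate the bill, which is one lever behind the adjustable reasoning intensity.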

Like Google Gemini Live's push into real-time translation, and the competition from Deepgram and AssemblyAI in enterprise transcription, this signals that real-time voice AI is moving from single-turn dialogue to long-context agent control.

Essentially, this is a technological replacement. Transparent tool invocation and context expansion move the voice interface from "chat tool" to "programmable agent operating system"; the mechanism reduces users' waiting anxiety and black-box feeling, lets real-time reasoning directly replace human customer service and multilingual coordination in enterprise workflows, and concentrates pricing power in platforms that own low-latency infrastructure.

ABAB News · Cognitive Law

The smarter the model, the more users need to hear it "working" rather than guess whether it is still alive.
The larger the context window, the faster an agent goes from toy to controller of the process.
Real-time is not just speed; it is making users feel the AI breathing in sync with them.
