xAI Launches Two Standalone Audio APIs for Developers Building Voice Agents, Real-Time Transcription, and Podcasts
xAI has launched two standalone audio APIs, Grok Speech to Text and Grok Text to Speech, both built on the same audio technology stack that powers Grok Voice, Tesla's in-car system, and Starlink customer service. Developers can now call them as separate endpoints for applications such as voice agents, real-time transcription, accessibility tools, and podcasts.
Grok Speech to Text offers two modes, batch transcription over REST and real-time streaming over WebSocket. It supports word-level timestamps, speaker diarization, multi-channel recognition, and Inverse Text Normalization, which converts spoken forms into standard written text. It covers more than 25 languages and handles switching between languages mid-conversation. Grok Text to Speech supports inline Speech Tags such as [laugh], [sigh], and [whisper] to control emotion and prosody. In the word error rate comparison published by xAI, Grok scores 6.9% overall, lower than ElevenLabs at 9.0%, Deepgram at 11.0%, and AssemblyAI at 12.9%; on entity recognition in telephone calls, xAI reports a 5.0% error rate for Grok. Pricing is $0.10 per hour of audio for batch STT, $0.20 per hour for streaming STT, and $4.20 per million characters for TTS.
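To make the quoted pricing concrete, the short Python sketch below estimates costs from the three rates in the article. It assumes billing is strictly linear in audio hours and characters, with no minimums, rounding rules, or volume tiers, none of which the article specifies. The function names `stt_cost` and `tts_cost` are illustrative helpers and are not part of any xAI SDK.

```python
# Rates as quoted in the article (USD).
STT_BATCH_PER_HOUR = 0.10
STT_STREAMING_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated speech-to-text cost for a given number of audio hours.

    Assumes linear billing; real invoices may round or apply minimums.
    """
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated text-to-speech cost for a given number of input characters."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    # 500 hours of recorded calls, transcribed in batch mode.
    print(f"Batch STT, 500 h:      ${stt_cost(500):.2f}")
    # The same 500 hours, transcribed live over the streaming mode.
    print(f"Streaming STT, 500 h:  ${stt_cost(500, streaming=True):.2f}")
    # A script of 2 million characters, synthesized to speech.
    print(f"TTS, 2M characters:    ${tts_cost(2_000_000):.2f}")
```

Under these assumptions, streaming transcription costs exactly twice as much as batch for the same audio, so workloads that do not need live results are cheaper to run in batch mode.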
Source: Public Information