GPT-realtime for realtime voice agents (2025): Build a Voice Agent with the Shortest Possible Code

Conversational AI has been moving quickly, but August 28, 2025 marked a real shift: OpenAI announced the general availability of its Realtime API, powered by the new gpt-realtime voice model. For developers building voice-to-voice interactions, this means far less complexity. Instead of wiring together a multi-stage pipeline, you can run realtime voice processing through a single model with remarkably little code. The result is not just lower latency but richer, more natural interactions that feel closer to a conversation with a knowledgeable assistant.

The `gpt-realtime` revolution: superior conversations for realtime voice agents

At the core of this shift is gpt-realtime itself. Unlike earlier architectures that chained a speech-to-text model, a text-based language model, and a text-to-speech model, gpt-realtime is an end-to-end speech-to-speech system. Handling audio in a single model cuts latency and preserves nuances and emotional inflections that an intermediate text transcript would drop.

Imagine an AI that doesn't just parse words: it picks up on laughter, switches languages mid-sentence, and adapts its tone. OpenAI's reported benchmarks reflect this progress, with scores of 82.8% on Big Bench Audio for reasoning and 30.5% on MultiChallenge Audio for instruction following. Two new voices, Cedar and Marin, arrive exclusively with the Realtime API, while existing voices have been refreshed for naturalness and expressiveness.

Crafting your agent: efficient usage with the Realtime API

Building a gpt-realtime voice agent is straightforward thanks to the Realtime API and the OpenAI Agents SDK. The SDK abstracts audio transport and session management, so in the browser most developers can rely on the OpenAIRealtimeWebRTC transport to capture the microphone and play back audio with minimal configuration.

Here’s the essence:

Create a RealtimeSession with your chosen model (gpt-realtime), audio formats, and turn detection settings
Let the session manage conversation history so your agent maintains fluid context
Use built-in function calling to integrate tools and data sources. OpenAI reports a score of 66.5% for gpt-realtime on ComplexFuncBench, and the model supports asynchronous function calling, so the conversation can continue while a tool runs

This approach reduces boilerplate and keeps you focused on core logic and UX rather than intricate audio handling.
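
To make the asynchronous function calling idea concrete, here is a minimal TypeScript sketch of the pattern rather than of the Agents SDK itself. Every name in it (`AsyncToolRunner`, `ToolCall`, `lookup_order`, `demo`) is hypothetical and chosen only for illustration. It assumes a tool is an ordinary async function and uses a plain string log to stand in for the session's conversation history.

```typescript
// Hypothetical sketch: none of these names come from the Agents SDK.
type ToolArgs = Record<string, unknown>;
type Tool = (args: ToolArgs) => Promise<string>;
type ToolCall = { id: string; name: string; args: ToolArgs };

class AsyncToolRunner {
  // Ordered log standing in for the session's conversation history.
  readonly transcript: string[] = [];
  private readonly tools: Record<string, Tool>;
  private readonly pending = new Map<string, Promise<void>>();

  constructor(tools: Record<string, Tool>) {
    this.tools = tools;
  }

  // Kick off a tool call and return immediately instead of awaiting it.
  start(call: ToolCall): void {
    const tool = this.tools[call.name];
    if (tool === undefined) {
      this.transcript.push(`tool_error:${call.id}:unknown tool`);
      return;
    }
    // Wrapping the call in a Promise turns a synchronous throw into a rejection.
    const job = new Promise<string>((resolve) => resolve(tool(call.args)))
      .then(
        (output) => `tool_result:${call.id}:${output}`,
        (err) => `tool_error:${call.id}:${String(err)}`,
      )
      .then((line) => {
        this.transcript.push(line);
        this.pending.delete(call.id);
      });
    this.pending.set(call.id, job);
  }

  // The agent can keep talking while tool calls are still in flight.
  say(text: string): void {
    this.transcript.push(`assistant:${text}`);
  }

  // Wait for every in-flight tool call to settle.
  async drain(): Promise<void> {
    await Promise.all(Array.from(this.pending.values()));
  }
}

// Usage: the tool only finishes after the agent has spoken twice.
async function demo(): Promise<string[]> {
  let finishLookup: (value: string) => void = () => {};
  const runner = new AsyncToolRunner({
    lookup_order: () =>
      new Promise<string>((resolve) => {
        finishLookup = resolve;
      }),
  });
  runner.start({ id: "call_1", name: "lookup_order", args: { orderId: "A1" } });
  runner.say("Let me check that order.");
  runner.say("Anything else while I look?");
  finishLookup("shipped");
  await runner.drain();
  return runner.transcript;
}
```

In `demo`, both assistant lines land in the transcript before the tool result does, which is the behavior the text describes: speech is not blocked on the tool. A real agent would feed the result back to the model rather than append it to a log.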

Expanded capabilities and practical considerations

Beyond the core model, the Realtime API adds features that materially improve voice agents:

Multimodality: Add images, photos, and screenshots alongside audio or text so users can ask “what do you see?” and ground conversations in visual context
SIP support: Connect directly to public phone networks, PBX systems, and other SIP endpoints to expand deployment options
Reusable prompts: Keep tone and behavior consistent across sessions
Conversation history management: Automatic within RealtimeSession
Guardrails: Safety checks monitor responses for rule violations and can cut off unwanted speech

Community feedback notes that aggressive Voice Activity Detection (VAD) settings in the Playground can occasionally cause self-interruptions, and the model may sometimes make follow-up commitments it can’t fulfill. Test thoroughly and tune VAD and turn-taking for your use case.
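
One way to approach that tuning is to keep turn-detection settings in named profiles and switch between them while testing. The sketch below is an assumption-heavy starting point: the field names are modelled on the Realtime API's server-side VAD settings and should be checked against the current API reference, the profile names and the helpers `turnDetectionFor` and `validateVad` are hypothetical, and the numbers are illustrative rather than recommended values.

```typescript
// Shape modelled on the Realtime API's server-side VAD settings. Treat the
// field names and every number below as assumptions to verify and tune.
type ServerVadConfig = {
  type: "server_vad";
  threshold: number; // 0 to 1; higher needs louder audio to count as speech
  prefix_padding_ms: number; // audio kept from just before speech was detected
  silence_duration_ms: number; // silence required before the turn is treated as over
};

// Hypothetical profile names for two ends of the trade-off.
type VadProfile = "responsive" | "patient";

function turnDetectionFor(profile: VadProfile): ServerVadConfig {
  if (profile === "patient") {
    // Less eager: harder to trigger and slower to end the user's turn.
    return {
      type: "server_vad",
      threshold: 0.7,
      prefix_padding_ms: 300,
      silence_duration_ms: 800,
    };
  }
  // More eager: quicker replies, but more likely to react to noise or echo.
  return {
    type: "server_vad",
    threshold: 0.5,
    prefix_padding_ms: 300,
    silence_duration_ms: 500,
  };
}

// Basic sanity checks before sending a config to a session.
function validateVad(config: ServerVadConfig): string[] {
  const problems: string[] = [];
  if (config.threshold < 0 || config.threshold > 1) {
    problems.push("threshold must be between 0 and 1");
  }
  if (config.prefix_padding_ms < 0) {
    problems.push("prefix_padding_ms must not be negative");
  }
  if (config.silence_duration_ms <= 0) {
    problems.push("silence_duration_ms must be positive");
  }
  return problems;
}
```

If an agent keeps interrupting itself, trying the less eager profile first is a reasonable experiment. Whether it helps depends on the microphone, echo cancellation, and background noise in your deployment.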

On costs, gpt-realtime is priced at $32 / 1M audio input tokens and $64 / 1M audio output tokens, a reported 20% reduction from preview pricing. Manage context carefully to control spend. And always consider data privacy obligations (e.g., GDPR in the EU, CCPA in California) when handling sensitive user information.
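
For a rough sense of scale, the audio prices above can be turned into a small estimator. This is a back-of-the-envelope sketch: the function name `estimateAudioCostUsd` is hypothetical, it covers audio tokens only at the two rates quoted in the text, and it ignores text tokens and any cached-input discounts.

```typescript
// Prices quoted in the text, in USD per one million audio tokens.
const AUDIO_INPUT_USD_PER_MILLION = 32;
const AUDIO_OUTPUT_USD_PER_MILLION = 64;

// Hypothetical helper: estimates audio-token spend only.
function estimateAudioCostUsd(inputTokens: number, outputTokens: number): number {
  if (inputTokens < 0 || outputTokens < 0) {
    throw new RangeError("token counts must be non-negative");
  }
  const inputCost = inputTokens * AUDIO_INPUT_USD_PER_MILLION;
  const outputCost = outputTokens * AUDIO_OUTPUT_USD_PER_MILLION;
  return (inputCost + outputCost) / 1e6;
}

// Example: 100,000 audio input tokens and 50,000 audio output tokens
// come to about 3.20 + 3.20 = 6.40 USD.
const exampleCostUsd = estimateAudioCostUsd(100_000, 50_000);
```

Because output tokens cost twice as much as input tokens, shorter spoken replies lower spend faster than trimming the same number of input tokens.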

Trust, safety, and the road ahead

OpenAI has integrated multiple safeguards and mitigations into gpt-realtime and the Realtime API, including active classifiers that can halt harmful conversations. You can layer your own guardrails on top with the Agents SDK. Usage policies prohibit misuse (such as spam or deception), and experiences must make it clear to users that they are interacting with an AI. Preset voices also help reduce the risk of malicious impersonation. With EU data residency options and enterprise privacy commitments, the platform is designed to support responsible deployment.
