Introduction

Learn how to enable real-time voice conversations in your app. OpenAI Realtime Voice allows your app to support fast, natural voice interactions using WebRTC or WebSocket connections. With sub-second latency, interruption handling, and function calling, it enables highly responsive voice assistants and real-time conversational experiences.

What You Can Build

With OpenAI Realtime Voice, your app can support:
  • Real-Time Voice Assistants – Enable users to speak and receive instant responses.
  • Voice-Based Customer Support – Build assistants that can answer questions and perform actions.
  • Interruptible Conversations – Allow natural conversation flow where users can interrupt and continue.
  • Function-Enabled Voice Agents – Connect voice interactions to app actions like fetching data or triggering workflows.
  • Phone and Call-Based Experiences – Create voice-driven experiences similar to phone-based call agents.

How It Works

When OpenAI Realtime Voice is enabled, your app establishes a real-time connection using WebRTC or WebSocket. Your app can:
  • capture live audio input from users
  • stream audio to the AI in real time
  • receive and play generated voice responses instantly
  • handle interruptions during conversations
  • trigger function calls based on user intent
This creates a seamless, low-latency conversational experience that feels natural and interactive.

Example Prompts

You can use prompts like these to implement features:
  • Add a real-time voice assistant – "Add a real-time voice assistant to my app that users can talk to naturally with instant responses using OpenAI Realtime."
  • Add a voice support agent – "Add a voice-powered customer support agent to my app that can look up orders and answer questions in real time using OpenAI."

Common Use Cases

OpenAI Realtime Voice is commonly used for:
  • AI voice assistants
  • customer support voice agents
  • real-time interactive apps
  • voice-driven workflows
  • conversational interfaces

Best Practices

To get the best results:
  • optimize for low latency to keep conversations natural
  • handle interruptions gracefully
  • connect voice to meaningful actions using function calls
  • provide clear UI feedback (listening, speaking states)
  • design concise responses for better flow
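
Connecting voice to meaningful actions works by declaring tools the model may call, then returning each function's result to the conversation. The sketch below builds those two events. The payload shapes (`session.update` with a flat `tools` list, and `conversation.item.create` with a `function_call_output` item) follow the OpenAI Realtime API's documented schema, but `look_up_order` and its parameters are hypothetical stand-ins for your app's own functions; confirm the event schema against the current OpenAI docs.

```python
import json

def session_update_with_tools() -> str:
    """Declare a hypothetical look_up_order tool so the model can call it
    when a user asks about an order."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "tool_choice": "auto",
            "tools": [{
                "type": "function",
                "name": "look_up_order",  # hypothetical app function
                "description": "Fetch the status of a customer's order.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }],
        },
    })

def tool_result_event(call_id: str, result: dict) -> str:
    """After running the function in your app, return its result to the
    conversation so the model can speak an answer."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    })
```

Sending `session_update_with_tools()` once at session start is enough; afterwards, watch server events for function-call items, run the named function, and reply with `tool_result_event`.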