Build a Voice Agent That Answers the Phone
You want your agent to answer the phone. Not a chatbot with a phone tree — an actual voice conversation backed by an LLM, with access to your tools, your calendar, your context. Here’s the shape of that system.
The topology
Four services, one room:
```
Phone call → SIP trunk provider → LiveKit server → Voice worker → LLM
                                                                → STT (speech-to-text)
                                                                → TTS (text-to-speech)
```

**LiveKit** is the media server. It handles WebRTC, SIP, room management, and media routing. Open source, self-hostable. This is the center of the system.

A **SIP trunk provider** (Telnyx, Twilio, Vonage) gives you a phone number and converts phone calls to SIP, which LiveKit speaks natively.

The **voice worker** is your code. It joins LiveKit rooms, listens to audio, runs it through STT, sends text to your LLM, and plays TTS audio back. The LiveKit Agents SDK (Python) handles the plumbing.

**STT/TTS services** — Deepgram for speech-to-text; ElevenLabs, Cartesia, or OpenAI for text-to-speech. The Agents SDK has built-in plugins for all of these.
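The per-turn data flow above can be sketched in plain Python. This is a toy illustration of the loop shape, not SDK code; every name here is made up, and lambdas stand in for the real STT/LLM/TTS services:

```python
def respond_to_utterance(audio_chunk, stt, llm, tts):
    """One turn of the conversation loop: listen -> transcribe -> think -> speak."""
    text = stt(audio_chunk)   # speech-to-text (e.g. Deepgram)
    reply = llm(text)         # the LLM produces the response text
    return tts(reply)         # synthesized audio to play back into the room

# Toy stand-ins to show the data flow end to end
audio_out = respond_to_utterance(
    b"\x00\x01",
    stt=lambda audio: "what's on my calendar?",
    llm=lambda text: "You have three meetings today.",
    tts=lambda text: text.encode(),
)
```

In the real system each stage is streaming and asynchronous, but the shape of the pipeline is exactly this.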
Running LiveKit
Docker Compose

```yaml
services:
  livekit:
    image: livekit/livekit-server:latest
    ports:
      - "7880:7880"     # HTTP + WebSocket
      - "7881:7881"     # RTC (WebRTC)
      - "7882:7882"     # TURN/TCP
      - "5060:5060/udp" # SIP
    volumes:
      - ./livekit.yaml:/etc/livekit.yaml
    command: ["--config", "/etc/livekit.yaml"]
    restart: unless-stopped
```

Kubernetes (Helm)
```shell
helm repo add livekit https://helm.livekit.io
helm install livekit livekit/livekit-server \
  -f livekit-values.yaml \
  --namespace voice
```

Either way, you need a config file:
```yaml
# livekit.yaml
port: 7880
rtc:
  port_range_start: 50000
  port_range_end: 60000
  use_external_ip: true
keys:
  your-api-key: your-api-secret
sip:
  enabled: true
logging:
  level: info
```

Generate your API key/secret pair with the LiveKit CLI:

```shell
lk generate-keys
```

The SIP problem
SIP requires a public IP that can receive UDP traffic. If your server is behind NAT (home network, most cloud setups), you have two options:
- TURN relay — LiveKit has built-in TURN. Configure `rtc.turn` in the config.
- Public proxy — a tiny VM with a public IP that forwards SIP to your server via a VPN tunnel (Tailscale, WireGuard).
The public proxy is more reliable for phone calls. A $5/month VPS with Tailscale and an iptables rule is all you need.
The SIP trunk
Register your LiveKit instance with your SIP provider. The concepts are the same across providers:
- Get a phone number — DID (Direct Inward Dialing)
- Create a SIP trunk — point it at your LiveKit server’s public IP on port 5060
- Configure a dispatch rule in LiveKit — map inbound SIP calls to rooms
```shell
# Create SIP trunk (inbound)
lk sip trunk create \
  --name "my-trunk" \
  --numbers "+15551234567" \
  --inbound-addresses "your-sip-provider-ip"

# Create dispatch rule — put callers in a room
lk sip dispatch-rule create \
  --name "default" \
  --trunk-ids "your-trunk-id" \
  --room-prefix "call-"
```

When someone calls your number, LiveKit creates a room like `call-15551234567` and places the SIP participant in it. Your voice worker watches for new rooms and joins automatically.
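Because the dispatch rule prefixes room names with `call-`, the worker can recover the caller's number from the room name alone. A small helper, assuming the `call-<digits>` naming shown above (this function is illustrative, not part of the SDK):

```python
from typing import Optional

def caller_from_room(room_name: str, prefix: str = "call-") -> Optional[str]:
    """Extract the caller's E.164 number from a dispatch-rule room name."""
    if not room_name.startswith(prefix):
        return None  # not a SIP-dispatched room
    digits = room_name[len(prefix):]
    return f"+{digits}" if digits.isdigit() else None

print(caller_from_room("call-15551234567"))  # → +15551234567
```

Useful for greeting known callers by name or for the allowlist pattern discussed later.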
The voice worker
This is a Python process using the LiveKit Agents SDK:
```shell
pip install "livekit-agents[openai,deepgram,elevenlabs,silero]"
```

The minimal shape:
```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice import AgentSession, Agent
from livekit.plugins import deepgram, elevenlabs, openai


class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You are a helpful voice assistant.
            Keep responses concise — this is a phone call, not an essay.""",
        )


async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=elevenlabs.TTS(voice_id="your-voice-id"),
    )
    await session.start(agent=MyAgent(), room=ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(
        entrypoint_fnc=entrypoint,
        api_key="your-livekit-api-key",
        api_secret="your-livekit-api-secret",
        ws_url="ws://localhost:7880",
    ))
```

Run it:
```shell
python main.py start
```

The worker registers with LiveKit and waits. When a call comes in and creates a room, the worker joins and starts the conversation loop: listen → transcribe → think → speak.
Adding tools
The agent becomes useful when it can do things:
```python
from livekit.agents import function_tool


class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are Joel's assistant with access to calendar and tasks.",
        )

    @function_tool()
    async def check_calendar(self) -> str:
        """Check today's calendar events."""
        # Your calendar API call here
        return "You have 3 meetings today..."

    @function_tool()
    async def add_task(self, description: str) -> str:
        """Add a task to the todo list."""
        # Your task API call here
        return f"Added: {description}"
```

Tools are standard OpenAI-style function calling. The LLM decides when to invoke them. The agent speaks the result.
Durability
The voice worker needs to be running whenever a call might come in. If it’s down, callers hear ringing with no answer.
Docker Compose: `restart: unless-stopped` handles crashes.
Kubernetes: a Deployment with `replicas: 1` and health probes.
Bare metal / Mac: a process supervisor — `systemd` on Linux, `launchd` on macOS.
```yaml
# docker-compose.yml addition
voice-worker:
  build: ./voice-worker
  environment:
    - LIVEKIT_URL=ws://livekit:7880
    - LIVEKIT_API_KEY=your-key
    - LIVEKIT_API_SECRET=your-secret
    - DEEPGRAM_API_KEY=your-key
    - OPENAI_API_KEY=your-key
    - ELEVENLABS_API_KEY=your-key
  restart: unless-stopped
  depends_on:
    - livekit
```

The key insight: the worker is a long-running process, not a request handler. It maintains a WebSocket connection to LiveKit and responds to room events. Treat it like a daemon, not a web server.
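For the Kubernetes health probes mentioned above, one lightweight option is a stdlib HTTP endpoint running in a background thread beside the worker. A sketch, with the port, path, and function names all chosen here for illustration:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Report healthy; a real worker would also check its LiveKit connection here
        self.send_response(200 if self.path == "/healthz" else 404)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep probe traffic out of the worker's logs


def start_health_server(port: int = 8081) -> HTTPServer:
    """Serve /healthz on a daemon thread so it dies with the worker process."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Point the Deployment's `livenessProbe` at this port and a crashed or wedged worker gets restarted instead of silently dropping calls.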
Outbound calls
Your agent can also initiate calls:
```shell
# Create a SIP participant in a room (outbound call)
lk sip participant create \
  --room "morning-briefing" \
  --trunk-id "your-trunk-id" \
  --number "+15559876543" \
  --dtmf ""
```

Or via the API:
```python
from livekit import api

lk = api.LiveKitAPI(
    url="http://localhost:7880",
    api_key="your-key",
    api_secret="your-secret",
)

await lk.sip.create_sip_participant(
    api.CreateSIPParticipantRequest(
        room_name="morning-briefing",
        sip_trunk_id="your-trunk-id",
        sip_call_to="+15559876543",
    )
)
```

Your worker joins the room, the phone rings, the person answers, conversation starts. Use this for scheduled briefings, notifications, or anything where the agent needs to reach out.
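For scheduled briefings, a thin scheduler around the call-placing coroutine is enough. A sketch with the dialer injected so the timing logic stays testable; `place_call` would wrap the `create_sip_participant` request, and the `once` flag exists only to make the loop exercisable:

```python
import asyncio
import datetime as dt


def seconds_until(hour: int, minute: int, now: dt.datetime) -> float:
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += dt.timedelta(days=1)  # already past today; fire tomorrow
    return (target - now).total_seconds()


async def daily_call(place_call, hour=8, minute=0, once=False):
    """Sleep until hour:minute each day, then dial out."""
    while True:
        now = dt.datetime.now()
        await asyncio.sleep(0 if once else seconds_until(hour, minute, now))
        await place_call()  # e.g. the outbound SIP participant request above
        if once:
            return
```

In production you would likely hand this to cron or a systemd timer instead, but keeping it in the worker process means the call lands in a room the agent is already watching.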
The full Docker Compose
```yaml
version: "3.8"
services:
  livekit:
    image: livekit/livekit-server:latest
    ports:
      - "7880:7880"
      - "7881:7881"
      - "5060:5060/udp"
    volumes:
      - ./livekit.yaml:/etc/livekit.yaml
    command: ["--config", "/etc/livekit.yaml"]
    restart: unless-stopped
  voice-worker:
    build: ./voice-worker
    environment:
      LIVEKIT_URL: ws://livekit:7880
      LIVEKIT_API_KEY: ${LIVEKIT_API_KEY}
      LIVEKIT_API_SECRET: ${LIVEKIT_API_SECRET}
      DEEPGRAM_API_KEY: ${DEEPGRAM_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ELEVENLABS_API_KEY: ${ELEVENLABS_API_KEY}
    restart: unless-stopped
    depends_on:
      - livekit
```

Patterns worth knowing
**Caller allowlist.** Check the caller’s phone number before letting the agent engage. Reject or route to voicemail for unknown numbers.
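A minimal allowlist is number normalization plus set membership. A sketch, with illustrative numbers and a deliberately crude normalizer (a real one should handle country codes properly):

```python
ALLOWED_CALLERS = {"+15551234567", "+15559876543"}  # numbers the agent will talk to


def normalize(number: str) -> str:
    """Collapse a dialed number to digits with a leading + (E.164-ish)."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return f"+{digits}"


def is_allowed(number: str) -> bool:
    return normalize(number) in ALLOWED_CALLERS


print(is_allowed("+1 (555) 123-4567"))  # → True
```

Run this check before `session.start()`; unknown callers never reach the LLM.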
**Call transcripts.** Capture the full conversation and persist it. Every voice session produces structured data — decisions made, tasks created, questions asked. Feed this back into your agent’s memory.
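Persisting a transcript can be as simple as appending one JSON object per utterance to a file. A sketch; the schema and filename here are up to you:

```python
import json
import time
from pathlib import Path


def log_utterance(path: Path, speaker: str, text: str) -> None:
    """Append one conversation turn as a JSON line."""
    entry = {"ts": time.time(), "speaker": speaker, "text": text}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


transcript = Path("call-15551234567.jsonl")
log_utterance(transcript, "caller", "What's on my calendar?")
log_utterance(transcript, "agent", "You have 3 meetings today.")
```

JSONL is append-only and crash-safe per line, which suits a process that may die mid-call; anything downstream (memory, summarization) can re-parse it later.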
**Graceful interruption.** The LiveKit Agents SDK handles barge-in (talking over the agent) natively. But tune the VAD (Voice Activity Detection) sensitivity — too aggressive and background noise cuts the agent off; too passive and interruptions feel unresponsive.
**Session context.** Load relevant context before the call starts — today’s calendar, pending tasks, recent messages. The agent should know what’s going on before the human says a word.
**Latency budget.** STT (~200 ms) + LLM (~500-1500 ms) + TTS (~200 ms) = roughly 1-2 seconds minimum per turn. Users tolerate this on the first exchange, but it compounds over a conversation. Keep LLM responses short, and stream TTS (start speaking before the full response is generated).
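The budget is just addition; a quick sanity check using the stage figures quoted above:

```python
# Rough per-stage latency in milliseconds (figures from the text)
stt_ms, tts_ms = 200, 200
llm_low_ms, llm_high_ms = 500, 1500

best = stt_ms + llm_low_ms + tts_ms    # best case
worst = stt_ms + llm_high_ms + tts_ms  # worst case
print(f"per-turn latency: {best}-{worst} ms")  # → per-turn latency: 900-1900 ms
```

The LLM dominates the worst case, which is why shortening responses and streaming TTS are the highest-leverage optimizations.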
**Voice identity.** Pick a consistent voice. ElevenLabs lets you clone or design voices. The agent should sound the same every time — it’s part of the trust relationship.
Services and costs
| Service | Purpose | Rough cost |
|---|---|---|
| LiveKit | Media server | Free (self-hosted) |
| Telnyx | SIP trunk + phone number | ~$1/mo + $0.01/min |
| Deepgram | Speech-to-text | ~$0.0043/min |
| OpenAI / Anthropic | LLM | Varies by model |
| ElevenLabs | Text-to-speech | ~$0.18/1K chars |
| Cartesia | TTS (alternative, lower latency) | ~$0.10/1K chars |
Total for a personal voice agent with moderate daily use: roughly $20-50/month on top of your LLM costs.
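To sanity-check that estimate, plug the per-unit prices from the table into a quick calculation. The usage assumptions here are illustrative (10 minutes of calls a day, ~700 TTS characters per spoken minute, Cartesia for TTS):

```python
minutes = 10 * 30                # 10 min of calls/day over a month
chars_per_min = 700              # rough TTS output rate (assumption)

trunk = 1.00 + 0.01 * minutes    # Telnyx: ~$1/mo + $0.01/min
stt = 0.0043 * minutes           # Deepgram speech-to-text
tts = 0.10 * (chars_per_min * minutes / 1000)  # Cartesia: ~$0.10/1K chars

total = trunk + stt + tts
print(f"~${total:.2f}/month before LLM costs")  # → ~$26.29/month before LLM costs
```

TTS dominates the bill, which is why the lower-cost Cartesia tier (or shorter agent responses) moves the total more than anything else.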