Build a Voice Agent That Answers the Phone

voice · livekit · tutorial · agents · sip · patterns

You want your agent to answer the phone. Not a chatbot with a phone tree — an actual voice conversation backed by an LLM, with access to your tools, your calendar, your context. Here’s the shape of that system.

The topology

Four services, one room:

Phone call → SIP trunk provider → LiveKit server → Voice worker → LLM
                                                  → STT (speech-to-text)
                                                  → TTS (text-to-speech)

LiveKit is the media server. It handles WebRTC, SIP, room management, and media routing. Open source, self-hostable. This is the center of the system.

A SIP trunk provider (Telnyx, Twilio, Vonage) gives you a phone number and converts phone calls to SIP, which LiveKit speaks natively.

The voice worker is your code. It joins LiveKit rooms, listens to audio, runs it through STT, sends text to your LLM, and plays TTS audio back. The LiveKit Agents SDK (Python) handles the plumbing.

STT/TTS services: Deepgram for speech-to-text; ElevenLabs, Cartesia, or OpenAI for text-to-speech. The Agents SDK has built-in plugins for all of these.

Running LiveKit

Docker Compose

 
services:
  livekit:
    image: livekit/livekit-server:latest
    ports:
      - "7880:7880"   # HTTP + WebSocket
      - "7881:7881"   # RTC (WebRTC)
      - "7882:7882"   # TURN/TCP
      - "5060:5060/udp" # SIP
    volumes:
      - ./livekit.yaml:/etc/livekit.yaml
    command: ["--config", "/etc/livekit.yaml"]
    restart: unless-stopped

Kubernetes (Helm)

helm repo add livekit https://helm.livekit.io
helm install livekit livekit/livekit-server \
  -f livekit-values.yaml \
  --namespace voice

Either way, you need a config file:

# livekit.yaml
port: 7880
rtc:
  port_range_start: 50000
  port_range_end: 60000
  use_external_ip: true
keys:
  your-api-key: your-api-secret
sip:
  enabled: true  # recent LiveKit releases run SIP as a separate livekit-sip service; check the docs for your version
logging:
  level: info

Generate your API key/secret pair with the server binary:

livekit-server generate-keys

The SIP problem

SIP requires a public IP that can receive UDP traffic. If your server is behind NAT (home network, most cloud setups), you have two options:

  1. TURN relay — LiveKit has built-in TURN. Configure rtc.turn in the config.
  2. Public proxy — a tiny VM with a public IP that forwards SIP to your server via a VPN tunnel (Tailscale, WireGuard).

Option 2 is more reliable for phone calls. A $5/month VPS with Tailscale and an iptables rule is all you need.
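
As a sketch of option 2, here are the forwarding rules on the public VPS, assuming the LiveKit box is reachable over the tunnel at 100.64.0.2 (a hypothetical Tailscale address) and uses the RTC port range from the config above:

```shell
# On the public VPS: forward inbound SIP signaling (UDP 5060)
# to the LiveKit box over the tunnel.
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A PREROUTING -p udp --dport 5060 \
  -j DNAT --to-destination 100.64.0.2:5060

# Forward the RTC media ports too, or you'll get calls with no audio.
iptables -t nat -A PREROUTING -p udp --dport 50000:60000 \
  -j DNAT --to-destination 100.64.0.2

# Masquerade so replies route back out through the VPS.
iptables -t nat -A POSTROUTING -p udp -d 100.64.0.2 -j MASQUERADE
```

Point your SIP provider at the VPS's public IP; the tunnel carries everything else.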

The SIP trunk

Register your LiveKit instance with your SIP provider. The concepts are the same across providers:

  1. Get a phone number — DID (Direct Inward Dialing)
  2. Create a SIP trunk — point it at your LiveKit server’s public IP on port 5060
  3. Configure a dispatch rule in LiveKit — map inbound SIP calls to rooms
In recent versions of the lk CLI, trunks and dispatch rules are defined in JSON files:

# inbound-trunk.json
{
  "trunk": {
    "name": "my-trunk",
    "numbers": ["+15551234567"]
  }
}

# dispatch-rule.json — put each caller in their own room
{
  "dispatch_rule": {
    "rule": {
      "dispatchRuleIndividual": { "roomPrefix": "call-" }
    }
  }
}

lk sip inbound create inbound-trunk.json
lk sip dispatch create dispatch-rule.json

When someone calls your number, LiveKit creates a room like call-15551234567 and places the SIP participant in it. Your voice worker watches for new rooms and joins automatically.

The voice worker

This is a Python process using the LiveKit Agents SDK:

pip install "livekit-agents[openai,deepgram,elevenlabs,silero]"

The minimal shape:

from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice import AgentSession, Agent
from livekit.plugins import deepgram, elevenlabs, openai, silero
 
class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful voice assistant. "
                "Keep responses concise — this is a phone call, not an essay."
            ),
        )
 
async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
 
    session = AgentSession(
        vad=silero.VAD.load(),  # voice activity detection for turn-taking
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=elevenlabs.TTS(voice_id="your-voice-id"),
    )
 
    await session.start(agent=MyAgent(), room=ctx.room)
 
if __name__ == "__main__":
    cli.run_app(WorkerOptions(
        entrypoint_fnc=entrypoint,
        api_key="your-livekit-api-key",
        api_secret="your-livekit-api-secret",
        ws_url="ws://localhost:7880",
    ))

Run it:

python main.py start

The worker registers with LiveKit and waits. When a call comes in and creates a room, the worker joins and starts the conversation loop: listen → transcribe → think → speak.

Adding tools

The agent becomes useful when it can do things:

from livekit.agents import function_tool
 
class MyAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are Joel's assistant with access to calendar and tasks.",
        )
 
    @function_tool()
    async def check_calendar(self) -> str:
        """Check today's calendar events."""
        # Your calendar API call here
        return "You have 3 meetings today..."
 
    @function_tool()
    async def add_task(self, description: str) -> str:
        """Add a task to the todo list."""
        # Your task API call here
        return f"Added: {description}"

Tools are standard OpenAI-style function calling. The LLM decides when to invoke them. The agent speaks the result.

Durability

The voice worker needs to be running whenever a call might come in. If it’s down, callers hear ringing with no answer.

Docker Compose: restart: unless-stopped handles crashes.

Kubernetes: A Deployment with replicas: 1 and health probes.

Bare metal / Mac: A process supervisor — systemd on Linux, launchd on macOS.
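
For the systemd route, a minimal unit file might look like this (the paths, venv location, and .env file are placeholders for your own layout):

```ini
# /etc/systemd/system/voice-worker.service
[Unit]
Description=LiveKit voice worker
After=network-online.target

[Service]
WorkingDirectory=/opt/voice-worker
EnvironmentFile=/opt/voice-worker/.env
ExecStart=/opt/voice-worker/.venv/bin/python main.py start
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now voice-worker; Restart=always gives you the same crash recovery as restart: unless-stopped in Compose.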

# docker-compose.yml addition
  voice-worker:
    build: ./voice-worker
    environment:
      - LIVEKIT_URL=ws://livekit:7880
      - LIVEKIT_API_KEY=your-key
      - LIVEKIT_API_SECRET=your-secret
      - DEEPGRAM_API_KEY=your-key
      - OPENAI_API_KEY=your-key
      - ELEVENLABS_API_KEY=your-key
    restart: unless-stopped
    depends_on:
      - livekit

The key insight: the worker is a long-running process, not a request handler. It maintains a WebSocket connection to LiveKit and responds to room events. Treat it like a daemon, not a web server.

Outbound calls

Your agent can also initiate calls:

# sip-participant.json
{
  "sip_trunk_id": "your-trunk-id",
  "sip_call_to": "+15559876543",
  "room_name": "morning-briefing"
}

# Create the SIP participant (places the outbound call)
lk sip participant create sip-participant.json

Or via the API:

import asyncio

from livekit import api

async def main():
    lk = api.LiveKitAPI(
        url="http://localhost:7880",
        api_key="your-key",
        api_secret="your-secret",
    )
    await lk.sip.create_sip_participant(
        api.CreateSIPParticipantRequest(
            room_name="morning-briefing",
            sip_trunk_id="your-trunk-id",
            sip_call_to="+15559876543",
        )
    )
    await lk.aclose()

asyncio.run(main())

Your worker joins the room, the phone rings, the person answers, conversation starts. Use this for scheduled briefings, notifications, or anything where the agent needs to reach out.
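
For a scheduled briefing, the trigger can be as simple as computing the next call time and sleeping until then. A stdlib-only sketch (the 8:00 briefing hour is an arbitrary example; the actual dial-out is the SIP participant call shown earlier):

```python
from datetime import datetime, timedelta

def next_call_time(now: datetime, hour: int = 8) -> datetime:
    """Next occurrence of `hour`:00 local time, for a daily briefing call."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

# A scheduler loop would sleep until next_call_time(datetime.now()),
# place the outbound call, then repeat.
```

In production you'd likely reach for cron or a systemd timer instead, but the logic is the same.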

The full Docker Compose

version: "3.8"
 
services:
  livekit:
    image: livekit/livekit-server:latest
    ports:
      - "7880:7880"
      - "7881:7881"
      - "5060:5060/udp"
    volumes:
      - ./livekit.yaml:/etc/livekit.yaml
    command: ["--config", "/etc/livekit.yaml"]
    restart: unless-stopped
 
  voice-worker:
    build: ./voice-worker
    environment:
      LIVEKIT_URL: ws://livekit:7880
      LIVEKIT_API_KEY: ${LIVEKIT_API_KEY}
      LIVEKIT_API_SECRET: ${LIVEKIT_API_SECRET}
      DEEPGRAM_API_KEY: ${DEEPGRAM_API_KEY}
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ELEVENLABS_API_KEY: ${ELEVENLABS_API_KEY}
    restart: unless-stopped
    depends_on:
      - livekit

Patterns worth knowing

Caller allowlist. Check the caller’s phone number before letting the agent engage. Reject or route to voicemail for unknown numbers.
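
A sketch of that allowlist check, assuming numbers arrive in E.164 form (the normalization here is deliberately naive; a real system would use a proper phone-number library):

```python
ALLOWED_CALLERS = {"+15551234567", "+15559876543"}  # hypothetical numbers

def normalize(number: str) -> str:
    """Keep digits only, re-add a leading '+': strips spaces, dashes, parens."""
    return "+" + "".join(ch for ch in number if ch.isdigit())

def is_allowed(caller: str) -> bool:
    return normalize(caller) in ALLOWED_CALLERS
```

Run this before the agent says a word; unknown callers go to voicemail or a polite hangup.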

Call transcripts. Capture the full conversation and persist it. Every voice session produces structured data — decisions made, tasks created, questions asked. Feed this back into your agent’s memory.
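
A sketch of transcript persistence as append-only JSONL, one line per utterance (the field names are my own, not an SDK format):

```python
import json
import time
from pathlib import Path

def log_utterance(call_id: str, role: str, text: str,
                  log_dir: Path = Path("transcripts")) -> None:
    """Append one utterance ("user" or "agent") to the call's JSONL file."""
    log_dir.mkdir(exist_ok=True)
    entry = {"ts": time.time(), "role": role, "text": text}
    with open(log_dir / f"{call_id}.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
```

One file per call keeps sessions easy to replay, summarize, or feed into the agent's memory later.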

Graceful interruption. The LiveKit Agents SDK handles barge-in (talking over the agent) natively. But tune the VAD (Voice Activity Detection) sensitivity — too aggressive and it cuts off on background noise, too passive and it feels unresponsive.

Session context. Load relevant context before the call starts — today’s calendar, pending tasks, recent messages. The agent should know what’s going on before the human says a word.

Latency budget. STT (~200ms) + LLM (~500-1500ms) + TTS (~200ms) = ~1-2 seconds minimum. Users tolerate this for the first exchange but it compounds. Keep LLM responses short. Stream TTS (start speaking before the full response is generated).
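
The arithmetic behind streaming: time-to-first-audio is the sum of first-token/first-chunk latencies, not of full-generation times. Using rough figures in the ranges above (all numbers illustrative):

```python
# Rough per-stage latencies in milliseconds (illustrative, not measured).
stt = 200
llm_full, llm_first_sentence = 1200, 400
tts_full, tts_first_chunk = 200, 100

# Wait for the full response before synthesizing and speaking:
blocking = stt + llm_full + tts_full                  # 1600 ms

# Stream: start TTS on the first sentence, play the first chunk ASAP:
streaming = stt + llm_first_sentence + tts_first_chunk  # 700 ms
```

Same models, same hardware; pipelining alone cuts perceived latency roughly in half.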

Voice identity. Pick a consistent voice. ElevenLabs lets you clone or design voices. The agent should sound the same every time — it’s part of the trust relationship.

Services and costs

Service            | Purpose                          | Rough cost
-------------------|----------------------------------|--------------------
LiveKit            | Media server                     | Free (self-hosted)
Telnyx             | SIP trunk + phone number         | ~$1/mo + $0.01/min
Deepgram           | Speech-to-text                   | ~$0.0043/min
OpenAI / Anthropic | LLM                              | Varies by model
ElevenLabs         | Text-to-speech                   | ~$0.18/1K chars
Cartesia           | TTS (alternative, lower latency) | ~$0.10/1K chars

Total for a personal voice agent with moderate daily use: roughly $20-50/month on top of your LLM costs.
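
A back-of-the-envelope check on that estimate, using the table's rates plus two assumptions of my own: 30 minutes of calls a day, and the agent speaking about half the time at roughly 800 TTS characters per spoken minute:

```python
minutes = 30 * 30                     # 30 min/day over a month
telnyx = 1 + minutes * 0.01           # number rental + per-minute
deepgram = minutes * 0.0043           # STT runs on the whole call
tts_chars = (minutes / 2) * 800       # agent speaks ~half the time
cartesia = tts_chars / 1000 * 0.10    # the cheaper TTS option

total = telnyx + deepgram + cartesia  # ~$50/mo, before LLM costs
```

That lands at the top of the $20-50 range; lighter use or shorter agent turns pull it down quickly.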

Further reading