WebSocket Voice Turns LLM Agents from Request Cycles into Conversations

Tags: article, ai, websocket, voice, openai, livekit, realtime, infrastructure, agent-loops, event-bus, tools

Persistent websocket streams map better to joelclaw's event-driven loops than request-response polling does when voice sessions drive actions.

Most voice-agent glue code treats talking to an LLM as a series of tiny, isolated API calls: send an audio chunk, wait, repeat. That model adds latency at every round trip and throws away conversational flow. The OpenAI WebSocket mode guide reframes voice as a persistent stream instead, which is closer to a conversation than a queue. Pairing that transport with LiveKit puts audio, state, and control on one continuously open lane.

The clever part is not that the docs are shiny; it is that this shape fits how the system already thinks. In joelclaw, you have already paid the cost of routing every useful event through an event bus, and Inngest durable runs already assume long-lived, replayable flow. A websocket session can emit partial transcriptions, interruption signals, and action intents as they happen instead of waiting for each response boundary. For agent loops, that removes a layer of conversion glue.
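As a rough illustration of what "emit as they happen" could look like, here is a minimal sketch that maps incoming stream frames onto bus events. The frame shapes, the `voice/...` event names, and the `toBusEvent` helper are all hypothetical; they are not taken from the OpenAI guide, LiveKit, or joelclaw, and a real provider's frame types will differ.

```typescript
// Hypothetical frame shapes for illustration only; real provider frames differ.
type VoiceFrame =
  | { kind: "transcript.partial"; sessionId: string; seq: number; text: string }
  | { kind: "interruption"; sessionId: string; seq: number }
  | {
      kind: "action.intent";
      sessionId: string;
      seq: number;
      tool: string;
      args: Record<string, unknown>;
    };

// Hypothetical bus event envelope: a name, a deterministic id, and a payload.
interface BusEvent {
  name: string;
  id: string;
  data: Record<string, unknown>;
}

// Translate one stream frame into one bus event.
// The id is derived from session + sequence number, so a frame that is
// delivered twice produces the same id and can be deduplicated downstream.
function toBusEvent(frame: VoiceFrame): BusEvent {
  const id = `${frame.sessionId}:${frame.seq}`;
  switch (frame.kind) {
    case "transcript.partial":
      return {
        name: "voice/transcript.partial",
        id,
        data: { sessionId: frame.sessionId, text: frame.text },
      };
    case "interruption":
      return { name: "voice/interrupted", id, data: { sessionId: frame.sessionId } };
    case "action.intent":
      return {
        name: "voice/action.requested",
        id,
        data: { sessionId: frame.sessionId, tool: frame.tool, args: frame.args },
      };
  }
}
```

The point of the sketch is that each frame becomes an event on arrival, with no buffering until a response boundary; the deterministic id is one possible way to keep replays safe, not a requirement of any of the tools named above.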

For a practical build, the upside and the risk are equally obvious: long-lived connections need cleanup and recovery discipline. WebSockets can be dropped, token streams can desync, and the wrong defaults make retry behavior look random. Treating voice as stream events is a transport win only if ownership boundaries, backpressure, and idempotent side effects are explicit. That makes this relevant to joelclaw gateway and system observability, where the hard part is not speed but keeping a durable audit trail of what happened during a live stream.
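To make "idempotent side effects" and "audit trail" concrete, here is a small sketch under stated assumptions: `IdempotentExecutor`, `AuditEntry`, and the key format are hypothetical names, and the in-memory `Set` and array stand in for whatever durable store a real gateway would use.

```typescript
// Hypothetical audit record: which key was seen, what happened, and when.
type AuditEntry = {
  key: string;
  outcome: "executed" | "skipped-duplicate";
  at: number;
};

// Runs a side effect at most once per idempotency key and records every attempt.
// In-memory only; a real system would need durable storage for both structures.
class IdempotentExecutor {
  private seen = new Set<string>();
  readonly audit: AuditEntry[] = [];

  run(key: string, effect: () => void): AuditEntry["outcome"] {
    if (this.seen.has(key)) {
      // A replayed or re-sent frame: record it, but do not repeat the effect.
      this.audit.push({ key, outcome: "skipped-duplicate", at: Date.now() });
      return "skipped-duplicate";
    }
    // Mark the key only after the effect returns, so a throwing effect
    // leaves the key unmarked and can be retried.
    effect();
    this.seen.add(key);
    this.audit.push({ key, outcome: "executed", at: Date.now() });
    return "executed";
  }
}

// Usage: the same action intent arrives twice after a reconnect.
const executor = new IdempotentExecutor();
let lightsToggled = 0;
executor.run("session-1:42", () => { lightsToggled += 1; });
executor.run("session-1:42", () => { lightsToggled += 1; });
// lightsToggled is 1, and executor.audit holds two entries.
```

The audit list keeps the duplicate attempt as well as the executed one, which is the part that matters for reconstructing what happened during a live stream.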

Key Ideas

  • Session = open channel: websocket voice keeps the transport alive across turns, so transcription, tool calls, and control signals can flow without one-off round trips.
  • Lower coordination friction: when stream payloads are turned into events, agent loops can process partials and still maintain deterministic handoff points.
  • Resilience is part of the API surface: stateful streams force explicit lifecycle handling such as reconnect, heartbeat, and cancellation semantics, expressed in the same event model your event bus uses.
  • Gateway architecture fit: joelclaw gateway can treat websocket frames as in-flight commands, which aligns with existing worker and Inngest integration patterns.
  • Less glue, clearer contract: you still need protocol discipline, but fewer adapters around start/stop loops means less state conversion overhead in production.
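
The lifecycle handling mentioned in the list above (reconnect, heartbeat, cancellation) can be sketched as a small pure state machine plus a backoff schedule. The state names, event names, and default delays here are assumptions chosen for illustration, not values from the OpenAI guide or from joelclaw.

```typescript
// Hypothetical session states and lifecycle events.
type SessionState = "connecting" | "open" | "reconnecting" | "closed";
type LifecycleEvent = "opened" | "heartbeat.missed" | "dropped" | "cancelled";

// Pure transition function: given the current state and an event,
// return the next state. "closed" is terminal.
function nextState(state: SessionState, event: LifecycleEvent): SessionState {
  if (state === "closed") return "closed";
  switch (event) {
    case "opened":
      return "open";
    case "heartbeat.missed":
    case "dropped":
      return "reconnecting";
    case "cancelled":
      return "closed";
  }
}

// Capped exponential backoff for reconnect attempts (attempt starts at 0).
// Jitter is left out to keep the sketch deterministic; production code
// would usually add some.
function reconnectDelayMs(attempt: number, baseMs = 250, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Because both functions are pure, each transition can itself be published as a bus event and replayed later, which keeps reconnect behavior explainable instead of random-looking.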