ADR-0041 (proposed)

First-Class Media Handling from Connected Channels

Context

Photos, audio, and video sent through connected channels (Telegram, future: iMessage, Slack) are silently dropped. The gateway extension only handles text events — any media attachment vanishes.

This is a real problem: Joel sends a photo from Telegram and the agent sees nothing. The system pretends it didn’t happen.

OpenClaw handles this comprehensively via src/media-understanding/ — a pipeline that normalizes attachments from any channel, classifies them (image/audio/video/document), runs appropriate processing (vision description, audio transcription, video description), and injects the results back into the conversation context. Their MsgContext carries MediaPaths, MediaUrls, MediaTypes arrays alongside the text body.

We don’t need OpenClaw’s full multi-provider abstraction. We need: receive media → store it → process it → notify the agent with results.

Decision

Build a media processing pipeline as Inngest durable functions, triggered by a new media/received event. The gateway extension and channel adapters (Telegram bot) detect media attachments, download them to local storage, and emit the event. Inngest handles the rest.

Option A: Inline Processing (rejected)

Process media directly in the gateway extension or Telegram handler. This is fast for small images, but it:

  • Blocks the event loop
  • Offers no retry on failure
  • Offers no observability
  • Can’t be reused by other channels

Option B: Inngest Pipeline (chosen)

Media follows the same pattern as video ingest: event-driven, step-level retry, claim-check for large files.

Channel → download to /tmp → emit media/received → Inngest pipeline:
  1. classify (mime sniffing + extension)
  2. process (vision/transcribe/OCR depending on type)
  3. store (NAS archive + vault note if significant)
  4. notify gateway (inject results into agent session)
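
A minimal sketch of this pipeline as a single Inngest function, assuming the v3 createFunction signature. The imported helpers are hypothetical module names; sketches for classifyMedia, describeImage, and transcribeAudio appear later in this document.

import { inngest } from "./client"; // assumed existing Inngest client
import { classifyMedia } from "./classify";
import { describeImage, describeVideo, extractDocumentText, transcribeAudio } from "./processors";
import { archiveToNas } from "./archive";

export const processMedia = inngest.createFunction(
  { id: "process-media" },
  { event: "media/received" },
  async ({ event, step }) => {
    const { localPath, mimeType, type, source, caption, originSession } = event.data;

    // 1. classify: magic-byte sniffing can override the channel-declared type
    const kind = await step.run("classify", () => classifyMedia(localPath, mimeType));

    // 2. process: dispatch on the classified kind (see the table under Architecture)
    const description = await step.run("process", async () => {
      switch (kind) {
        case "image": return describeImage(localPath, caption);
        case "audio": return transcribeAudio(localPath);
        case "video": return describeVideo(localPath);
        default:      return extractDocumentText(localPath);
      }
    });

    // 3. store: archive to NAS; claim-check, so only paths cross step boundaries
    const archivePath = await step.run("store", () => archiveToNas(localPath, source));

    // 4. notify: emit media/processed for the gateway middleware
    // (audio results would populate transcript rather than description; elided here)
    await step.sendEvent("notify", {
      name: "media/processed",
      data: { source, type, localPath, description, archivePath, originSession },
    });
  },
);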

Architecture

Event schema:

"media/received": {
  data: {
    source: "telegram" | "imessage" | "slack" | "cli";
    type: "image" | "audio" | "video" | "document";
    localPath: string;        // downloaded to /tmp already
    mimeType: string;
    fileName?: string;
    fileSize: number;
    caption?: string;         // user text accompanying the media
    originSession?: string;   // route response back to sender
    metadata?: {
      telegramFileId?: string;
      width?: number;
      height?: number;
      duration?: number;      // audio/video seconds
    };
  };
};
 
"media/processed": {
  data: {
    source: string;
    type: string;
    localPath: string;
    description?: string;     // vision/transcription output
    transcript?: string;      // audio transcription
    archivePath?: string;     // NAS path if archived
    vaultNotePath?: string;   // vault note if created
    originSession?: string;
  };
};
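
Registering both payloads with Inngest’s EventSchemas keeps event.data typed end to end. A sketch, assuming the client id joelclaw and eliding the metadata block for brevity:

import { EventSchemas, Inngest } from "inngest";

// Abbreviated mirrors of the two payloads above (metadata elided).
type Events = {
  "media/received": {
    data: {
      source: "telegram" | "imessage" | "slack" | "cli";
      type: "image" | "audio" | "video" | "document";
      localPath: string;
      mimeType: string;
      fileName?: string;
      fileSize: number;
      caption?: string;
      originSession?: string;
    };
  };
  "media/processed": {
    data: {
      source: string;
      type: string;
      localPath: string;
      description?: string;
      transcript?: string;
      archivePath?: string;
      vaultNotePath?: string;
      originSession?: string;
    };
  };
};

export const inngest = new Inngest({
  id: "joelclaw", // assumed id; reuse whatever client the video pipeline registers
  schemas: new EventSchemas().fromRecord<Events>(),
});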

Processing by type:

Type     | Processing                                            | Tool
---------|--------------------------------------------------------|--------------------------------------------
image    | Vision description → text summary                     | Claude vision API (base64)
audio    | Transcription → text                                  | mlx-whisper (local, same as video pipeline)
video    | Extract audio → transcribe; sample frames → describe  | mlx-whisper + vision
document | Text extraction (PDF, etc.)                           | defuddle / pdf-parse
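
The dispatch above is only as good as the classify step feeding it. A sketch of classifyMedia, assuming the file-type package for magic-byte sniffing (any sniffer works), with the channel-declared MIME type and the extension as fallbacks:

import path from "node:path";
import { fileTypeFromFile } from "file-type"; // magic-byte sniffing (assumed dependency)

export type MediaKind = "image" | "audio" | "video" | "document";

export async function classifyMedia(localPath: string, declaredMime: string): Promise<MediaKind> {
  // Channels sometimes lie about MIME types; trust the file's magic bytes first.
  const sniffed = await fileTypeFromFile(localPath);
  const mime = sniffed?.mime ?? declaredMime;

  if (mime.startsWith("image/")) return "image";
  if (mime.startsWith("audio/")) return "audio";
  if (mime.startsWith("video/")) return "video";

  // Last resort: voice notes occasionally arrive with a generic MIME type.
  const ext = path.extname(localPath).toLowerCase();
  if ([".ogg", ".oga", ".m4a", ".mp3", ".wav"].includes(ext)) return "audio";

  return "document";
}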

Storage:

  • /tmp/joelclaw-media/{uuid}.{ext} — ephemeral processing
  • joel@three-body:/volume1/home/joel/media/{year}/{source}/ — NAS archive
  • ~/Vault/Resources/media/ — vault note with description/transcript (optional, only for significant content)

Gateway integration: The media/processed event routes through the existing gateway middleware. The agent receives a formatted message:

## 📎 Media received (image from Telegram)
 
[Description of what's in the image]
 
Caption: "personal note re kristina's upcoming surgery"
Source: /tmp/joelclaw-media/abc123.jpg
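
A sketch of the formatter producing that message. Note that caption is assumed to be carried through from media/received; it is not in the media/processed schema above, so either add it there or look it up from the origin session.

// Hypothetical formatter for the gateway notification shown above.
type MediaNotification = {
  source: string;
  type: string;
  localPath: string;
  description?: string;
  transcript?: string;
  caption?: string; // assumption: carried through from media/received
};

export function formatMediaNotification(e: MediaNotification): string {
  const lines = [
    `## 📎 Media received (${e.type} from ${e.source})`,
    "",
    e.description ?? e.transcript ?? "(no processing output)",
    "",
  ];
  if (e.caption) lines.push(`Caption: "${e.caption}"`);
  lines.push(`Source: ${e.localPath}`);
  return lines.join("\n");
}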

Phases

Phase 1: Images from Telegram (this ADR)

  • Telegram bot downloads photo via Bot API getFile
  • Saves to /tmp/joelclaw-media/
  • Emits media/received with type=image
  • Inngest function: base64 encode → Claude vision → description text
  • Gateway notification with description + original caption
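
A sketch of the two Phase 1 pieces: resolving a Telegram file_id to local bytes via getFile, then describing the image with the Anthropic SDK. The model id is a placeholder, and TELEGRAM_BOT_TOKEN / ANTHROPIC_API_KEY are assumed to be in the environment.

import { randomUUID } from "node:crypto";
import { mkdir, readFile, writeFile } from "node:fs/promises";
import Anthropic from "@anthropic-ai/sdk";

const TG_API = `https://api.telegram.org/bot${process.env.TELEGRAM_BOT_TOKEN}`;
const TG_FILES = `https://api.telegram.org/file/bot${process.env.TELEGRAM_BOT_TOKEN}`;

// Resolve a file_id via Bot API getFile, then save the bytes under /tmp/joelclaw-media.
// (Photo messages carry several sizes; the caller should pass the largest file_id.)
export async function downloadTelegramPhoto(fileId: string): Promise<string> {
  const meta = await fetch(`${TG_API}/getFile?file_id=${fileId}`);
  const { result } = (await meta.json()) as { result: { file_path: string } };

  const bytes = await fetch(`${TG_FILES}/${result.file_path}`);
  const localPath = `/tmp/joelclaw-media/${randomUUID()}.jpg`;
  await mkdir("/tmp/joelclaw-media", { recursive: true });
  await writeFile(localPath, Buffer.from(await bytes.arrayBuffer()));
  return localPath;
}

// Base64-encode the saved image and ask Claude for a description.
export async function describeImage(localPath: string, caption?: string): Promise<string> {
  const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY
  const data = (await readFile(localPath)).toString("base64");

  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514", // placeholder; pin whatever the gateway already uses
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: [
        { type: "image", source: { type: "base64", media_type: "image/jpeg", data } },
        { type: "text", text: caption ? `Describe this image. Sender's caption: "${caption}"` : "Describe this image." },
      ],
    }],
  });

  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}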

Phase 2: Audio + Voice Messages

  • Telegram voice messages (.ogg) and audio files
  • Reuse existing mlx-whisper transcription from video pipeline
  • Inngest function: convert if needed → transcribe → text
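
A sketch of the transcription step, shelling out to the mlx_whisper CLI the way the video pipeline already does. The flags mirror the openai-whisper CLI and the model path is a placeholder; verify both against the installed mlx-whisper version.

import { execFile } from "node:child_process";
import { readFile } from "node:fs/promises";
import { promisify } from "node:util";
import path from "node:path";

const run = promisify(execFile);

export async function transcribeAudio(localPath: string): Promise<string> {
  const outDir = "/tmp/joelclaw-media";

  // mlx_whisper decodes .ogg/.m4a via ffmpeg internally, so no explicit conversion step.
  await run("mlx_whisper", [
    localPath,
    "--model", "mlx-community/whisper-large-v3-mlx", // placeholder model path
    "--output-dir", outDir,
    "--output-format", "txt",
  ]);

  // The CLI writes <basename>.txt into the output dir.
  const base = path.basename(localPath, path.extname(localPath));
  return readFile(path.join(outDir, `${base}.txt`), "utf8");
}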

Phase 3: Video + Documents

  • Short video clips: frame extraction + audio transcription
  • Documents: text extraction via defuddle/pdf-parse
  • Archive to NAS
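
Both video steps can lean on ffmpeg, which the whisper tooling already requires. A sketch; the sampling interval and audio format are assumptions to tune.

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Sample one frame every 10 seconds for vision description.
export async function sampleFrames(videoPath: string, outDir: string): Promise<void> {
  await run("ffmpeg", [
    "-i", videoPath,
    "-vf", "fps=1/10", // one frame per 10 seconds
    "-q:v", "2",       // high-quality JPEGs
    `${outDir}/frame_%03d.jpg`,
  ]);
}

// Strip the audio track to 16 kHz mono WAV for mlx-whisper.
export async function extractAudio(videoPath: string, wavPath: string): Promise<void> {
  await run("ffmpeg", ["-i", videoPath, "-vn", "-ac", "1", "-ar", "16000", wavPath]);
}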

Phase 4: Multi-Channel

  • Abstract the download step per channel
  • iMessage attachments (via BlueBubbles or imessage extension pattern from OpenClaw)
  • Slack file uploads

Consequences

  • Media stops being silently dropped. The agent sees what Joel sends.
  • Durable processing. Large audio files or slow vision API calls retry independently.
  • Claim-check pattern. File paths passed between steps, not base64 blobs in event payloads.
  • Channel-agnostic. Any channel that can download a file and emit media/received works.
  • Cost. One vision API call per image; audio transcription stays on local mlx-whisper (free but GPU-bound).
  • Privacy. Images go to Claude API for vision. Audio stays local (mlx-whisper). This matches existing video pipeline behavior.

Credits

  • OpenClaw src/media-understanding/ — attachment normalization, capability-based processing, provider abstraction pattern
  • OpenClaw src/media/input-files.ts — MIME detection, size limits, PDF extraction
  • Existing joelclaw video pipeline — claim-check pattern, mlx-whisper integration, NAS archival