Agent-Secrets Lease Deduplication & OTEL Integration
Context
The agent-secrets daemon accumulates thousands of stale leases because the system-bus worker (and other callers) acquire new leases on every restart without revoking old ones. With 24h TTLs and frequent restarts (~8/day), this produces 2,600+ active leases for 56 secrets.
Additionally, the daemon operates without any observability — crashes, restart cycles, and lease churn are invisible unless someone manually checks logs.
Observed failure (2026-02-22)
- Daemon entered crash loop (3 restarts in 3 minutes) after an unauthorized signal attempt.
- launchd KeepAlive auto-healed, but the zombie socket window caused ~2 min of downtime.
- 2,625 active leases found (2,206 with 24h TTL), 13MB audit log.
- Root cause of lease accumulation: start.sh and serve.ts both use --ttl 24h and create new leases on every worker restart.
Decision
1. Lease deduplication by client_id + secret_name
When a client acquires a lease for a secret it already holds an active lease on, the daemon replaces the old lease (revoke + delete) rather than stacking a new one. This matches how most credential managers work.
- Lookup key: (client_id, secret_name), where the existing lease is not expired and not revoked.
- Old lease is silently revoked and removed from the map.
- Audit logs both the revocation (as lease_replace) and the new acquisition.
- No CLI changes needed: dedup is server-side behavior.
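The replacement rule can be sketched as follows. This is a minimal in-memory illustration, not the daemon's actual code: the Lease and LeaseStore names, the lease_acquire audit event name, and the now parameter are assumptions made for the example; only the (client_id, secret_name) key and the lease_replace audit entry come from this document.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class Lease:
    id: str
    client_id: str
    secret_name: str
    expires_at: float
    revoked: bool = False


class LeaseStore:
    """Hypothetical in-memory lease map with dedup on (client_id, secret_name)."""

    def __init__(self):
        self.leases = {}  # lease id -> Lease
        self.audit = []   # (event, lease id) tuples

    def acquire(self, client_id, secret_name, ttl_seconds, now=None):
        now = time.time() if now is None else now
        # Replace any active lease this client already holds on this secret.
        for old in list(self.leases.values()):
            same_key = (old.client_id, old.secret_name) == (client_id, secret_name)
            if same_key and not old.revoked and old.expires_at > now:
                old.revoked = True
                del self.leases[old.id]
                self.audit.append(("lease_replace", old.id))
        lease = Lease(uuid.uuid4().hex, client_id, secret_name, now + ttl_seconds)
        self.leases[lease.id] = lease
        # "lease_acquire" is an assumed event name for the new acquisition.
        self.audit.append(("lease_acquire", lease.id))
        return lease
```

Under this sketch, a worker that restarts and re-acquires the same secret always ends up holding exactly one lease for it, while leases held by other clients on the same secret are left untouched.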
2. OTEL integration via joelclaw CLI
When joelclaw is available on the system, the daemon emits structured OTEL events for:
- daemon.started / daemon.stopped: lifecycle events
- lease.acquired / lease.replaced / lease.expired: lease operations (sampled: only replacements and errors at info level, routine acquires at debug)
- daemon.crash_recovered: emitted when startup detects and removes a stale socket
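The stale-socket check behind daemon.crash_recovered might look roughly like the sketch below. It assumes the daemon listens on a Unix domain socket and that "stale" means the socket file exists but nothing accepts connections on it; the function name remove_stale_socket is hypothetical and the real detection logic may differ.

```python
import os
import socket


def remove_stale_socket(path):
    """Return True if a leftover socket file was removed, False otherwise."""
    if not os.path.exists(path):
        return False  # nothing left behind by a previous run
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except (ConnectionRefusedError, FileNotFoundError):
        # Nobody is listening: treat the file as debris from a crashed daemon.
        try:
            os.unlink(path)
        except FileNotFoundError:
            return False  # someone else removed it first
        return True
    finally:
        probe.close()
    return False  # a live daemon answered, so leave its socket alone
```

A True return value is the point at which the daemon would emit daemon.crash_recovered before binding a fresh socket.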
Emission is best-effort (fire-and-forget exec) and never blocks daemon operation.
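A minimal sketch of best-effort emission, under stated assumptions: the subcommand used to emit an event ("otel emit" here) and the JSON payload shape are hypothetical placeholders, since this document only names joelclaw otel list/search/stats. Treating lease.expired as debug and lifecycle events as info is also an assumption beyond the sampling rule above.

```python
import json
import shutil
import subprocess


def level_for(event, is_error=False):
    """Sampling rule: replacements and errors at info, routine lease traffic at debug."""
    if is_error or event == "lease.replaced":
        return "info"
    if event.startswith("lease."):
        return "debug"  # assumption: lease.expired is treated like a routine acquire
    return "info"       # assumption: lifecycle events are always info


def emit_event(event, attrs, is_error=False,
               argv_prefix=("joelclaw", "otel", "emit")):  # hypothetical subcommand
    """Spawn the CLI without waiting on it. Returns True if a child was started."""
    exe = shutil.which(argv_prefix[0])
    if exe is None:
        return False  # CLI not installed: skip silently
    payload = json.dumps({"event": event, "level": level_for(event, is_error), **attrs})
    try:
        subprocess.Popen(
            [exe, *argv_prefix[1:], payload],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # never let telemetry failures reach the caller
    return True
```

Because the child is never waited on and every failure path returns False, a missing or broken CLI costs the daemon at most one failed spawn per event.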
Consequences
- Lease count drops from O(restarts × secrets × TTL) to O(unique clients × secrets).
- leases.json stays small (< 100 entries typically).
- Audit log growth slows proportionally.
- OTEL events make daemon health visible in joelclaw otel list/search/stats.
- No breaking changes to CLI or RPC protocol.
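The change in growth rate can be illustrated with a back-of-the-envelope model. It assumes every restart re-acquires every secret and that leases only disappear when their TTL expires; it is illustrative and does not try to reproduce the observed count of 2,625, which involved several callers and mixed TTLs. The function names are made up for the example.

```python
def leases_without_dedup(clients, secrets_per_client, restarts_per_day, ttl_hours):
    """Steady-state lease count when each restart stacks a new lease per secret."""
    restarts_within_ttl = restarts_per_day * ttl_hours / 24
    # At least one lease per secret exists even if restarts are rarer than the TTL.
    return int(clients * secrets_per_client * max(1, restarts_within_ttl))


def leases_with_dedup(clients, secrets_per_client):
    """With replacement on (client_id, secret_name), restarts no longer matter."""
    return clients * secrets_per_client


# One caller, 56 secrets, 8 restarts/day, 24h TTL:
print(leases_without_dedup(1, 56, 8, 24))  # 448
print(leases_with_dedup(1, 56))            # 56
```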