Full-Stack Observability + JoelClaw Design System
Status
implemented
Context
The system has been running for 7 days with zero observability infrastructure. ADR-0006 (Prometheus + Grafana) and ADR-0033 (VictoriaMetrics + Grafana) were proposed but never deployed. Both targeted traditional infrastructure metrics (CPU, memory, pod restarts) for a system that has since evolved into an event-driven agent platform with 61+ Inngest functions, a memory pipeline, webhook integrations, and a real-time web dashboard.
The Actual Pain (2026-02-21 session)
Every major bug discovered today was found by accident, not by alerts:
| Failure | Duration Silent | How Discovered |
|---|---|---|
| Vercel webhook 401s (missing secret) | Days | Manual log grep |
| Recall doing keyword search, not semantic | Since Typesense migration | Code audit |
| Gateway replay flood (389 stale messages) | Since message-store creation | User report |
| Convex dual-write silently dropping | Unknown | Code audit |
| Echo-fizzle implemented but never registered | Since implementation | Code audit |
| Batch digest never flushing (3 events stuck) | Hours | Manual Redis inspect |
The pattern: the system fails quietly and Joel stumbles across it later. Traditional metrics wouldn’t have caught most of these — they’re pipeline logic failures, not resource exhaustion.
What Exists Today
- slog: Manual CLI for infrastructure changes. Append-only JSONL. Good for audit trail, bad for real-time.
- Worker stderr: Structured-ish console.log. Not queryable. Lost on restart.
- Inngest dashboard: Step traces per function run. Requires manual inspection.
- Heartbeat: 15-minute health check. Checks pod liveness but not pipeline correctness.
- joelclaw.com: /network (live pod status), /syslog (manual slog entries), /memory (observations), /dashboard (stats).
- Convex contentResources: Unified polymorphic table. Already stores network status, syslog, memory observations.
Design System Gap
joelclaw.com pages are built ad-hoc — each page reinvents card layouts, filter chips, status badges, stat readouts. No shared component library, no consistent design language, no reusable patterns. The shadcn registry pattern solves this: components are authored in the monorepo, published as a registry, and consumed via `shadcn add`.
Decision
1. Observability Architecture
Structured event logging → Typesense (search + high cardinality) → Convex (real-time UI) → joelclaw.com (dashboards) → Agent (auto-diagnosis) → Telegram (escalation)
Collection Layer
Every subsystem emits structured log events with consistent schema:
```typescript
interface OtelEvent {
  id: string;                        // UUID
  timestamp: number;                 // epoch ms
  level: "debug" | "info" | "warn" | "error" | "fatal";
  source: string;                    // e.g. "worker", "gateway", "webhook", "inngest", "memory"
  component: string;                 // e.g. "observe.ts", "vercel-provider", "recall"
  action: string;                    // e.g. "webhook.received", "function.completed", "search.executed"
  duration_ms?: number;              // for timed operations
  success: boolean;
  error?: string;                    // error message if !success
  metadata: Record<string, unknown>; // high-cardinality fields — function ID, event ID, deployment ID, etc.
}
```

Sources:
- Worker: Instrument all Inngest function starts/completions/failures. Webhook receipt + verification. Typesense/Convex writes.
- Gateway: Event drain cycles. Message store operations. Telegram send/receive. Session health.
- Infrastructure: Pod status (from existing network-status-update). Daemon health. Disk/memory (from `vm_stat`, `df`).
- Pipeline: Memory observe→reflect→triage→promote chain. Content sync. Friction detection runs.
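A shared emit helper could give all of these sources a uniform envelope. The sketch below is illustrative only — `withOtelEvent` and the sink signature are hypothetical, not the actual packages/system-bus API (an async variant would be analogous):

```typescript
import { randomUUID } from "node:crypto";

type Level = "debug" | "info" | "warn" | "error" | "fatal";

interface OtelEvent {
  id: string;
  timestamp: number;
  level: Level;
  source: string;
  component: string;
  action: string;
  duration_ms?: number;
  success: boolean;
  error?: string;
  metadata: Record<string, unknown>;
}

type Sink = (event: OtelEvent) => void;

// Wrap a timed operation so duration, success, and failure are captured
// uniformly regardless of which subsystem emits the event.
function withOtelEvent<T>(
  sink: Sink,
  base: Pick<OtelEvent, "source" | "component" | "action">,
  fn: () => T,
  metadata: Record<string, unknown> = {},
): T {
  const start = Date.now();
  try {
    const result = fn();
    sink({ id: randomUUID(), timestamp: Date.now(), level: "info", success: true, duration_ms: Date.now() - start, metadata, ...base });
    return result;
  } catch (err) {
    // Failures still emit — this is exactly the "silent failure" gap being closed.
    sink({ id: randomUUID(), timestamp: Date.now(), level: "error", success: false, duration_ms: Date.now() - start, error: String(err), metadata, ...base });
    throw err;
  }
}
```

Wrapping at this layer means a function that throws can never fail silently: the error event is written before the exception propagates.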
Storage Layer
- Typesense `otel_events` collection: Primary store. Auto-embedding on `action + error + metadata` for semantic search. Faceted on `source`, `component`, `level`, `success`. High-cardinality fields in metadata (function IDs, deployment IDs, session IDs). Retention: unlimited (storage is cheap, NAS-backed archive for cold data).
- Convex `contentResources` with type `otel_event`: Real-time reactive feed for the UI. Rolling watch window — default last 30 minutes (`OTEL_EVENTS_CONVEX_WINDOW_HOURS=0.5`) of warn/error/fatal events. Purge older events opportunistically. Debug/info events stay in Typesense only.
- Redis streams: Hot buffer for real-time alerting. Agent subscribes. Events older than 1h auto-trimmed.
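The mirroring and purge rules above reduce to two small decisions, sketched here with illustrative function names (not the real store adapter API):

```typescript
// Only warn+ events are mirrored to Convex; debug/info stay in Typesense.
const MIRRORED_LEVELS = new Set(["warn", "error", "fatal"]);

function mirrorsToConvex(level: string): boolean {
  return MIRRORED_LEVELS.has(level);
}

// Events older than the rolling window (default 0.5 h = 30 minutes, read from
// OTEL_EVENTS_CONVEX_WINDOW_HOURS) are purge candidates on the next write.
function purgeCutoff(
  nowMs: number,
  windowHours: number = Number(process.env.OTEL_EVENTS_CONVEX_WINDOW_HOURS ?? "0.5"),
): number {
  return nowMs - windowHours * 3_600_000; // 3,600,000 ms per hour
}
```

Keeping the window in an env var means the Convex write volume can be tuned without a deploy of the adapter logic.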
Agent Consumption
The agent gets observability data through:
- Proactive: Heartbeat check function queries Typesense for error rate in last 15 minutes. If above threshold → investigate → auto-fix or escalate.
- Reactive: `joelclaw otel` CLI command — query events by source, level, time range, component. Agent uses this mid-session for diagnosis.
- Alert stream: Redis pub/sub for fatal/error events. Gateway extension subscribes and injects into agent context.
Escalation
Agent acts first. Escalation to Joel (via Telegram) only when:
- Error rate exceeds threshold AND agent can’t auto-fix
- Fatal events (pod crash, worker down)
- Pipeline stall detected (no events from a source for > 30 minutes)
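The stall rule above is a simple threshold over last-seen timestamps per source. A sketch, assuming the last-seen map comes from a Typesense aggregation (the function name is illustrative):

```typescript
// A source with no events for > 30 minutes is considered stalled.
const STALL_THRESHOLD_MS = 30 * 60 * 1000;

function stalledSources(lastSeenMs: Record<string, number>, nowMs: number): string[] {
  return Object.entries(lastSeenMs)
    .filter(([, ts]) => nowMs - ts > STALL_THRESHOLD_MS)
    .map(([source]) => source)
    .sort(); // stable output for alert deduplication
}
```

Note this catches absence-of-events failures — the class that error-rate thresholds miss entirely, like the batch digest that never flushed.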
2. JoelClaw Design System (shadcn Registry)
Bootstrap a shadcn component registry in the monorepo for consistent, mobile-first UI across all joelclaw.com pages.
Registry Structure
```
packages/ui/
├── registry.json                  # shadcn registry manifest
├── src/
│   ├── components/
│   │   ├── status-badge.tsx       # ● green/yellow/red with label
│   │   ├── metric-card.tsx        # stat readout with trend arrow
│   │   ├── event-timeline.tsx     # chronological event feed
│   │   ├── filter-chips.tsx       # faceted filter bar (reusable)
│   │   ├── pipeline-flow.tsx      # visual pipeline stage indicators
│   │   ├── alert-banner.tsx       # error/warning banner with action
│   │   ├── data-table.tsx         # sortable, filterable table
│   │   ├── sparkline.tsx          # inline mini chart for trends
│   │   ├── section-header.tsx     # font-pixel uppercase header
│   │   └── search-input.tsx       # Typesense-powered search
│   ├── hooks/
│   │   ├── use-otel-events.ts     # Convex useQuery for real-time events
│   │   ├── use-pipeline-health.ts # aggregated pipeline status
│   │   └── use-system-status.ts   # infrastructure health summary
│   └── lib/
│       ├── otel-client.ts         # fetch wrapper for /api/otel
│       └── format.ts              # duration, timestamp, byte formatters
```

Design Tokens
- Font: Geist Pixel Square for data readouts + section headers (bitmap/terminal feel), Geist Mono for body
- Theme: catppuccin-macchiato (matches vault + code blocks)
- Mobile-first: Touch targets ≥ 44px, single-column default, responsive breakpoints
- Status colors: Green (operational), Yellow (degraded), Red (down), Neutral (unknown)
Component Composition Pattern
Following Vercel composition patterns (ADR skill): compound components, render props for customization, context providers for shared state. Server Components for data fetching, Client Components only for interactivity.
```tsx
// Example: Pipeline health dashboard composition
<PipelineHealth>
  <PipelineHealth.Summary />                     {/* Server Component — ISR cached */}
  <PipelineHealth.StageList>                     {/* Server Component */}
    <PipelineHealth.Stage name="webhooks" />
    <PipelineHealth.Stage name="memory" />
    <PipelineHealth.Stage name="content-sync" />
  </PipelineHealth.StageList>
  <PipelineHealth.EventFeed />                   {/* Client Component — Convex useQuery */}
</PipelineHealth>
```

3. Dashboard Pages
/system (new — replaces need for Grafana)
Mobile-first system overview. Three sections:
- Health Summary — traffic light for each subsystem (infra, pipeline, agent). Server Component, ISR 60s.
- Event Feed — real-time stream of warn/error events from Convex. Client Component. Filterable by source/level.
- Pipeline Stages — visual flow showing each pipeline’s last-run status, throughput, error rate. Server Component.
/system/events (new)
Full event explorer. Typesense-powered search across all otel_events. Faceted filters: source, component, level, time range. High cardinality metadata searchable. Mobile-friendly table with expandable rows.
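The explorer's queries reduce to a small Typesense search-parameter builder. A sketch — field names follow the OtelEvent schema above, but the helper itself is illustrative, not the actual API route code:

```typescript
interface EventQuery {
  q?: string;       // free-text search; empty means "match all"
  source?: string;  // facet filter
  level?: string;   // facet filter
  sinceMs?: number; // time-range lower bound (epoch ms)
}

function buildSearchParams(query: EventQuery) {
  const filters: string[] = [];
  if (query.source) filters.push(`source:=${query.source}`);
  if (query.level) filters.push(`level:=${query.level}`);
  if (query.sinceMs !== undefined) filters.push(`timestamp:>${query.sinceMs}`);
  return {
    q: query.q || "*",                          // "*" matches everything
    query_by: "action,error",                   // text fields; metadata is reachable via auto-embedding
    facet_by: "source,component,level,success", // drives the filter chips
    filter_by: filters.join(" && "),
    sort_by: "timestamp:desc",                  // newest first
  };
}
```

Facet counts returned alongside the hits are what populate the filter-chips component without a second query.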
Existing pages enhanced
- /network: Already data-driven. Add sparklines for pod restart counts, uptime trends.
- /syslog: Replace manual slog entries with auto-collected otel_events filtered to infrastructure actions.
- /dashboard: Add pipeline health widgets using design system components.
4. Sentry Role (and “Sentry at Home” Decision)
Sentry is adopted as a secondary signal for exception tracking, stack traces, and distributed tracing UX. It is not the system of record for joelclaw observability; Typesense + Convex + joelclaw.com remain canonical.
For this system, self-hosted Sentry is deferred until there is an explicit hard requirement (air-gapped ops, policy requirement, or sustained traffic that justifies dedicated ops overhead). Current rationale:
- Sentry self-hosted is positioned as low-volume / proof-of-concept deployment and requires meaningful host resources (minimum 4 CPU, 16 GB RAM + 16 GB swap, 20 GB disk).
- Self-hosted releases are monthly CalVer snapshots, with regular upgrade pressure and expected downtime during upgrades.
- Sentry self-hosted docs do not provide direct scaling guidance for custom Kubernetes topologies; maintenance burden shifts fully to us.
Decision:
- Now: instrument Sentry SDKs for web + worker/gateway paths where it accelerates diagnosis.
- Later (optional): deploy self-hosted Sentry only on a dedicated host profile and explicit ops runbook, not on the main single-node control-plane by default.
Consequences
Positive
- Silent failures become impossible — every subsystem emits structured events, agent monitors continuously
- Agent can self-diagnose using `joelclaw otel` CLI and Typesense queries — no human grepping
- Consistent UI via design system — new pages take hours instead of days
- High cardinality metadata enables ad-hoc investigation (“show me all events for deployment dpl_xxx”)
- Mobile-first means Joel can glance at system health from anywhere
- Storage is cheap — keep everything, search it later
Negative
- Instrumentation effort — every function/webhook/pipeline needs event emission added
- Convex write volume increases (mitigated: only warn+ events, 30m rolling watch window)
- Design system bootstrap is upfront work before it pays off
- Two storage backends for events (Typesense + Convex) adds complexity
Risks
- Alert fatigue if thresholds are too sensitive — start conservative, tune with echo/fizzle-style feedback
- Circular dependency: o11y pipeline monitors itself. Mitigation: fatal alerts go direct to Telegram (bypass Inngest)
Implementation Plan
Phase 0: Contracts and Guardrails (day 1)
- Add canonical event contract in `packages/system-bus/src/observability/otel-event.ts` (new), including runtime validation.
- Add emitter helpers in `packages/system-bus/src/observability/emit.ts` (new) with severity mapping and source/component conventions.
- Add storage adapter in `packages/system-bus/src/observability/store.ts` (new) that dual-writes:
  - Typesense collection `otel_events` (full retention window)
  - Convex `contentResources` type `otel_event` (warn+ rolling window)
- Document env contract in `.env.example` / ops docs: `OTEL_EVENTS_ENABLED`, `OTEL_EVENTS_CONVEX_WINDOW_HOURS`, `SENTRY_DSN` (optional), `SENTRY_ENVIRONMENT`
- Rollout gate: no ingestion unless contract + adapter tests pass.
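The runtime validation in the contract could be as small as a type guard. A minimal sketch — the real `otel-event.ts` may use a schema library instead; this plain guard is illustrative:

```typescript
const LEVELS = new Set(["debug", "info", "warn", "error", "fatal"]);

// Reject malformed events at ingestion so bad payloads never reach storage.
function isValidOtelEvent(input: unknown): boolean {
  if (typeof input !== "object" || input === null) return false;
  const e = input as Record<string, unknown>;
  return (
    typeof e.id === "string" &&
    typeof e.timestamp === "number" &&
    LEVELS.has(e.level as string) &&
    typeof e.source === "string" &&
    typeof e.component === "string" &&
    typeof e.action === "string" &&
    typeof e.success === "boolean" &&
    typeof e.metadata === "object" &&
    e.metadata !== null
  );
}
```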
Phase 1: Worker and Gateway Instrumentation (week 1)
- Worker:
  - Add event emission at ingress/egress in `packages/system-bus/src/serve.ts`.
  - Wrap critical functions in the `packages/system-bus/src/inngest/functions/index.ts` registration path (start/success/fail envelopes).
  - Add explicit instrumentation in high-impact functions:
    - `packages/system-bus/src/inngest/functions/observe.ts`
    - `packages/system-bus/src/inngest/functions/heartbeat.ts`
    - `packages/system-bus/src/inngest/functions/check-system-health.ts`
    - `packages/system-bus/src/inngest/functions/content-sync.ts`
- Gateway:
  - Emit drain / queue / send outcomes from:
    - `packages/gateway/src/channels/redis.ts`
    - `packages/gateway/src/command-queue.ts`
    - `packages/gateway/src/channels/telegram.ts`
    - `packages/gateway/src/daemon.ts`
- Add backpressure + drop protections (sampling for debug-level chatter) in `packages/system-bus/src/observability/store.ts`.
- Rollout gate: event volume < configured threshold, no queue regressions on gateway.
Phase 2: Query Surfaces (week 1)
- Web API:
  - Add `apps/web/app/api/otel/route.ts` (new) for typed query/filter over `otel_events`.
- CLI:
  - Add `packages/cli/src/commands/otel.ts` (new) with:
    - `joelclaw otel list`
    - `joelclaw otel search`
    - `joelclaw otel stats`
  - Register command in `packages/cli/src/cli.ts`.
- Add schema docs and examples under `apps/web/content/adrs/0087...` + runbook.
- Rollout gate: on-call triage possible from CLI alone (no direct DB access).
Phase 3: UI and Design System (week 2)
- Expand shared UI in `packages/ui/src/`:
  - `status-badge.tsx` (new)
  - `metric-card.tsx` (new)
  - `event-timeline.tsx` (new)
  - `filter-chips.tsx` (new)
- Add observability pages:
  - `apps/web/app/system/page.tsx` (new)
  - `apps/web/app/system/events/page.tsx` (new)
- Migrate existing pages to shared components:
  - `apps/web/app/syslog/page.tsx`
  - `apps/web/app/network/page.tsx`
  - `apps/web/app/dashboard/page.tsx`
- Rollout gate: mobile render + auth checks + query latency SLOs met.
Phase 4: Agent Loop and Escalation (week 2-3)
- Add error-rate evaluator function in `packages/system-bus/src/inngest/functions/check-system-health.ts` (or a dedicated new `check-otel.ts` function).
- Wire gateway notification path through:
  - `packages/system-bus/src/inngest/middleware/gateway.ts`
  - `packages/gateway/src/commands/telegram-handler.ts`
- Fatal-event fast path bypasses normal batching and posts an immediate Telegram alert.
- Rollout gate: synthetic fatal event reaches Telegram within SLA.
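The fast-path rule is a small routing decision: fatal events skip the batch digest and go straight to Telegram, while everything the agent can handle stays out of Joel's inbox. Route names and the helper are illustrative:

```typescript
type Route = "telegram-immediate" | "batch-digest" | "agent-only";

function routeAlert(level: string, agentCanAutoFix: boolean): Route {
  if (level === "fatal") return "telegram-immediate"; // bypass batching entirely
  if ((level === "error" || level === "warn") && !agentCanAutoFix) return "batch-digest";
  return "agent-only"; // agent investigates; no human escalation
}
```

Keeping the fatal branch free of any Inngest dependency is what mitigates the self-monitoring circular-dependency risk noted above.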
Phase 5: Sentry Integration (optional, parallel)
- Web SDK integration:
  - `apps/web/sentry.client.config.ts` (new)
  - `apps/web/sentry.server.config.ts` (new)
  - `apps/web/sentry.edge.config.ts` (new)
  - `apps/web/next.config.js` Sentry plugin wiring
- Worker/gateway: optional `@sentry/node` init in:
  - `packages/system-bus/src/serve.ts`
  - `packages/gateway/src/daemon.ts`
- Keep Sentry as secondary sink: do not replace Typesense/Convex ingestion paths.
- Self-hosted Sentry only after a separate infra ADR addendum with host sizing, backup, upgrade, and rollback runbook.
Verification
- Typesense `otel_events` collection is auto-created on first write and accepts worker + gateway instrumentation events.
- `joelclaw otel list`, `joelclaw otel search`, and `joelclaw otel stats` are implemented and wired into the CLI.
- Warn/error/fatal events mirror to Convex `contentResources` as `otel_event` for real-time UI surfaces.
- Convex rolling window is enforced with `OTEL_EVENTS_CONVEX_WINDOW_HOURS` (default `0.5` = 30 minutes) and opportunistic prune on high-severity writes.
- `/system` renders a mobile-first health summary + event feed from the new `/api/otel` API.
- `/system/events` supports full-text search and facet filters for `source` and `level`.
- Heartbeat/system health now queries `otel_events` for recent error-rate escalation.
- Fatal path uses immediate Telegram signaling and bypasses the normal batch digest delay in the gateway Redis drain.
- `packages/ui` shared components (`status-badge`, `metric-card`, `event-timeline`, `filter-chips`) are consumed by `/system`, `/syslog`, `/network`, and `/dashboard`.
- No Grafana dependency was added.
- Canonical write path is centralized in `packages/system-bus/src/observability/{otel-event.ts,emit.ts,store.ts}`.
- Sentry remains optional + secondary (`SENTRY_DSN` / `SENTRY_ENVIRONMENT`), and no self-hosted Sentry infra was added.
Implementation Outcome (2026-02-21)
Completed
- Implemented canonical observability contract and storage adapters under `packages/system-bus/src/observability/`.
- Added worker ingest endpoint (`/observability/emit`) so gateway emissions go through the single worker write path.
- Instrumented worker and gateway hot paths listed in this ADR with debug flood guardrails.
- Added owner-authenticated web query API at `apps/web/app/api/otel/route.ts`.
- Added CLI surface `joelclaw otel {list|search|stats}`.
- Added `/system` and `/system/events` pages and reused shared UI components across existing pages.
- Wired error-rate escalation in `check-system-health` with immediate fatal Telegram path.
Exact Paths Touched
- `packages/system-bus/src/observability/otel-event.ts`
- `packages/system-bus/src/observability/emit.ts`
- `packages/system-bus/src/observability/store.ts`
- `packages/system-bus/src/observability/otel-event.test.ts`
- `packages/system-bus/src/observability/store.test.ts`
- `packages/system-bus/src/lib/typesense.ts`
- `packages/system-bus/src/serve.ts`
- `packages/system-bus/src/inngest/functions/index.ts`
- `packages/system-bus/src/inngest/functions/observe.ts`
- `packages/system-bus/src/inngest/functions/heartbeat.ts`
- `packages/system-bus/src/inngest/functions/check-system-health.ts`
- `packages/system-bus/src/inngest/functions/content-sync.ts`
- `packages/gateway/src/observability.ts`
- `packages/gateway/src/channels/redis.ts`
- `packages/gateway/src/command-queue.ts`
- `packages/gateway/src/channels/telegram.ts`
- `packages/gateway/src/daemon.ts`
- `apps/web/app/api/otel/route.ts`
- `packages/cli/src/commands/otel.ts`
- `packages/cli/src/cli.ts`
- `packages/ui/src/status-badge.tsx`
- `packages/ui/src/metric-card.tsx`
- `packages/ui/src/event-timeline.tsx`
- `packages/ui/src/filter-chips.tsx`
- `apps/web/app/system/page.tsx`
- `apps/web/app/system/events/page.tsx`
- `apps/web/app/syslog/page.tsx`
- `apps/web/app/network/page.tsx`
- `apps/web/app/dashboard/page.tsx`
- `apps/web/components/site-header.tsx`
- `apps/web/components/mobile-nav.tsx`
- `apps/web/app/api/search/route.ts`
- `apps/web/app/api/typesense/[collection]/route.ts`
- `apps/web/app/api/typesense/[collection]/[id]/route.ts`
References
- ADR-0006: Prometheus + Grafana (superseded — wrong era, wrong stack)
- ADR-0033: VictoriaMetrics + Grafana (superseded — Grafana unnecessary, joelclaw.com is the surface)
- ADR-0082: Typesense unified search (storage backend for events)
- ADR-0084: Unified contentResources (Convex real-time layer)
- ADR-0085: Data-driven network page (pattern for ISR + Convex Server Components)
- ADR-0075: Better Auth + Convex (owner-only auth for dashboards)
- shadcn registry docs
- Vercel composition patterns skill
- Sentry self-hosted docs
- Sentry self-hosted releases/upgrades
- Sentry Node OpenTelemetry support
Notes
Q&A (Joel, 2026-02-21)
- Biggest pain: Silent failures. System fails quietly, discovered by accident hours later.
- Consumer: Agent-first. Self-diagnose + auto-fix. Escalate exceptions to Joel.
- Surface: joelclaw.com. Mobile-first. Next.js cached components. No Grafana.
- Scope: Full stack. Structured logs. High cardinality. Plenty of storage — use it.
- Design system: shadcn registry in monorepo. Consistent component library across all pages.