Langfuse as an LLM-Only Observability Plane
Context
We need deeper observability for LLM usage only (prompt/input-output traces, model latency, token/cost usage, eval workflow), without replacing existing joelclaw observability.
Current state:
- Canonical system observability already exists via ADR-0087 (otel_events in Typesense + Convex/UI/CLI surfaces).
- joelclaw runtime is event-first (Inngest + gateway + OTEL events), not APM-first.
- Most LLM calls are made through pi subprocesses in CLI and worker code, with one direct Anthropic HTTP call in transcript-process.ts.
- Current cluster capacity is a single node (4 CPU, ~8 GiB RAM) with running workloads (inngest, worker, redis, typesense, pds, livekit).
The question is not “replace observability,” but “add a dedicated LLM observability plane with strict boundaries.”
Research Summary (top-to-bottom)
1) Langfuse self-host architecture is production-grade but infra-heavy
Langfuse v3 self-host requires:
- langfuse-web
- langfuse-worker
- Postgres (OLTP)
- ClickHouse (OLAP, mandatory)
- Redis/Valkey (queue + cache)
- S3/blob store (event/object persistence)
Key requirement details:
- ClickHouse is mandatory (no Postgres-only mode in v3).
- Redis queue behavior expects maxmemory-policy noeviction.
- For OTEL ingest, Langfuse supports an HTTP/protobuf endpoint (/api/public/otel), not gRPC.
- Health endpoints exist for web and worker (/api/public/health, /api/public/ready, /api/health).
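Should LLM spans ever be exported over OTLP rather than through a Langfuse SDK, the HTTP endpoint above would need a full traces URL and an auth header. The following is a minimal sketch, assuming Basic auth built from the project's public and secret keys and the signal-specific /v1/traces suffix; the function name and the OtlpHttpSettings shape are hypothetical and not part of the repo.

```typescript
// Hypothetical settings shape for an OTLP/HTTP trace exporter.
interface OtlpHttpSettings {
  url: string;
  headers: Record<string, string>;
}

// Builds the endpoint URL and Basic auth header for Langfuse OTEL ingest.
// Assumption: credentials are sent as base64("publicKey:secretKey").
function langfuseOtlpSettings(
  baseUrl: string,
  publicKey: string,
  secretKey: string,
): OtlpHttpSettings {
  // Drop trailing slashes so the path is not doubled up.
  const trimmed = baseUrl.replace(/\/+$/, "");
  return {
    url: `${trimmed}/api/public/otel/v1/traces`,
    headers: {
      Authorization: `Basic ${btoa(`${publicKey}:${secretKey}`)}`,
    },
  };
}
```

The result is plain data, so it can be handed to whichever HTTP/protobuf exporter is in use; a gRPC exporter would not work against this endpoint.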
2) Minimum published sizing is above our current node footprint
Langfuse minimum guidance (self-host scaling docs) is roughly:
- Web: 2 CPU / 4 GiB
- Worker: 2 CPU / 4 GiB
- Postgres: 2 CPU / 4 GiB
- Redis: 1 CPU / 1.5 GiB
- ClickHouse: 2 CPU / 8 GiB
- Blob store: managed S3 or MinIO
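Summed, these minimums come to roughly 9 CPU and 21.5 GiB of RAM before blob storage. That is well above the single 4 CPU / ~8 GiB node, so Langfuse cannot be co-located with existing joelclaw services on current control-plane capacity.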
3) Scope fit is strong if we keep strict boundaries
Langfuse is a good fit for:
- generation-level traces
- prompt/version lineage
- model/provider/latency/token/cost visibility
- LLM-focused analysis and eval UX
Langfuse is not needed for:
- infra health
- webhook/gateway plumbing telemetry
- non-LLM pipelines
Those stay in ADR-0087 OTEL/Typesense.
4) Licensing and feature split
- Core Langfuse OSS is MIT with full core tracing APIs.
- Some admin/security features are EE via license key (RBAC expansions, audit logs, server-side ingestion masking, SCIM/org management APIs, etc.).
- LLM-only observability goal does not require EE for initial adoption.
5) Alternatives considered
- Status quo + custom OTEL LLM fields only
  - Lowest ops load
  - Misses dedicated prompt/eval/tracing workflows
- Self-host Langfuse (chosen)
  - Best product fit for LLM usage debugging
  - Higher ops load and infra footprint
- Arize Phoenix
  - Strong eval tooling, self-hostable
  - ELv2 license (different OSS posture from MIT) and less direct fit with current desired product workflow
- LangSmith self-host
  - Enterprise-gated self-host model; not aligned with current self-host-first preference
Decision
Adopt Langfuse as a separate LLM-only observability plane with hard boundaries and phased deployment:
- Langfuse is scoped to LLM usage only.
- ADR-0087 OTEL/Typesense remains canonical for system observability.
- No non-LLM spans/events are sent to Langfuse.
- Rollout is phased: hosted Langfuse Cloud first (to start instrumentation now), then full self-host after hardware expansion.
- Self-host phase must not contend with existing single-node control-plane capacity; use dedicated infra (new node or external managed datastore topology).
- All LLM instrumentation must fail-open (Langfuse outages cannot block command/function execution).
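The fail-open requirement can be sketched as a wrapper around any LLM call. This is a minimal illustration, not the helper in the repo: LlmTraceSink, LlmTraceEntry, and traceFailOpen are hypothetical names, and the sink stands in for whatever client actually talks to Langfuse.

```typescript
// Hypothetical record of one finished LLM call.
interface LlmTraceEntry {
  name: string;
  ok: boolean;
  durationMs: number;
  error?: string;
}

// Hypothetical sink: anything that can persist a trace entry.
interface LlmTraceSink {
  record(entry: LlmTraceEntry): Promise<void>;
}

// Runs the LLM call and reports the outcome to the sink.
// Sink failures are swallowed so tracing never changes the call's result.
async function traceFailOpen<T>(
  name: string,
  call: () => Promise<T>,
  sink: LlmTraceSink,
): Promise<T> {
  const started = Date.now();
  const report = async (ok: boolean, error?: string): Promise<void> => {
    try {
      await sink.record({ name, ok, durationMs: Date.now() - started, error });
    } catch {
      // Fail-open: tracing errors are dropped, never rethrown.
    }
  };

  try {
    const result = await call();
    await report(true);
    return result;
  } catch (err) {
    await report(false, err instanceof Error ? err.message : String(err));
    throw err; // The original LLM error still propagates unchanged.
  }
}
```

Because the sketch awaits the sink, a hung sink would still delay the caller; a production wrapper would likely add a timeout or report without awaiting.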
Boundary Contract
In-scope for Langfuse
- pi-backed inference calls used for triage/rewrite/summarization/classification
- direct provider calls (Anthropic/OpenAI/etc.)
- Inngest step.ai model invocations where usage is available
Out-of-scope for Langfuse
- gateway queue drain events
- webhook verification events
- k8s/service health checks
- storage/network/infra diagnostics
- generic OTEL event stream mirroring
Correlation fields required on every Langfuse trace
- joelclaw.component
- joelclaw.action
- joelclaw.event_id (if present)
- joelclaw.run_id (Inngest run id when available)
- joelclaw.session_id (gateway/cli session where applicable)
- environment (dev/prod)
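One way to keep these fields consistent across callsites is a single builder that every wrapper uses. The sketch below takes the field names from the list above; the CorrelationInput type and buildCorrelationMetadata function are hypothetical, and attaching the result as trace metadata is an assumption about how the wrappers would use it.

```typescript
// Hypothetical input type; only the output keys come from the contract above.
interface CorrelationInput {
  component: string;
  action: string;
  environment: "dev" | "prod";
  eventId?: string;
  runId?: string;
  sessionId?: string;
}

// Builds the flat metadata map attached to every Langfuse trace.
function buildCorrelationMetadata(input: CorrelationInput): Record<string, string> {
  const metadata: Record<string, string> = {
    "joelclaw.component": input.component,
    "joelclaw.action": input.action,
    environment: input.environment,
  };
  // Optional identifiers are attached only when the caller actually has them.
  if (input.eventId) metadata["joelclaw.event_id"] = input.eventId;
  if (input.runId) metadata["joelclaw.run_id"] = input.runId;
  if (input.sessionId) metadata["joelclaw.session_id"] = input.sessionId;
  return metadata;
}
```

Making component, action, and environment required at the type level means a callsite that omits them fails to compile rather than producing an uncorrelatable trace.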
Implementation Plan
Phase 0 — Infra preflight + deployment topology
- Provision Langfuse on dedicated capacity (not current overloaded control-plane):
  - either separate k8s node/namespace (langfuse), or
  - managed Postgres/ClickHouse/S3 + dedicated Redis/Valkey with noeviction
- Add deployment config in repo:
  - k8s/langfuse-values.yaml (new)
  - k8s/deploy-langfuse.sh (new)
- Add secret contract docs:
  - Langfuse keys/host
  - storage/database/redis credentials
Phase 1 — Instrumentation foundation (LLM-only)
- Add shared helper wrappers:
  - packages/system-bus/src/lib/llm-observe.ts (new)
  - packages/cli/src/llm-observe.ts (new)
- For pi subprocess paths, switch to --mode json in wrappers to capture provider/model/usage/cost from final events.
- Emit both:
  - Langfuse trace/generation records (LLM plane)
  - existing OTEL event summary (llm.call.completed|failed) for cross-plane diagnosis
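Capturing usage from JSON-mode output could look roughly like the sketch below. The event schema of pi --mode json is not specified in this ADR, so the shape here is an assumption: one JSON object per line, with the final usage-bearing event exposing provider, model, and usage fields. Those field names, and the function and type names, are hypothetical.

```typescript
// Hypothetical summary pulled from the last usage-bearing event.
interface LlmUsageSummary {
  provider?: string;
  model?: string;
  usage: Record<string, unknown>;
}

function asString(value: unknown): string | undefined {
  return typeof value === "string" ? value : undefined;
}

// Scans line-delimited JSON output and returns the last event carrying a
// `usage` object. Assumed field names: provider, model, usage.
function extractUsage(stdout: string): LlmUsageSummary | undefined {
  let last: LlmUsageSummary | undefined;
  for (const line of stdout.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    let event: unknown;
    try {
      event = JSON.parse(trimmed);
    } catch {
      continue; // Non-JSON noise is skipped rather than failing the call.
    }
    if (typeof event !== "object" || event === null || Array.isArray(event)) continue;
    const record = event as Record<string, unknown>;
    const usage = record["usage"];
    if (typeof usage === "object" && usage !== null && !Array.isArray(usage)) {
      last = {
        provider: asString(record["provider"]),
        model: asString(record["model"]),
        usage: usage as Record<string, unknown>,
      };
    }
  }
  return last;
}
```

Returning undefined when no usage event is found lets the wrapper still emit a trace and the OTEL summary without token or cost fields, rather than treating missing usage as an error.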
Phase 2 — Pilot callsites (high-signal first)
Pilot on:
- packages/cli/src/commands/recall.ts (query rewrite)
- packages/system-bus/src/observability/triage.ts (LLM classifier)
- packages/system-bus/src/inngest/functions/transcript-process.ts (direct Anthropic vision call)
Phase 3 — Expand to remaining pi callsites
Migrate LLM subprocess callsites in:
- packages/system-bus/src/inngest/functions/check-email.ts
- packages/system-bus/src/inngest/functions/task-triage.ts
- packages/system-bus/src/inngest/functions/observe.ts
- packages/system-bus/src/inngest/functions/reflect.ts
- packages/system-bus/src/inngest/functions/promote.ts
- packages/system-bus/src/inngest/functions/memory/batch-review.ts
- packages/system-bus/src/inngest/functions/content-sync.ts
- packages/system-bus/src/inngest/functions/vip-email-received.ts
- packages/system-bus/src/inngest/functions/daily-digest.ts (step.ai path)
Phase 4 — Ops + guardrails
- Add health checks and alerts for Langfuse web/worker readiness.
- Add sampling/masking policy (PII-safe) before production rollout.
- Enforce span allowlist (LLM scopes only) to prevent scope creep.
- Document rollback switch: JOELCLAW_LLM_OBS_ENABLED=0.
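The allowlist and the rollback switch can share a single gate in front of the Langfuse client. The sketch below is illustrative only: the function names are hypothetical, prefix matching is an assumed allowlist mechanism, and treating an unset flag as enabled is an assumption this ADR does not settle.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: instrumentation is on unless the flag is explicitly "0".
function isLlmObsEnabled(env: Env): boolean {
  return env["JOELCLAW_LLM_OBS_ENABLED"] !== "0";
}

// Returns true only when instrumentation is enabled and the span name
// matches one of the allowed LLM-scope prefixes.
function shouldSendToLangfuse(
  spanName: string,
  allowedPrefixes: readonly string[],
  env: Env,
): boolean {
  if (!isLlmObsEnabled(env)) return false;
  return allowedPrefixes.some((prefix) => spanName.startsWith(prefix));
}
```

Passing the environment in as a plain object, rather than reading a global, keeps the gate easy to test and makes the rollback path explicit at each callsite.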
Acceptance Criteria
- Langfuse receives traces for pilot LLM callsites with model/latency/token/cost metadata.
- No non-LLM system events appear in Langfuse.
- Existing OTEL pipeline remains unchanged and fully functional.
- LLM call execution remains successful when Langfuse is unavailable (fail-open verified).
- Each Langfuse trace is correlatable to joelclaw run/session/event identifiers.
- Dedicated infra deployment does not degrade existing joelclaw workloads.
Verification Commands
- joelclaw status
- joelclaw inngest status
- joelclaw gateway status
- curl -fsS http://<langfuse-web>/api/public/health
- curl -fsS http://<langfuse-web>/api/public/ready
- curl -fsS http://<langfuse-worker>/api/health
- joelclaw otel search "llm.call" --hours 24
Non-Goals
- Replacing ADR-0087 OTEL/Typesense as system observability source of truth.
- Sending full infra/app spans into Langfuse.
- Re-architecting all model execution into a single gateway in this ADR.
Consequences
Positive
- Dedicated LLM debugging workflow without polluting system observability.
- Better visibility into model usage/cost regressions and prompt behavior.
- Preserves existing joelclaw o11y architecture and CLI surfaces.
Negative / Risks
- Significant infra overhead for self-hosting.
- Requires disciplined scope enforcement to avoid dual-observability sprawl.
- Existing pi subprocess calls currently hide usage unless migrated to a JSON-mode wrapper.
Rollback
- Disable instrumentation via env flag (JOELCLAW_LLM_OBS_ENABLED=0).
- Keep OTEL summaries only.
- Scale down/remove Langfuse deployment after confirming no runtime dependency remains.
References
- ADR-0087: Full-Stack Observability + JoelClaw Design System
- Langfuse self-hosting architecture and deployment docs (/self-hosting)
- Langfuse scaling guide (/self-hosting/configuration/scaling)
- Langfuse ClickHouse requirements (/self-hosting/deployment/infrastructure/clickhouse)
- Langfuse Redis/cache requirements (/self-hosting/deployment/infrastructure/cache)
- Langfuse OTEL ingest docs (/integrations/native/opentelemetry)
- Langfuse health/readiness docs (/self-hosting/configuration/health-readiness-endpoints)
- Langfuse license key split (/self-hosting/license-key)
More Information
- 2026-02-21: Operator directive changed rollout sequence to hosted-first (Langfuse Cloud) while keeping this ADR’s LLM-only boundary contract intact.
- Self-hosted deployment remains the target state after new hardware capacity is available.
- Secrets for hosted phase were stored via secrets CLI as langfuse_secret_key, langfuse_public_key, and langfuse_base_url.
- 2026-02-21: Phase 1 pilot started in packages/cli/src/commands/recall.ts with Langfuse generation traces for query rewrite (provider/model/usage/cost captured from pi --mode json).
- 2026-02-22: Hosted rollout expanded in @joelclaw/system-bus with shared Langfuse LLM tracing helpers and instrumentation added to major inference paths (observability/triage, check-email, task-triage, observe, reflect, memory/batch-review, content-sync, promote, vip-email-received, daily-digest, transcript-process, media-process, agent-dispatch for tool=pi).
- 2026-02-22: Post-rollout validation confirmed new trace names in hosted Langfuse, including joelclaw.agent-dispatch.
- 2026-02-22: Remaining step.ai.infer callsite inventory in @joelclaw/system-bus reduced to daily-digest; callsite now emits Langfuse traces on both success and failure with inferred provider/model and extracted usage token fields when available.
- 2026-02-22: Added CI guardrail to prevent untraced step.ai.infer additions (scripts/validate-llm-observability-guards.ts, enforced via shared workflow .github/workflows/agent-contracts.yml), enforcing nearby traceLlmGeneration coverage.
- 2026-02-22: Added joelclaw langfuse aggregate CLI surface for project-level cloud trace rollups (cost/latency/signature trends) so Langfuse + OTEL + local logs can be queried through one agent-facing CLI.
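The "nearby coverage" idea behind the CI guardrail can be illustrated with a simple text scan. This is not the logic of scripts/validate-llm-observability-guards.ts, which is not reproduced here; the function name and the 20-line window are hypothetical choices for the sketch.

```typescript
// Returns 1-based line numbers of step.ai.infer calls that have no
// traceLlmGeneration reference within `window` lines on either side.
// The window size is an assumed heuristic, not the real guard's value.
function findUntracedInferCalls(source: string, window = 20): number[] {
  const lines = source.split("\n");
  const untraced: number[] = [];
  lines.forEach((line, index) => {
    if (!line.includes("step.ai.infer")) return;
    const start = Math.max(0, index - window);
    const end = Math.min(lines.length, index + window + 1);
    const nearby = lines.slice(start, end);
    if (!nearby.some((candidate) => candidate.includes("traceLlmGeneration"))) {
      untraced.push(index + 1);
    }
  });
  return untraced;
}
```

A CI step built on this would fail when the returned list is non-empty. A text scan of this kind is cheap but approximate: it can be satisfied by a comment mentioning traceLlmGeneration, so an AST-based check would be stricter.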
Status
Accepted.