Supervision Trees That Diagnose Their Own Failures
Skill-based LLM monitoring inside OTP maps to Koko's supervision tree and mirrors joelclaw's own skill architecture
Threshold alerts tell you that memory spiked. They don’t tell you why. Beamlens by Bradley Golden drops an LLM directly into your OTP supervision tree as a child process, gives it read access to the BEAM’s internal state — ETS distributions, scheduler utilization, allocator fragmentation, garbage collection stats — and lets it investigate causes instead of flagging symptoms. Production-safe by design: everything is read-only, no side effects, data stays in your infrastructure.
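The deployment model can be sketched as an ordinary OTP supervision tree. Only the supervised-child idea comes from the project description; the exact `Beamlens` child-spec shape and the `skills:` option here are assumptions:

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      MyApp.Endpoint,
      # The LLM monitor runs as a sibling child under the same
      # supervisor, giving it in-VM access to ETS, GC, scheduler,
      # and allocator state that external APM agents can't reach.
      # (Child-spec shape is an assumption, not the library's API.)
      {Beamlens, skills: [:beam, :ets, :gc, :anomaly]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

Because the monitor is just another child, a crash in the diagnostic layer is handled by the same restart strategy as the rest of the app.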
The architecture is a Coordinator-Operator pattern where each monitoring capability is a skill with its own system prompt and snapshot function. Fourteen built-in skills cover BEAM VM health, memory allocators, anomaly detection, ETS tables, GC, log analysis, OS metrics, overload detection, ports, supervisor trees, function tracing via Recon, and system events. A custom skill is a @behaviour implementation — define a system_prompt/0 and a snapshot/0, and Beamlens weaves it into the investigation. The Anomaly skill learns your baseline and auto-triggers investigations when it detects statistical anomalies, rate-limited to prevent runaway LLM costs.
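The two callbacks named above keep a custom skill small. A sketch, assuming the behaviour module is called `Beamlens.Skill` (the callback names `system_prompt/0` and `snapshot/0` come from the project; the module name is an assumption):

```elixir
defmodule MyApp.Skills.EtsGrowth do
  # Assumed behaviour module name; the two callbacks are the ones
  # the project documents for custom skills.
  @behaviour Beamlens.Skill

  @impl true
  def system_prompt do
    """
    You watch ETS memory growth in this application's caches.
    Report observed numbers as facts; label any cause you infer
    from them as speculation.
    """
  end

  @impl true
  def snapshot do
    # Read-only BEAM introspection: per-table size and memory, in
    # words, for every named ETS table in the VM.
    for tab <- :ets.all(), is_atom(tab) do
      %{table: tab, size: :ets.info(tab, :size), memory_words: :ets.info(tab, :memory)}
    end
  end
end
```

The snapshot function does no mutation — it only reads `:ets` metadata — which is what keeps a skill consistent with the library's read-only safety posture.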
What makes this clever: it uses BAML for type-safe LLM prompts with intent decomposition that separates fact from speculation. The Lua sandbox handles safe execution. Multiple LLM providers are supported — Anthropic, OpenAI, Gemini, Ollama for local, AWS Bedrock, the works. A typical investigation costs one to three cents with Haiku. Early development, v0.3.1, Apache-2.0. The bet here is that LLM reasoning applied to runtime internals that external APM tools can’t see produces better incident diagnosis than any amount of static rules.
Key Ideas
- LLM as supervised child process — monitoring intelligence lives inside the app, not outside it, with access to internal state that external tools can’t reach
- Skill-based monitoring architecture — each capability (BEAM health, ETS, GC, allocators, anomaly detection) is a pluggable skill with its own system prompt and snapshot, composable at the supervision tree level
- Fact vs speculation decomposition — BAML type-safe prompts separate what the LLM observed from what it inferred, keeping diagnosis honest
- Auto-trigger on statistical anomaly — the Anomaly skill learns baselines and triggers investigations without human intervention, rate-limited by default
- Read-only production safety — no side effects, no mutations, no backend phone-home; data stays in your infrastructure
- Lua sandbox for safe execution — execution boundaries that prevent the diagnostic layer from affecting the system it’s diagnosing
- Coordinator-Operator pattern — separates investigation orchestration from data collection, with pluggable strategies (AgentLoop vs Pipeline)
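The fact-vs-speculation split above can be pictured as a typed result rather than free-form LLM prose. This is illustrative only — a plain-Elixir rendering of the idea, not Beamlens's actual BAML types:

```elixir
# Illustrative struct: separating what was observed (facts, tied to
# snapshot data) from what was inferred (speculation) keeps the
# diagnosis auditable. Field names are assumptions.
defmodule Diagnosis do
  @enforce_keys [:facts, :speculation]
  defstruct facts: [], speculation: []
end

%Diagnosis{
  facts: [
    "binary allocator carriers grew 4x over 10 minutes",
    "run queue and scheduler utilization stayed flat"
  ],
  speculation: [
    "refc binaries likely pinned by a long-lived process"
  ]
}
```

An operator can then verify the facts against raw snapshots before acting on the speculation.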