AST-Based Message Formatting via unified/remark
History of Channel Message Formatting
The Telegram formatting pipeline has evolved through several iterations, each adding complexity to a fundamentally regex-based approach:
5af27e4- Initial Telegram channel: grammY bot with basic md→HTML, chunking, outbound routingfda48b6(ADR-0069) - Improved Telegram formatting + smart notification filtering4fb1959(ADR-0070) - Inline keyboards + callback handler.send()gainsbuttons,silent,noPreviewoptions97df5c1- Fix: escape HTML entities before markdown transforms (first escaping bug)35aa12a(ADR-0104) - Priority message queue, dedup, Telegram HTML validation. AddedisWellFormedTelegramHtml()validator +stripHtmlTags()fallback1150def- Fix: protect existing HTML tags from double-escaping (second escaping bug, today). Added placeholder protection for valid Telegram tags beforeescapeHtml()
Each fix adds another layer of regex protection. The mdToTelegramHtml() function is now ~80 lines of interleaved placeholder extraction, escaping, regex transforms, and placeholder restoration. ADR-0131 (Unified Channel Intelligence Pipeline) will add Slack and Discord channels, each needing their own format rules.
Related ADRs:
- ADR-0069 - Telegram formatting + notification filtering
- ADR-0070 - Telegram Bot API upgrade (inline keyboards, rich send)
- ADR-0086 - Gateway phases 5-9 (outbound routing)
- ADR-0104 - Priority queue, dedup, HTML validation
- ADR-0131 - Unified channel pipeline (adds Slack, Discord)
Reference implementation: vercel/chat (Chat SDK) - packages/chat/src/markdown.ts + per-adapter FormatConverter classes. Uses unified/remark for markdown→mdast parsing, each adapter walks the AST to emit platform-native format. Pattern borrowed, not the dependency.
Context
The gateway currently uses fragile regex-based conversion in mdToTelegramHtml() (packages/gateway/src/channels/telegram.ts). This function:
- Protects code blocks/links with placeholders
- Runs
escapeHtml()on everything (which escapes valid HTML tags from LLM responses) - Applies regex-based markdown→HTML transforms
- Restores placeholders
This just broke - valid HTML tags from LLM responses got double-escaped (commit 1150def fix). The regex approach is inherently fragile: every new edge case requires another placeholder/regex rule.
As joelclaw expands to more channels (ADR-0131: Slack, Discord, iMessage), each will need its own formatting rules. Regex converters per platform don’t scale.
Decision
Adopt the AST-based format converter pattern (inspired by vercel/chat) using the unified/remark ecosystem. Own the code, not the dependency.
Architecture
LLM response (markdown)
→ remark-parse → mdast AST (canonical representation)
→ TelegramConverter.fromAst() → Telegram HTML
→ SlackConverter.fromAst() → Slack mrkdwn
→ DiscordConverter.fromAst() → Discord markdown
→ PlainConverter.fromAst() → stripped text
→ iMessageConverter.fromAst() → plain text (no formatting)Core Interface
import type { Root, Content } from "mdast";
interface FormatConverter {
fromAst(ast: Root): string;
toAst(platformText: string): Root;
extractPlainText(platformText: string): string;
}
// Message type - converters consume this
type PostableMessage =
| string // raw, no conversion
| { markdown: string } // parse → AST → platform format
| { ast: Root } // already parsed, just convert
| { raw: string } // raw, no conversionDependencies (lightweight)
unified- processor pipelineremark-parse- markdown → mdastremark-gfm- GFM support (tables, strikethrough)remark-stringify- mdast → markdown (for round-tripping)mdast-util-to-string- plain text extraction
These are small, well-maintained, already in the JS ecosystem. No framework dependency - just the parser and AST types.
Platform Converters
Each converter walks the mdast tree and emits platform-native formatting:
TelegramConverter (replaces mdToTelegramHtml):
strong→<b>text</b>emphasis→<i>text</i>inlineCode→<code>text</code>code→<pre><code>text</code></pre>link→<a href="url">text</a>delete→<s>text</s>blockquote→<blockquote>text</blockquote>list→• item(Telegram has no list tags)heading→<b>text</b>(Telegram has no heading tags)- Text nodes:
escapeHtml()only on text content, never on tags
SlackConverter (for ADR-0131):
strong→*text*emphasis→_text_delete→~text~link→<url|text>code→`code`blockquote→> text
DiscordConverter:
- Standard markdown passthrough (Discord supports full markdown)
- Additions: spoiler tags, user/role mentions
PlainConverter:
- Strip all formatting, extract text only
Key Design Principle
Parse once, never double-escape. The AST separates structure from text content. escapeHtml() runs only on text node values during Telegram rendering - formatting tags are emitted by the converter, never present in the input text.
This eliminates the entire class of “protect X before escaping, restore after” bugs.
Package Location
packages/gateway/src/formatting/ - not a separate package yet. Contains:
ast.ts- parseMarkdown, stringifyMarkdown, type guards, node constructorstelegram.ts- TelegramFormatConverterslack.ts- SlackFormatConverter (when ADR-0131 lands)discord.ts- DiscordFormatConverter (when needed)plain.ts- PlainFormatConvertertypes.ts- FormatConverter interface, PostableMessage type
Migration
- Add unified/remark deps to gateway package
- Implement TelegramFormatConverter
- Replace
mdToTelegramHtml()calls with converter - Delete the regex-based converter
- Add converters for other platforms as ADR-0131 progresses
Consequences
Easier
- No more double-escaping bugs - structural impossibility
- Each platform converter is testable in isolation with mdast fixtures
- Adding new platforms = one new converter class
- Round-trip capability: platform → AST → any other platform
- LLM responses can use standard markdown - no platform-specific prompting needed
Harder
- unified/remark adds ~5 small dependencies
- Converter implementations need to handle every mdast node type
- AST walking is slightly more code than regex (but much more correct)
- Testing needs mdast fixtures per platform
Inline Platform Validation (added 2026-02-25)
Each FormatConverter includes a validate(output: string): ValidationResult method that lints converted output against platform-specific rules before send. This catches malformed output at the conversion layer instead of discovering it via API 400 errors.
Telegram Validation Rules
| Rule | Severity | Description |
|---|---|---|
no-unsupported-tags | error | Only Telegram-allowed HTML tags |
balanced-tags | error | Every open tag has matching close |
no-nested-pre | error | <pre> cannot nest in <blockquote> or <pre> |
no-nested-links | error | <a> cannot nest inside <a> |
max-length | error | Chunk ≤ 4096 chars |
entity-count | warn/error | Warn >80, error >100 entities |
valid-href | warning | <a> must have non-empty href |
no-empty-tags | warning | <b></b> with no content |
ampersand-escape | warning | Bare & not entity-escaped |
Implementation: single-pass string scan with stack-based tag checker. No DOM parsing - fast enough for every message.
Codex Review (2026-02-25)
Strengths
packages/gateway/src/channels/telegram.tsalready shows strong operational guardrails (isWellFormedTelegramHtml,stripHtmlTags, and fallback send paths), which aligns with ADR-0143’s goal of avoiding silent formatting failures.- The ADR correctly identifies the core fragility in the current converter: escape/regex sequencing causes structural breakage; an AST path is the right long-term fix.
- The planned per-platform converter interfaces in
packages/gateway/src/formatting/match ADR-0131’s trajectory and should reduce regex duplication for Slack/Discord/iMessage. - The design principle of emitting tags from structure and escaping only text nodes directly addresses historical double-escape issues from
mdToTelegramHtml().
Gaps
- Telegram limit handling is not fully robust:
CHUNK_MAX = 4000leaves headroom but chunking is not aware of HTML structure, so tags or entities can be split and become invalid even when pre-chunk validation passes. chunkMessage()is purely length-based and doesn’t account for entity overhead, multiline UTF-8 boundaries, or tag boundaries, creating false negatives/positives near the 4096-char limit.isWellFormedTelegramHtml()checks the whole message before chunking; it does not ensure each chunk stays valid after splitting.mdToTelegramHtml()still depends on nested-regex transforms, so nested markdown or overlapping syntactic forms can mis-convert in ways an AST walk would avoid.- In fallback logic, HTML failures switch to plaintext but still send a single truncated chunk (
slice(0, CHUNK_MAX)) and stop, which can drop content in long messages. - Placeholder protection in existing function relies on special sentinels and can be fragile under adversarial or unusual input despite current success cases.
Risks
- AST migration without parser-aware chunking and validation can introduce a new class of runtime failures on long formatted messages: first-chunk failure then partial fallback behavior.
- Behavioral parity risk is high unless every existing regex edge case is fixture-tested; some weird but relied-on formatting inputs may regress.
- Dependency scoping risk: adding unified/remark packages only where needed (
packages/gateway) is important, but adding them via loose ranges or conflicting versions can create bundle-size and startup overhead, especially in worker processes. - Current ADR migration sequence is conceptual only; no explicit rollback criterion is specified even though Telegram send fallback behavior is already user-visible.
Recommendations
- Make chunking parser-aware: split on safe boundaries (outside tags/entities), and validate each chunk before send; when invalid, either repair chunk boundaries or fall back deterministically.
- Fix fallback behavior to preserve full payload: if HTML parsing fails for a message, continue chunked plain sends of full stripped text rather than slicing one truncated chunk and breaking.
- Add a migration safety harness: dual-run conversion in staging (old regex and AST converter), with fixture coverage for nested emphasis/links/code/blockquote, HTML-in-markdown, emoji/UTF-8, and 4KB+ messages.
- Gate conversion mode via feature flag and emit telemetry (
converter_mode,invalid_html_rate,fallback_rate,truncated_chunks) so rollback can be automatic and observable. - Keep unified/remark dependency scope local to
packages/gatewayand document any parser contract (allowed_nodes) as part of the ADR so converter expectations are explicit before production rollout. - Consider Telegram’s entity-based send API as a strategic alternative in a future phase if HTML parse_mode continues to be a recurring constraint under long-form/complex formatting.
Implementation Log (2026-02-25)
- Package:
@joelclaw/markdown-formatter— 21 tests,TelegramConverterwithconvert(),chunk(),validate() - Gateway wiring:
formatByEnvelope()runs converter behindUSE_AST_FORMATTERenv flag. Auto-validates result, falls back to regex on failure. - Feature flag:
USE_AST_FORMATTER=1to enable. Default off — regex path is production default until validation period completes. - Biome: Package boundary enforced via
noRestrictedImports(ADR-0144)