When the LLM Points — Embedding Screen Coordinates in Streaming AI Responses
The [POINT:x,y:label:screenN] tag pattern is a clean way to turn structured LLM text output into physical UI gestures — relevant to any agent surface that needs to reference specific elements on screen.
Farza built Clicky as a macOS menu bar app that watches your screen, listens to your voice, and talks back — like a real tutor sitting next to you. Push-to-talk via Control+Option: the app captures a screenshot and a transcript, ships both to Claude via streaming SSE, and plays the response through ElevenLabs TTS. Standard multi-modal pipe. The clever bit is the pointing.
Claude can embed [POINT:x,y:label:screenN] tags in its responses. A transparent full-screen NSPanel overlay parses those tags out of the stream and flies a blue cursor to those coordinates — across multiple monitors. The model doesn’t “control” your cursor; it annotates its own text output with spatial references that a simple overlay renders. Clean separation: Claude generates structured text, the app interprets it as gesture. No tool calls, no computer-use API, just text parsing.
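The parsing half of this is simple enough to sketch. Below is a minimal streaming parser for the tag format (in TypeScript for brevity; Clicky itself is Swift): it strips complete [POINT:...] tags out of each SSE text chunk and holds back a possibly half-arrived tag until the next chunk. The tag syntax is from the article; the PointParser name and buffering details are illustrative, not the app's actual code.

```typescript
// Illustrative streaming parser for [POINT:x,y:label:screenN] tags.
// Not Clicky's real implementation — a sketch of the technique.
interface PointTag {
  x: number;
  y: number;
  label: string;
  screen: number;
}

const TAG_RE = /\[POINT:(\d+),(\d+):([^:\]]+):screen(\d+)\]/g;

class PointParser {
  private buffer = "";

  // Feed each SSE text chunk; returns the speakable text plus any
  // complete point tags found so far.
  feed(chunk: string): { text: string; points: PointTag[] } {
    this.buffer += chunk;
    const points: PointTag[] = [];
    let text = "";
    let lastIndex = 0;
    TAG_RE.lastIndex = 0;
    let m: RegExpExecArray | null;
    // Pull every complete tag out of the buffer.
    while ((m = TAG_RE.exec(this.buffer)) !== null) {
      text += this.buffer.slice(lastIndex, m.index);
      points.push({ x: +m[1], y: +m[2], label: m[3], screen: +m[4] });
      lastIndex = TAG_RE.lastIndex;
    }
    const rest = this.buffer.slice(lastIndex);
    // Hold back a possibly incomplete tag at the end of the buffer,
    // so a tag split across two SSE chunks still parses next time.
    const open = rest.lastIndexOf("[");
    if (open !== -1 && !rest.slice(open).includes("]")) {
      text += rest.slice(0, open);
      this.buffer = rest.slice(open);
    } else {
      text += rest;
      this.buffer = "";
    }
    return { text, points };
  }
}
```

The overlay layer would consume the emitted points (moving the cursor, labeling the target) while the remaining text goes to TTS — which is exactly the separation the app exploits: the model only ever produces text.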
The whole thing proxies through a Cloudflare Worker to keep API keys out of the binary — three routes: /chat, /tts, and /transcribe-token for AssemblyAI WebSocket auth. The Swift project is canonically named leanring-buddy — the typo is intentional per the author, long story. It's genuinely open and built to be hacked on: the README leads with a Claude Code prompt that clones and sets up the whole thing, and there's an AGENTS.md in the repo for agent context. Farza built this to be developed by AI from the jump.
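The Worker pattern itself is worth sketching, since it's reusable anywhere. The route names come from the article; everything else below — upstream URLs, header names, env binding names like ANTHROPIC_API_KEY — is an assumed, minimal version, not Clicky's actual Worker code.

```typescript
// Sketch of the keys-stay-on-the-Worker proxy pattern (assumed
// implementation; route names from the article, the rest illustrative).
interface Env {
  ANTHROPIC_API_KEY: string; // set via `wrangler secret put`, never in the app
  ELEVENLABS_API_KEY: string;
  ASSEMBLYAI_API_KEY: string;
}

const worker = {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { pathname } = new URL(request.url);

    if (pathname === "/chat") {
      // Forward the chat payload to Anthropic; the SSE stream passes back.
      return fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
        headers: {
          "x-api-key": env.ANTHROPIC_API_KEY,
          "anthropic-version": "2023-06-01",
          "content-type": "application/json",
        },
        body: await request.text(),
      });
    }

    if (pathname === "/tts") {
      // VOICE_ID is a placeholder; ElevenLabs streams audio bytes back.
      return fetch("https://api.elevenlabs.io/v1/text-to-speech/VOICE_ID", {
        method: "POST",
        headers: {
          "xi-api-key": env.ELEVENLABS_API_KEY,
          "content-type": "application/json",
        },
        body: await request.text(),
      });
    }

    if (pathname === "/transcribe-token") {
      // Mint a short-lived AssemblyAI realtime token for the app's WebSocket.
      return fetch("https://api.assemblyai.com/v2/realtime/token", {
        method: "POST",
        headers: {
          authorization: env.ASSEMBLYAI_API_KEY,
          "content-type": "application/json",
        },
        body: JSON.stringify({ expires_in: 300 }),
      });
    }

    return new Response("not found", { status: 404 });
  },
};

export default worker;
```

The app only ever knows the Worker URL; rotating or revoking keys never requires shipping a new binary.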
Worth watching as a pattern for any agent UI that needs spatial grounding. “Look at this thing on screen” is a natural teaching gesture that current tool APIs make cumbersome — the computer-use path is heavy. Embedding coordinates as structured text output that a thin overlay layer renders is a much simpler path to the same effect.
Key Ideas
- [POINT:x,y:label:screenN] tags in Claude's streaming SSE response drive a transparent cursor overlay — LLM spatial gestures as structured text, no special tool calls required
- Full-screen transparent NSPanel renders pointing without touching the real cursor — non-invasive, elegant, and monitor-aware
- Cloudflare Worker proxy pattern: API keys never ship in the binary, the Worker holds all secrets, the app just talks to a Worker URL you own
- Push-to-talk streams audio to AssemblyAI via WebSocket for real-time transcription, then screenshot + transcript → Claude → ElevenLabs TTS for the full voice loop
- Uses ScreenCaptureKit (macOS 14.2+) — modern screen capture API, not the legacy path
- AGENTS.md in the repo provides Claude Code context — built to be extended by AI agents from the start
- The original tweet apparently went wide; the open-source release is for people who want to build their own features