Your AI Teacher Should Be Able to Point

repo · ai · macos · swift · claude · teaching · voice · screen-capture · interface-patterns · open-source

The [POINT:x,y] coordinate tag protocol — model output driving spatial UI behavior — is a pattern worth stealing for agent overlay work.

Clicky is a macOS menu bar app built by Farza that puts an AI teaching assistant next to your cursor. It can see your screen, hear your voice, talk back, and point at specific UI elements while it’s explaining things. The original demo tweet apparently blew up, and now it’s open source.

The architecture is a clean stack: push-to-talk feeds audio over WebSocket to AssemblyAI for real-time transcription, the transcript plus a screenshot goes to Claude as a streaming request over server-sent events (SSE), and the response plays through ElevenLabs TTS. API keys never touch the app binary — they live in a Cloudflare Worker that acts as a thin proxy. Two NSPanel windows handle the UI: a control panel dropdown for the menu bar, and a full-screen transparent overlay that drives the cursor.
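The "transcript plus screenshot" request is worth seeing concretely. Here is a sketch of what the app might assemble before handing it to the Worker, using the Anthropic Messages API shapes; the helper name, model id, and the assumption that the proxy forwards this body unchanged are all mine, not the repo's.

```typescript
// Illustrative sketch: the push-to-talk transcript as a text block plus the
// hotkey-moment screenshot as a base64 image block, in Anthropic Messages
// API format. buildChatBody is a hypothetical helper, not from the repo.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; source: { type: "base64"; media_type: string; data: string } };

function buildChatBody(transcript: string, screenshotB64: string) {
  return {
    model: "claude-sonnet-4-20250514", // illustrative model id
    max_tokens: 1024,
    stream: true, // response comes back as SSE, matching the stack above
    messages: [
      {
        role: "user",
        content: [
          // Image first so the model "sees" the screen before the question.
          { type: "image", source: { type: "base64", media_type: "image/png", data: screenshotB64 } },
          { type: "text", text: transcript },
        ] as ContentBlock[],
      },
    ],
  };
}
```

The point of the shape is temporal pairing: the screenshot is captured at the hotkey moment, so the image block and the transcript describe the same instant.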

The clever part is how pointing works. Claude doesn’t just describe what to click — it embeds coordinate tags like [POINT:x,y:label:screenN] directly in its response text. The overlay window parses those tags and flies the cursor to the specified position on any connected monitor. The model is doing spatial reasoning and expressing it as structured annotations mixed into natural language output, and the client side renders that as gesture. That’s a genuinely interesting protocol — model-embedded coordinates as a UI primitive.
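A minimal parser for those tags makes the protocol concrete. The tag grammar here is inferred from the post; the exact field order and the `screenN` suffix format are assumptions about the repo's implementation.

```typescript
// Parse [POINT:x,y:label:screenN] tags out of model output. The overlay
// would animate the cursor to each point; the stripped text goes to TTS
// so coordinates are never read aloud. Grammar inferred, not verified.
interface PointTag { x: number; y: number; label: string; screen: number }

const POINT_RE = /\[POINT:(-?\d+),(-?\d+):([^:\]]*):screen(\d+)\]/g;

function parsePointTags(text: string): { spoken: string; points: PointTag[] } {
  const points: PointTag[] = [];
  const spoken = text
    .replace(POINT_RE, (_m, x, y, label, screen) => {
      points.push({ x: Number(x), y: Number(y), label, screen: Number(screen) });
      return ""; // drop the tag from the spoken transcript
    })
    .replace(/\s{2,}/g, " ")
    .trim();
  return { spoken, points };
}
```

So a response like `"Click Save [POINT:840,512:Save button:screen1] in the toolbar."` yields one point on screen 1 and a clean sentence for TTS. The interesting design choice is that the annotation channel and the language channel share one stream: no second API call, no tool-use round trip, just inline structure the client peels off.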

The project folder is named leanring-buddy (typo, intentionally kept, apparently there’s a story). The CLAUDE.md is the canonical architecture doc written for agents to read first. The recommended setup path is literally “open Claude Code, paste one prompt, let it walk you through Xcode and Wrangler.” Which is exactly how this kind of repo should work now.

Key Ideas

  • [POINT:x,y:label:screenN] protocol — Claude embeds screen coordinates directly in response text; the overlay window parses them and moves the cursor. Model output drives spatial UI behavior.
  • Cloudflare Worker as API key firewall — a thin proxy at workers.dev holds Anthropic/AssemblyAI/ElevenLabs keys so they never ship in the app binary. Three routes: /chat, /tts, /transcribe-token.
  • Full-screen transparent NSPanel overlay — not a HUD, not a sidebar. A zero-interaction-surface window that sits above everything and just moves the cursor indicator around.
  • Push-to-talk + screen capture + TTS = teacher modality — the combination of voice in, vision, and voice out creates something qualitatively different from a text chat assistant. You’re not typing; you’re talking to something that can see what you’re looking at.
  • ScreenCaptureKit on macOS 14.2+ — screenshot is taken at the hotkey moment and attached to the Claude request. The model gets temporal context (what you were looking at when you asked).
  • CLAUDE.md as agent-first docs — architecture is documented for agents to read, not humans to skim. The setup path assumes Claude Code as the primary onboarding tool.
  • Menu bar only, no dock icon — lives in NSPanel, stays out of the way until invoked. The companion pattern, not the app pattern.
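The key-firewall idea from the list above fits in a few dozen lines of Worker code. The three route names come from the post; the env binding names, upstream URLs, and request details are assumptions sketched from the public Anthropic, ElevenLabs, and AssemblyAI APIs, not the repo's actual Worker.

```typescript
// Sketch of the Cloudflare Worker proxy: secrets live only in Worker env
// bindings, so the shipped app binary contains no keys. Binding names and
// upstream details are illustrative assumptions.
interface Env {
  ANTHROPIC_API_KEY: string;
  ELEVENLABS_API_KEY: string;
  ASSEMBLYAI_API_KEY: string;
}

const worker = {
  async fetch(req: Request, env: Env): Promise<Response> {
    const { pathname } = new URL(req.url);
    switch (pathname) {
      case "/chat":
        // Forward the chat body to Anthropic, injecting the key server-side.
        return fetch("https://api.anthropic.com/v1/messages", {
          method: "POST",
          headers: {
            "x-api-key": env.ANTHROPIC_API_KEY,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
          },
          body: req.body,
        });
      case "/tts":
        // ElevenLabs TTS; the voice id placeholder would be filled in app-side.
        return fetch("https://api.elevenlabs.io/v1/text-to-speech/<voice-id>", {
          method: "POST",
          headers: { "xi-api-key": env.ELEVENLABS_API_KEY, "content-type": "application/json" },
          body: req.body,
        });
      case "/transcribe-token":
        // AssemblyAI realtime wants a short-lived token minted server-side,
        // so the long-lived key never reaches the client.
        return fetch("https://api.assemblyai.com/v2/realtime/token", {
          method: "POST",
          headers: { authorization: env.ASSEMBLYAI_API_KEY, "content-type": "application/json" },
          body: JSON.stringify({ expires_in: 300 }),
        });
      default:
        return new Response("not found", { status: 404 });
    }
  },
};
// In a real Worker project this object would be the module's default export.
```

Note the asymmetry: `/chat` and `/tts` are dumb pass-throughs, but `/transcribe-token` mints a short-lived credential instead of proxying audio, since the realtime WebSocket connects straight from the app to AssemblyAI.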