Real-time AI manager and coach
Real-time voice agent that manages and coaches the user through their work, pulls up resources when needed, and checks the deliverable as it forms.
A real-time AI manager and coach that sits beside the user during the work itself. It talks through the task over voice, watches the screen-share, points at the right thing on screen, surfaces the right resource or reference at the right moment, and keeps a live check on whether the deliverable is taking shape the way the stage asked for.
Some kinds of learning only work when someone is with the user in the moment: a coach who replies the second they stall, who sees what they see, who hands over the right reference at the right time, and who reads the deliverable as it forms rather than only after it is submitted. This project makes that kind of coaching available to every user without putting a human in every seat.
Built on LiveKit for real-time audio and video transport. Claude Sonnet 4.6 runs the conversation, and a Computer Use loop gives the model live vision over the user's screen so it can describe what is happening and point at specific elements as the user works.
The pointing runs through a coordinate-mapping formula written for this system: the model emits a target in screenshot space (the scaled-down image it actually saw), and the frontend scales it to the user's real display, subtracts the window position and browser chrome, and clamps to the live viewport, so the AI's 'look here' lands on the right pixel even after the user resizes the window or moves it across screens. The key trick is that the AI never has to know the user's display, window position, or chrome: it only ever emits a pixel inside the screenshot it just saw, and the frontend pulls the live browser values at the moment of pointing to bridge the three coordinate spaces. That same formula is what the core UI still uses today.
The agent decides when to speak instead of waiting for the user to ask, so silence reads as a signal to step in rather than a pause to wait through. Alongside the conversation, the engine carries the stage's deliverable spec and watches the work take shape against it, surfacing the right resource at the right moment instead of dumping everything up front, and quietly noting which parts of the deliverable are done, missing, or off-target.
ElevenLabs voices each persona, with a streaming text-to-speech path that starts generating the first words while the model is still finishing its sentence, getting first audio out in under 500 milliseconds. When the session ends, the engine fans the transcript out to performance scoring and conversation memory, so what happened in the session shapes how the platform reads the user from that point on.
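One way to picture the streaming text-to-speech path is a small buffer that releases text to the voice engine at clause boundaries, so synthesis can begin before the model finishes its sentence. The sketch below is only an illustration under that assumption: `SpeechChunker`, `minChars`, and the punctuation rule are hypothetical names and choices, not the project's actual client, and no ElevenLabs or LiveKit API is called.

```typescript
// Hypothetical helper: turns a stream of model text deltas into speakable
// chunks. Each chunk could then be handed to a streaming TTS request.
class SpeechChunker {
  private buffer = "";
  private readonly minChars: number;

  constructor(minChars: number = 12) {
    // Assumed tuning knob: avoid sending very short fragments such as "Ok,".
    this.minChars = minChars;
  }

  // Append a delta and return any chunks that are ready to be spoken.
  push(delta: string): string[] {
    this.buffer += delta;
    const ready: string[] = [];
    let searchFrom = 0;
    while (true) {
      // A boundary is punctuation followed by whitespace.
      const match = /[.!?,;:]\s/.exec(this.buffer.slice(searchFrom));
      if (!match) break;
      const end = searchFrom + match.index + 1; // keep the punctuation mark
      if (end >= this.minChars) {
        ready.push(this.buffer.slice(0, end).trim());
        this.buffer = this.buffer.slice(end).trimStart();
        searchFrom = 0;
      } else {
        // Too short to speak on its own; look for the next boundary.
        searchFrom = end;
      }
    }
    return ready;
  }

  // Call once the model has finished, to release whatever text remains.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Splitting at clause boundaries rather than full sentences trades a little prosody for lower time to first audio; the threshold would need tuning against the real voice engine.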
Coordinate-mapping formula: bridges three coordinate spaces (AI screenshot → user display → browser viewport) and normalizes the result to a [0, 1] ratio at the moment of pointing.
Step 1. Screenshot pixel to real-display pixel, scaling by the ratio of the user's actual display to the screenshot the AI saw.
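Written out, step 1 might look like the following. The symbols are assumptions chosen for illustration, since the original notation is not reproduced here.

```latex
% Assumed symbols (not the original notation):
%   (x_s, y_s)  target pixel in the screenshot the AI saw
%   W_s, H_s    screenshot width and height
%   W_d, H_d    width and height of the user's real display
%   (x_d, y_d)  resulting real-display pixel
\[
x_d = x_s \cdot \frac{W_d}{W_s}, \qquad
y_d = y_s \cdot \frac{H_d}{H_s}
\]
```

For example, with a 1280-pixel-wide screenshot of a 2560-pixel-wide display, a target at x_s = 640 maps to x_d = 1280.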
Step 2. Real-display pixel to browser-viewport pixel. Subtract the window's position on the display and the browser chrome (the title bar and borders that wrap the page), then clamp to the viewport. The chrome width is split between the left and right edges; the chrome height sits entirely at the top.
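A possible form of step 2, again with assumed symbols. It takes the chrome width to be split evenly between left and right, which is one reading of the description rather than a confirmed detail.

```latex
% Assumed symbols (not the original notation):
%   (x_d, y_d)  real-display pixel from step 1
%   X_w, Y_w    window position on the display
%   W_o, H_o    outer window size (page plus chrome)
%   W_i, H_i    inner viewport size (page only)
%   clamp(v, a, b) = min(max(v, a), b)
\[
x_v = \mathrm{clamp}\!\left(x_d - X_w - \frac{W_o - W_i}{2},\; 0,\; W_i\right), \qquad
y_v = \mathrm{clamp}\!\left(y_d - Y_w - (H_o - H_i),\; 0,\; H_i\right)
\]
```

In a browser, these live values would plausibly come from properties such as `window.screenX`, `window.screenY`, `window.outerWidth`, `window.outerHeight`, `window.innerWidth`, and `window.innerHeight`.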
Step 3. Browser-viewport pixel to a window-size-agnostic ratio in [0, 1], which is what the pointer animation actually consumes.
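Step 3 reduces to a division by the viewport size. The symbols below are assumed for illustration.

```latex
% Assumed symbols (not the original notation):
%   (x_v, y_v)  clamped browser-viewport pixel from step 2
%   W_i, H_i    inner viewport width and height
%   (u, v)      normalized pointer position
\[
u = \frac{x_v}{W_i}, \qquad v = \frac{y_v}{H_i}, \qquad (u, v) \in [0, 1]^2
\]
```

Because step 2 clamps to the viewport, both ratios stay inside [0, 1] even for off-screen targets.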
The clamp in step 2 makes off-screen targets land at the viewport edge instead of vanishing. Normalization in step 3 keeps the pointer correct after any resize. The AI is completely decoupled from the user's display setup: it only sees the screenshot and the target pixel it emitted inside it, never the live values (display size, window position, chrome size, viewport size), which the frontend reads at the moment of pointing.
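The three steps can be sketched as one pure function. This is a minimal illustration, not the production code: every type and function name here is hypothetical, the chrome width is assumed to split evenly between left and right, and the live values are passed in as a plain object so the mapping can be exercised outside a browser.

```typescript
// Hypothetical types; only the three-step mapping itself comes from the text.
interface Point { x: number; y: number }
interface Size { width: number; height: number }

// Live values the frontend would read at the moment of pointing, e.g. from
// window.screen, window.screenX/screenY, window.outer* and window.inner*.
interface LiveBrowserValues {
  display: Size;    // user's real display size
  windowPos: Point; // window position on the display
  outer: Size;      // window size including chrome
  inner: Size;      // viewport size (page only)
}

const clamp = (value: number, lo: number, hi: number): number =>
  Math.min(Math.max(value, lo), hi);

function mapScreenshotPointToViewportRatio(
  target: Point,    // pixel the model emitted, in screenshot space
  screenshot: Size, // size of the screenshot the model saw
  live: LiveBrowserValues,
): Point {
  // Step 1: screenshot pixel -> real-display pixel.
  const displayX = target.x * (live.display.width / screenshot.width);
  const displayY = target.y * (live.display.height / screenshot.height);

  // Step 2: real-display pixel -> viewport pixel, clamped to the viewport.
  // Assumption: chrome width is split evenly left/right; height is all on top.
  const chromeLeft = (live.outer.width - live.inner.width) / 2;
  const chromeTop = live.outer.height - live.inner.height;
  const viewportX = clamp(displayX - live.windowPos.x - chromeLeft, 0, live.inner.width);
  const viewportY = clamp(displayY - live.windowPos.y - chromeTop, 0, live.inner.height);

  // Step 3: viewport pixel -> window-size-agnostic ratio in [0, 1].
  return {
    x: viewportX / live.inner.width,
    y: viewportY / live.inner.height,
  };
}
```

Keeping the function free of direct `window` reads means the caller decides when the live values are sampled, which matches the idea that they are pulled at the moment of pointing rather than cached.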
Tech lead on the cowork variants. Owned the lifecycle contract, the per-utterance ingest client, the end-of-session fanout, and the coordinate-mapping formula that makes the AI's on-screen highlighting land on the right pixel across any window size (still used by the core UI today).
Want the full technical depth, the tradeoffs, what broke, what I'd do differently? Ask the agent about this project.