Real-time AI manager and coach

Built on LiveKit for real-time audio and video transport. Claude Sonnet 4.6 runs the conversation, and a Computer Use loop gives the model live vision over the user's screen so it can describe what is happening and point at specific elements as the user works.

The pointing runs through a coordinate-mapping formula written for this system: the model emits a target in screenshot space (the scaled-down image it actually saw), and the frontend scales it to the user's real display, subtracts the window position and browser chrome, and clamps to the live viewport, so the AI's 'look here' lands on the right pixel even after the user resizes the window or moves it across screens. The key point is that the AI never has to know the user's display, window position, or chrome: it only ever emits a pixel inside the screenshot it just saw, and the frontend reads the live browser values at the moment of pointing to bridge the three coordinate spaces. The core UI still uses that same formula today.

The agent decides when to speak instead of waiting for the user to ask, so silence reads as a signal to step in rather than a pause to wait through. Alongside the conversation, the engine carries the stage's deliverable spec and watches the work take shape against it, surfacing the right resource at the right moment instead of dumping everything up front, and quietly noting which parts of the deliverable are done, missing, or off-target.

ElevenLabs voices each persona, with a streaming text-to-speech path that starts synthesizing the first words while the model is still finishing its sentence, which brings time to first audio under 500 milliseconds. When the session ends, the engine fans the transcript out to performance scoring and conversation memory, so what happened in the session shapes how the platform reads the user from that point on.
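The streaming text-to-speech path mentioned above depends on handing text to the voice engine before the model has finished its sentence. The sketch below is one possible way to do that, not the system's actual code: it buffers streamed model tokens and releases a chunk at each punctuation boundary. The function name `speakable_chunks`, the boundary rule, and the `min_chars` threshold are all hypothetical, and no real TTS client call is shown.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: clause or sentence punctuation followed by whitespace.
_BOUNDARY = re.compile(r"[.!?;:,]\s")


def speakable_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable text chunks as soon as a punctuation boundary arrives.

    `tokens` stands in for the model's streamed output; each yielded chunk
    would be sent to a streaming TTS client (not shown here).
    """
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            # Find the first boundary that leaves a chunk of at least min_chars.
            match = next(
                (m for m in _BOUNDARY.finditer(buffer) if m.end() >= min_chars),
                None,
            )
            if match is None:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    # Flush whatever remains once the model stops streaming.
    if buffer.strip():
        yield buffer.strip()


stream = ["Nice", " work", " so", " far.", " Now", " open", " the", " brief,",
          " then", " check", " the", " budget", " table."]
chunks = list(speakable_chunks(stream))
print(chunks)
```

With this input the first chunk is released as soon as the token after "far." arrives, so synthesis can begin while the rest of the reply is still being generated. A smaller `min_chars` lowers latency at the cost of choppier prosody.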

Coordinate-mapping formula: bridges three coordinate spaces (AI screenshot → user display → browser viewport) at the moment of pointing, then normalizes the result to $[0, 1]$.

Step 1. Screenshot pixel $(s_x, s_y)$ to real-display pixel $(d_x, d_y)$, scaling by the ratio of the user's actual display size to the size of the screenshot the AI saw:

$$d_x = s_x \cdot \frac{W_{\text{display}}}{W_{\text{screenshot}}}, \qquad d_y = s_y \cdot \frac{H_{\text{display}}}{H_{\text{screenshot}}}$$
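A worked example of step 1, using assumed numbers that are not taken from the system: a $1280 \times 800$ screenshot of a $2560 \times 1600$ display, with the model pointing at $(s_x, s_y) = (640, 400)$:

$$d_x = 640 \cdot \frac{2560}{1280} = 1280, \qquad d_y = 400 \cdot \frac{1600}{800} = 800$$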

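Step 2. Real-display pixel to browser-viewport pixel $(b_x, b_y)$. Subtract the window's position on the display, $(X_{\text{win}}, Y_{\text{win}})$, and the browser chrome (the title bar and borders that wrap the page). Here $W_{\text{outer}} \times H_{\text{outer}}$ is the size of the whole browser window and $W_{\text{inner}} \times H_{\text{inner}}$ is the size of the viewport inside it. The formula assumes the chrome width is split evenly between the left and right edges, while the chrome height sits entirely at the top:

$$C_x = W_{\text{outer}} - W_{\text{inner}}, \qquad C_y = H_{\text{outer}} - H_{\text{inner}}$$

$$b_x = \mathrm{clamp}\!\left(d_x - X_{\text{win}} - \tfrac{C_x}{2},\; 0,\; W_{\text{inner}}\right)$$

$$b_y = \mathrm{clamp}\!\left(d_y - Y_{\text{win}} - C_y,\; 0,\; H_{\text{inner}}\right)$$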
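A worked example of step 2, again with assumed numbers: a display pixel $(d_x, d_y) = (1280, 800)$, a window positioned at $(X_{\text{win}}, Y_{\text{win}}) = (200, 100)$, an outer size of $1616 \times 988$, and an inner size of $1600 \times 900$:

$$C_x = 1616 - 1600 = 16, \qquad C_y = 988 - 900 = 88$$

$$b_x = \mathrm{clamp}(1280 - 200 - 8,\; 0,\; 1600) = 1072, \qquad b_y = \mathrm{clamp}(800 - 100 - 88,\; 0,\; 900) = 612$$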

Step 3. Browser-viewport pixel to a window-size-agnostic ratio in $[0, 1]$, which is what the pointer animation actually consumes:

$$t_x = \frac{b_x}{W_{\text{inner}}}, \qquad t_y = \frac{b_y}{H_{\text{inner}}}$$

The clamp in step 2 makes off-screen targets land at the viewport edge instead of vanishing. The normalization in step 3 keeps the pointer correct after any resize. The AI is decoupled from the user's display setup completely: it only works with $(s_x, s_y)$ and the screenshot it was shown, never the live values $W_{\text{display}}$, $X_{\text{win}}$, $C_y$, etc., which the frontend reads at the moment of pointing.
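The three steps can be combined into a single function. The following is a minimal sketch written in Python for readability, not the frontend's actual code: the `LiveValues` container, its field names, and `map_point` are illustrative, and the example numbers are assumed. It also assumes that the display size, window position, and window dimensions are all reported in the same pixel units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveValues:
    """Values the frontend would read at the moment of pointing (names are illustrative)."""
    display_w: float  # W_display
    display_h: float  # H_display
    win_x: float      # X_win
    win_y: float      # Y_win
    outer_w: float    # W_outer
    outer_h: float    # H_outer
    inner_w: float    # W_inner
    inner_h: float    # H_inner


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(value, high))


def map_point(sx: float, sy: float, shot_w: float, shot_h: float,
              live: LiveValues) -> tuple[float, float]:
    """Map a screenshot pixel to a normalized viewport ratio (t_x, t_y)."""
    # Step 1: screenshot pixel -> real-display pixel.
    dx = sx * live.display_w / shot_w
    dy = sy * live.display_h / shot_h

    # Step 2: display pixel -> viewport pixel. Chrome width is assumed to be
    # split left/right; chrome height is assumed to sit entirely at the top.
    chrome_x = live.outer_w - live.inner_w
    chrome_y = live.outer_h - live.inner_h
    bx = clamp(dx - live.win_x - chrome_x / 2, 0, live.inner_w)
    by = clamp(dy - live.win_y - chrome_y, 0, live.inner_h)

    # Step 3: viewport pixel -> ratio in [0, 1].
    return bx / live.inner_w, by / live.inner_h


live = LiveValues(display_w=2560, display_h=1600, win_x=200, win_y=100,
                  outer_w=1616, outer_h=988, inner_w=1600, inner_h=900)

print(map_point(640, 400, 1280, 800, live))  # (0.67, 0.68)
print(map_point(0, 0, 1280, 800, live))      # (0.0, 0.0): off-window target clamped to the edge
```

The second call shows the clamp at work: the screenshot's top-left corner lies outside the browser window, so the pointer lands on the viewport's top-left edge rather than at a negative coordinate. Because `live` is passed in at call time, the model-side inputs `(sx, sy)` stay the same no matter how the window has moved or resized since the screenshot was taken.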