Evaluation & calibration

Pre-stage understanding gates

Two quick understanding checks that catch a misread of the assignment before it turns into days of wrong work.

What it is

Two understanding checks that run before the user starts each stage. The note check scores their notes against the canonical points of the brief. The kickoff playback asks them to say back the stage's objective in their own words right before they dive in. Both write into the calibration substrate so the platform knows how clearly the user entered the stage.

What it's for

A misread of the brief turns into days of building the wrong thing, and catching it before the work starts is much cheaper than catching it after. The note check runs while the user reads the brief and scores how well their notes cover the canonical points. The kickoff playback runs right before they start: the user states the stage's objective in their own words so the platform can tell whether they actually understood it. If the playback is vague, the platform pushes back instead of letting them dive in already lost.

How it was built

Both run as FastAPI WebSocket endpoints with stateless Gemini agents underneath. The note check scores each user-written bullet against the canonical points from the stage script and returns the matched points, the unmatched points, and an understanding percentage. The kickoff playback is intentionally thin: there is no coaching scaffolding around the model, so it cannot soften its own verdict when the user gives a vague answer. Both checks write their score into the calibration substrate, so the next heavy completion knows how clearly the user entered the stage and can weigh independence and quality against that baseline.
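To make the note check's return shape concrete, here is a minimal sketch of the matched / unmatched / percentage result. It is an illustration under stated assumptions, not the project's code: the names `NoteCheckResult`, `covers`, and `score_notes` are hypothetical, the percentage is assumed to be the share of canonical points covered by at least one note, and a crude token-overlap test stands in for the judgement that the Gemini agent makes in the real system.

```python
from dataclasses import dataclass, asdict
import re


@dataclass
class NoteCheckResult:
    matched: list[str]        # canonical points covered by at least one note
    unmatched: list[str]      # canonical points no note covered
    understanding_pct: float  # assumed: share of canonical points covered, 0-100


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def covers(note: str, point: str, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in for the model's verdict: token overlap.

    Returns True when at least `threshold` of the point's tokens
    also appear in the note.
    """
    point_tokens = _tokens(point)
    if not point_tokens:
        return False
    overlap = len(point_tokens & _tokens(note)) / len(point_tokens)
    return overlap >= threshold


def score_notes(notes: list[str], canonical_points: list[str]) -> NoteCheckResult:
    """Score user notes against the canonical points of a brief."""
    matched = [p for p in canonical_points if any(covers(n, p) for n in notes)]
    unmatched = [p for p in canonical_points if p not in matched]
    pct = 100.0 * len(matched) / len(canonical_points) if canonical_points else 0.0
    return NoteCheckResult(matched, unmatched, round(pct, 1))


# Example data is invented for illustration only.
canonical = [
    "ship a csv export endpoint",
    "limit exports to admin users",
]
notes = ["Need a CSV export endpoint that we ship this stage"]

result = score_notes(notes, canonical)
payload = asdict(result)  # plain dict, ready to serialise as JSON
```

A dict like `payload` is the kind of object an endpoint could send back over the WebSocket as JSON. Keeping the scorer a pure function of its inputs fits the stateless design described above, since nothing carries over between calls.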

My role

Sole author of both checks.

Built with

Python, FastAPI, WebSockets, Gemini, Per-segment evaluation, Stateless agent

Want the full technical depth, the tradeoffs, what broke, what I'd do differently? Ask the agent about this project.
