Evaluation & calibration

Eval observability: run log and cohort analytics

Two observability tools that watch every evaluation the platform runs: a per-call run log and a cohort analytics view that reads the log to surface where the curriculum is breaking down.

What it is

The eval run log captures every eval call as a row: the inputs (rubric, submission, prior context), the output (verdict and structured feedback), the model used, the prompt version, and the latency. The eval analytics view is a cohort-level dashboard that reads from the same log to show what is happening across many users at once: which stages learners pass too easily, which stages stall everyone, and where verdicts are drifting away from human review.
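As a rough illustration of what one row might hold, here is a minimal sketch in TypeScript. The field names, the `Verdict` values, and the `toRunRow` helper are all assumptions made for this example; the real table may name and type things differently.

```typescript
// Hypothetical shape of one run-log row. Field names are illustrative,
// not the platform's actual schema.
type Verdict = "pass" | "fail";

interface EvalInput {
  rubric: string;
  submission: string;
  priorContext: string[];
}

interface EvalOutput {
  verdict: Verdict;
  feedback: Record<string, string>; // structured feedback, keyed by criterion
}

interface EvalRunRow extends EvalInput, EvalOutput {
  id: string;
  model: string;
  promptVersion: string;
  latencyMs: number;
  createdAt: string; // ISO 8601 timestamp of when the call started
}

interface RunMeta {
  id: string;
  model: string;
  promptVersion: string;
  startedAtMs: number;
  finishedAtMs: number;
}

// Flatten one eval call (input, output, and call metadata) into a single row.
function toRunRow(input: EvalInput, output: EvalOutput, meta: RunMeta): EvalRunRow {
  return {
    id: meta.id,
    rubric: input.rubric,
    submission: input.submission,
    priorContext: [...input.priorContext], // copy so the snapshot cannot change later
    verdict: output.verdict,
    feedback: { ...output.feedback },
    model: meta.model,
    promptVersion: meta.promptVersion,
    latencyMs: meta.finishedAtMs - meta.startedAtMs,
    createdAt: new Date(meta.startedAtMs).toISOString(),
  };
}
```

Storing the inputs alongside the output is what makes a later replay possible: the row alone is enough to reconstruct what the model was given.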

What it's for

When a learner's verdict feels off, ops needs to pull the exact eval run, see what the model was given, see what it returned, and replay it. When the curriculum starts breaking down (everyone failing this stage, or everyone passing it without effort), ops needs to see that before the cohort drops out. The two tools answer different questions about the same surface: 'what happened on this specific call?' and 'what is happening across all calls?' Sharing one source means an alert in analytics drills down to a specific run with one click.
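The cohort-level question ("is everyone failing this stage, or passing it without effort?") could be answered with a pass-rate check per stage. The sketch below is one way to express that idea; the thresholds, the minimum sample size, and the `flagStages` name are assumptions for illustration, not the dashboard's actual rules.

```typescript
// Hypothetical minimal view of a run-log row for cohort checks.
interface StageRun {
  stage: string;
  verdict: "pass" | "fail";
}

type StageFlag = "too_easy" | "stalling" | "ok";

// Flag stages whose pass rate is suspiciously high or low.
// Stages with fewer than `minRuns` runs are left out: too little data to judge.
function flagStages(
  runs: StageRun[],
  lowPassRate = 0.2, // assumed threshold: at or below this, the stage is stalling people
  highPassRate = 0.95, // assumed threshold: at or above this, the stage is too easy
  minRuns = 20
): Record<string, StageFlag> {
  const counts = new Map<string, { total: number; passed: number }>();
  for (const run of runs) {
    const c = counts.get(run.stage) ?? { total: 0, passed: 0 };
    c.total += 1;
    if (run.verdict === "pass") c.passed += 1;
    counts.set(run.stage, c);
  }

  const flags: Record<string, StageFlag> = {};
  counts.forEach((c, stage) => {
    if (c.total < minRuns) return;
    const rate = c.passed / c.total;
    flags[stage] = rate >= highPassRate ? "too_easy" : rate <= lowPassRate ? "stalling" : "ok";
  });
  return flags;
}
```

Because each flag is computed from individual rows, a flagged stage can always be traced back to the specific runs behind it.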

How it was built

The run log is a Postgres table written on every eval call, capturing the full input and output snapshot, the model, the prompt version, the latency, and the source touchpoint. The analytics view is a React page on the ops dashboard that fires its aggregate queries over the run log table in parallel (KPI cards, daily trends, per-stage breakdowns, issue distributions, model-cost cuts), so the whole set returns in about the time of one network round trip. The two tools share a schema, so a row that looks suspicious in the cohort view links straight to its full input, output, and prompt diff on the per-run view.
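The parallel query pattern can be sketched roughly as follows. This is an illustration under assumptions, not the dashboard's actual code: the `eval_runs` table, its column names, and the injected `runQuery` helper are hypothetical, and the real page may batch or transport its queries differently.

```typescript
// Assumption: `runQuery` is a hypothetical helper that sends one SQL string
// to the run-log database and resolves with the result rows.
type Row = Record<string, unknown>;
type RunQuery = (sql: string) => Promise<Row[]>;

// Hypothetical table and column names: one aggregate query per dashboard panel.
const PANEL_QUERIES = {
  kpis: "SELECT count(*) AS runs, avg(latency_ms) AS avg_latency_ms FROM eval_runs",
  dailyTrends:
    "SELECT date_trunc('day', created_at) AS day, count(*) AS runs FROM eval_runs GROUP BY 1 ORDER BY 1",
  perStage:
    "SELECT stage, avg((verdict = 'pass')::int) AS pass_rate FROM eval_runs GROUP BY stage",
  issues: "SELECT issue, count(*) AS runs FROM eval_runs GROUP BY issue",
  modelCost: "SELECT model, sum(cost_usd) AS cost_usd FROM eval_runs GROUP BY model",
} as const;

type PanelName = keyof typeof PANEL_QUERIES;

async function loadDashboard(runQuery: RunQuery): Promise<Record<PanelName, Row[]>> {
  const names = Object.keys(PANEL_QUERIES) as PanelName[];
  // Start every query before awaiting any of them, so they run concurrently
  // instead of one after another.
  const pending = names.map((name) => runQuery(PANEL_QUERIES[name]));
  const results = await Promise.all(pending);

  const panels = {} as Record<PanelName, Row[]>;
  names.forEach((name, i) => {
    panels[name] = results[i];
  });
  return panels;
}
```

With this shape, total wait time is bounded by the slowest single query, not the sum of all five. One tradeoff: `Promise.all` rejects as soon as any query fails, so a page that should still render partial results would need `Promise.allSettled` instead.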

My role

Major contributor to the per-eval capture path, the model and prompt-version tagging on every row, and the analytics queries the ops dashboard fires in parallel.

Built with

React, TypeScript, Postgres, Supabase, Recharts, Python
