Evaluation & calibration

Stage submission grading with anti-gaming

Submission system that extracts the user's artefact, grades it against the stage rubric, blocks resubmit-gaming, and ships back a verdict with a celebratory tagline and an XP award when the user passes.

What it is

A real-time submission system that runs when the user uploads their work at the end of a stage. It extracts the actual content out of whatever they submitted (PDF, image, code file, URL), grades that content against the stage rubric, returns a structured verdict (what landed, what did not, what to push on to exceed expectations), allocates XP if the verdict is a pass, and ships a short celebratory tagline written in the user's own voice. If the user tries to resubmit a failed attempt without actually fixing the underlying problem, the system catches them and blocks the pass flip.
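As a rough illustration of the structured verdict described above, the shape might look something like the following. This is only a sketch: the class and field names (`Verdict`, `Issue`, `Status`, `what_was_good`, and so on) are hypothetical, not the project's actual schema, and the rule that XP and tagline are cleared on a non-pass is an assumption drawn from the description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(str, Enum):
    # Hypothetical status values; the real status codes are not documented here.
    PASS = "pass"
    NEEDS_REVISION = "needs_revision"


@dataclass
class Issue:
    description: str
    location: str  # where in the artefact the problem sits, e.g. "page 2"


@dataclass
class Verdict:
    status: Status
    what_was_good: List[str] = field(default_factory=list)
    needs_fixing: List[Issue] = field(default_factory=list)
    exceed_expectations: List[str] = field(default_factory=list)
    xp: int = 0
    tagline: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: XP and the celebratory tagline only accompany a pass.
        if self.status is not Status.PASS:
            self.xp = 0
            self.tagline = None


passed = Verdict(Status.PASS, what_was_good=["Clear structure"], xp=50,
                 tagline="I nailed the brief on the first go")
failed = Verdict(Status.NEEDS_REVISION,
                 needs_fixing=[Issue("Chart has no axis labels", "page 2")],
                 xp=50, tagline="should be dropped")
```

Keeping the pass-only fields guarded in one place means a downstream consumer never has to re-check the status before awarding XP.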

What it's for

Submission is the moment of truth: the user hands over their work and the platform says either yes or no. If grading is wrong, the user loses trust. If grading can be gamed, the user learns the wrong strategy (resubmit blindly until it passes). If grading is slow, the user disengages while they wait. So this system has to be fast, has to be right, has to refuse to flip a pass on a faked fix, and has to deliver a verdict that feels like a read from someone who actually looked at the work.

How it was built

The core is a FastAPI WebSocket. On each submission, the gate check (is this submission even on-task) and the full evaluation run in parallel through asyncio.gather, so the verdict starts taking shape the moment the upload lands. It then streams back token by token to keep the latency feeling instant.

Extraction handles whatever the user submits: PDFs and images go through a multimodal extraction path, code files through structured parsing, URLs through a fetch and readability pipeline.

Grading runs through a model chain with one model as the primary and others as fallbacks, so a single provider outage does not stall every submission. The grader returns a structured shape: a status code, what-was-good items, what-needs-fixing items with their exact location in the artefact, how-to-exceed-expectations items, an XP value, and on a pass an achievement tagline written as the user would describe their own win.

The anti-gaming guard runs alongside: every submission file gets a SHA-256 fingerprint, and on any resubmit the new fingerprint is checked against the prior one and the new issue list is fuzzy-matched against the prior unresolved issues. If the file is essentially the same and the prior issues fuzzy-match the new ones, the pass is forced back to a needs-revision verdict, so blind resubmits cannot land a free pass.

At session close, the verdict fans out: XP into the experience-points engine, the verdict into the calibration substrate so the next heavy recalibration has it, the achievement tagline into the celebration screen, and the structured issues into the realtime feedback walkthrough so the manager NPC can talk the user through what to fix.
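The parallel gate check and the primary-plus-fallback grading chain could be sketched roughly as follows. Everything here is illustrative: `gate_check`, `grade_with_fallback`, `evaluate_submission`, and the two stub graders are hypothetical names standing in for real model calls, and the dict-shaped verdict is a simplification. Only `asyncio.gather` itself comes from the description.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Grader = Callable[[str], Awaitable[dict]]


async def grade_with_fallback(content: str, graders: Sequence[Grader]) -> dict:
    """Try each grader in order and return the first result that succeeds."""
    last_error = None
    for grader in graders:
        try:
            return await grader(content)
        except Exception as exc:  # e.g. a provider outage
            last_error = exc
    raise RuntimeError("all graders failed") from last_error


async def gate_check(content: str) -> bool:
    # Placeholder on-task check; a real gate would be far more involved.
    return bool(content.strip())


async def evaluate_submission(content: str, graders: Sequence[Grader]) -> dict:
    # Run the gate and the full evaluation concurrently, as described above.
    on_task, verdict = await asyncio.gather(
        gate_check(content),
        grade_with_fallback(content, graders),
    )
    if not on_task:
        return {"status": "off_task"}
    return verdict


# Stub graders standing in for real providers (hypothetical).
async def failing_primary(content: str) -> dict:
    raise ConnectionError("primary provider unavailable")


async def stub_fallback(content: str) -> dict:
    return {"status": "pass", "xp": 50}


result = asyncio.run(evaluate_submission("my essay", [failing_primary, stub_fallback]))
off_task = asyncio.run(evaluate_submission("   ", [stub_fallback]))
```

Running the gate concurrently means an off-task submission still pays for one grading call in this sketch; that is the cost of not waiting on the gate before starting the evaluation.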
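A minimal sketch of the anti-gaming guard, under stated assumptions: fuzzy matching is done here with the standard library's `difflib.SequenceMatcher`, the 0.8 threshold is arbitrary, and "essentially the same" is approximated as an identical SHA-256 digest (a hash only detects byte-identical files). The function names `fingerprint`, `issues_overlap`, and `guard_status` are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher
from typing import List


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the submitted file's bytes."""
    return hashlib.sha256(data).hexdigest()


def issues_overlap(prior: List[str], new: List[str], threshold: float = 0.8) -> bool:
    """True if any prior unresolved issue fuzzy-matches a newly reported one."""
    for old in prior:
        for cur in new:
            ratio = SequenceMatcher(None, old.lower(), cur.lower()).ratio()
            if ratio >= threshold:  # threshold is an assumed value
                return True
    return False


def guard_status(status: str, prior_fp: str, new_fp: str,
                 prior_issues: List[str], new_issues: List[str]) -> str:
    """Force a pass back to needs_revision on an unchanged, unfixed resubmit."""
    if status != "pass":
        return status
    same_file = prior_fp == new_fp
    if same_file and issues_overlap(prior_issues, new_issues):
        return "needs_revision"
    return status


fp_v1 = fingerprint(b"draft v1")
fp_v2 = fingerprint(b"draft v2")
prior = ["Missing axis labels on chart 2"]
repeat = ["missing axis labels on chart 2."]

blocked = guard_status("pass", fp_v1, fp_v1, prior, repeat)
allowed = guard_status("pass", fp_v1, fp_v2, prior, repeat)
```

Requiring both conditions (same file and matching issues) keeps the guard from punishing a user whose earlier failure was a grading fluke rather than a real unresolved problem.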

My role

Sole author of the WebSocket evaluator, the extraction pipeline, the rubric grading, the anti-gaming guard, the structured feedback shape, the achievement-tagline generation, and the XP allocation that lands on a passing verdict.

Built with

Python, FastAPI, WebSockets, Claude, Gemini, OpenAI fallback, SHA-256 fingerprinting, Fuzzy matching, Multimodal extraction

