Agentic & multimodal systems

Tool-using agents in simulated apps

Simulation runtime that can stand up any software tool and run an agent or a user through it, with specific understanding checkpoints baked into every task.

What it is

A simulation runtime for any software tool. An AI agent (or the user being shadowed by one) completes multi-step tasks inside the real interface, clicking, typing, and submitting the way a person would. Every simulation also carries the specific understanding checkpoints the user is supposed to hit on that tool, so the platform reads not just what they did but whether they learned what the task was meant to teach.
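One way to picture a task that carries its own understanding checkpoints is sketched below. This is a minimal illustration, not the project's actual data model: the names Checkpoint, SimulationTask, and checkpoint_report, the event labels, and the idea of treating observed events as evidence for a checkpoint are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical: one thing the user should understand after the task."""
    name: str
    evidence: frozenset  # observed events assumed to indicate understanding


@dataclass(frozen=True)
class SimulationTask:
    """Hypothetical: a multi-step task bundled with its checkpoints."""
    goal: str
    checkpoints: tuple


def checkpoint_report(task, observed_events):
    """Map each checkpoint name to True if all of its evidence was observed."""
    seen = set(observed_events)
    return {cp.name: cp.evidence <= seen for cp in task.checkpoints}


# Illustrative task and session log (all labels are made up).
task = SimulationTask(
    goal="Issue a refund",
    checkpoints=(
        Checkpoint("looks up the order first", frozenset({"opened_order"})),
        Checkpoint("confirms the amount", frozenset({"opened_order", "edited_amount"})),
    ),
)
report = checkpoint_report(task, ["opened_order", "clicked_refund"])
```

Here the session finishes the refund but never touches the amount, so the report separates "completed the task" from "hit every checkpoint".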

What it's for

Practicing on real software is the highest-fidelity way to learn it. But most AI agents that operate software will confidently click buttons that aren't actually there, and that breaks the moment they hit a real interface. The hard problem is twofold: keeping any operator (agent or person) tied to what genuinely exists on the page, and still knowing what the user was supposed to understand from working through it. The runtime handles both at once, so the simulator runs reliably on real tooling instead of only in a demo, and the platform can tell whether the user actually got the point or only clicked through.

How it was built

The runtime reads the live page's structure to build a verified list of every element that can be interacted with, so it works on any web-based tool without per-tool wiring. On each turn the agent (or the user being watched) runs a short loop: look at the page, plan the next move, verify that the move maps to a real element, act, then check the result. The verify step is a hard gate: no click or keystroke goes through until it has been matched back to a real element on the page, so the agent cannot fire on something it imagined. The full pipeline chains eight of these checks end to end, which keeps the operator grounded even as the page changes underneath them. Each simulation also carries its understanding checkpoints, so the platform watches whether the user actually grasped what a step was teaching, not just whether they completed it. Built on Playwright for live browser control, BeautifulSoup for parsing the page structure, and Gemini for the agent loop.
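The grounding step can be sketched roughly as follows. To keep the example dependency-free it uses Python's standard-library HTMLParser as a stand-in for BeautifulSoup, and it works on a static HTML string rather than a live Playwright page. The names Element, Action, build_inventory, verify_action, and run_turn, the set of tags treated as interactive, and the rule that only elements with an id are addressable are all assumptions made for illustration, not the project's real code.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumption: which tags count as "touchable" for this sketch.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


@dataclass(frozen=True)
class Element:
    element_id: str
    tag: str


@dataclass(frozen=True)
class Action:
    kind: str       # "click" or "type"
    target_id: str
    text: str = ""


class _InventoryParser(HTMLParser):
    """Collects enabled interactive elements that have an id attribute."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        element_id = attr_map.get("id")
        if tag in INTERACTIVE_TAGS and element_id and "disabled" not in attr_map:
            self.elements[element_id] = Element(element_id, tag)


def build_inventory(html):
    """Look: parse the page into a dict of elements that really exist."""
    parser = _InventoryParser()
    parser.feed(html)
    parser.close()
    return parser.elements


def verify_action(action, inventory):
    """Verify: accept the action only if its target exists and fits the kind."""
    element = inventory.get(action.target_id)
    if element is None:
        return False  # the planner referred to something not on the page
    if action.kind == "type":
        return element.tag in {"input", "textarea"}
    return action.kind == "click"


def run_turn(html, planned, act):
    """One turn: look, verify the planned move, then act only if grounded."""
    inventory = build_inventory(html)
    if not verify_action(planned, inventory):
        return "rejected"
    act(planned)  # in a live setup this would drive the browser
    return "executed"


PAGE = """
<form>
  <input id="email" type="text">
  <button id="submit">Send</button>
  <button id="archive" disabled>Archive</button>
</form>
"""

performed = []
first = run_turn(PAGE, Action("click", "submit"), performed.append)
second = run_turn(PAGE, Action("click", "delete-all"), performed.append)
```

In this sketch the second action targets an id that is not on the page, so it is rejected before anything is executed. A live version would presumably re-read the page on every turn (for example from Playwright's page.content()) so the inventory tracks changes to the interface.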

My role

Co-built. Owned the generation pipeline and the inference runner.

Built with

Python, Tool-use / function calling, BeautifulSoup, Playwright, WebSockets, Gemini

