Golden Corpus | Emmanuel Prouvèze

A structured collection of real-world coding tasks used to evaluate how reliably AI agents handle common software engineering work. Each task is atomic, well-defined, and drawn from actual project history.

The Problem

AI coding agents are increasingly capable, but measuring their reliability on everyday tasks is hard. Most benchmarks focus on algorithmic puzzles or isolated functions, and rarely test the messy reality of working in a real codebase: editing MDX frontmatter, fixing broken builds, adding pages to an existing site.

The Approach

Golden Corpus takes a different path:

Real tasks from real projects — every evaluation item comes from actual work done on this website or similar codebases
Atomic scope — each task is small enough to complete in a single agent session
Deterministic verification — success criteria are concrete (build passes, file exists, content matches)
Progressive difficulty — from simple file edits to multi-file feature additions
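
To make "deterministic verification" concrete, here is a minimal sketch of how a task and its success criteria could be expressed. It is not the actual Golden Corpus format: the names Check, Task, Outcome, runCheck, and verify, the difficulty scale, and the example file path are all assumptions made for illustration. The outcome is modelled as an in-memory snapshot so the checks stay pure and repeatable.

```typescript
// Hypothetical shapes: none of these names come from the Golden Corpus source.
type Check =
  | { kind: "fileExists"; path: string }
  | { kind: "contentMatches"; path: string; pattern: RegExp }
  | { kind: "buildPasses" };

interface Task {
  id: string;
  difficulty: 1 | 2 | 3; // assumed scale: 1 = simple file edit, 3 = multi-file feature
  prompt: string;
  checks: Check[];
}

// Snapshot of the workspace after the agent session ends.
interface Outcome {
  files: Map<string, string>; // path -> file content
  buildExitCode: number; // 0 means the build succeeded
}

// Evaluate a single criterion against the snapshot.
// Patterns should not use the `g` flag, since RegExp.test is stateful with it.
function runCheck(check: Check, outcome: Outcome): boolean {
  switch (check.kind) {
    case "fileExists":
      return outcome.files.has(check.path);
    case "contentMatches": {
      const content = outcome.files.get(check.path);
      return content !== undefined && check.pattern.test(content);
    }
    case "buildPasses":
      return outcome.buildExitCode === 0;
  }
}

// A task passes only if every check passes; failed checks are reported by kind.
function verify(task: Task, outcome: Outcome): { passed: boolean; failed: string[] } {
  const failed = task.checks
    .filter((check) => !runCheck(check, outcome))
    .map((check) => check.kind);
  return { passed: failed.length === 0, failed };
}

// Illustrative task; the path and prompt are invented for this example.
const exampleTask: Task = {
  id: "add-blog-post",
  difficulty: 1,
  prompt: "Add a blog post titled 'Hello' under content/blog.",
  checks: [
    { kind: "fileExists", path: "content/blog/hello.mdx" },
    { kind: "contentMatches", path: "content/blog/hello.mdx", pattern: /^title:\s*"?Hello"?\s*$/m },
    { kind: "buildPasses" },
  ],
};
```

Because every criterion reduces to a boolean over the final state of the workspace, two runs that leave the same files and the same build result always receive the same verdict, whatever route the agent took to get there.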

What It Tests

Content Operations

Creating and editing MDX files with correct frontmatter
Fixing common content issues (date formats, tag casing, missing fields)
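
The content issues listed above (date formats, tag casing, missing fields) lend themselves to a mechanical check. The sketch below shows one way such a check might look. It assumes conventions the source does not state: that title, date, and tags are the required fields, that dates are unquoted YYYY-MM-DD values, and that tags are lowercase. The function names parseFrontmatter and lintFrontmatter are hypothetical, and the parser handles only flat key: value lines rather than full YAML.

```typescript
// Assumed conventions (hypothetical, not taken from the project):
// required fields are title, date, tags; dates are YYYY-MM-DD; tags are lowercase.
const REQUIRED_FIELDS = ["title", "date", "tags"];

// Extract flat `key: value` pairs from the leading `---` block.
// This is deliberately not a full YAML parser.
function parseFrontmatter(source: string): Map<string, string> | null {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(source);
  if (!match) return null;
  const fields = new Map<string, string>();
  for (const line of match[1].split(/\r?\n/)) {
    const sep = line.indexOf(":");
    if (sep === -1) continue; // skip lines that are not key: value pairs
    fields.set(line.slice(0, sep).trim(), line.slice(sep + 1).trim());
  }
  return fields;
}

// Return a list of human-readable issues; an empty list means the file is clean.
function lintFrontmatter(source: string): string[] {
  const fields = parseFrontmatter(source);
  if (fields === null) return ["missing frontmatter block"];
  const issues: string[] = [];
  for (const name of REQUIRED_FIELDS) {
    if (!fields.get(name)) issues.push(`missing field: ${name}`);
  }
  const date = fields.get("date");
  if (date && !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    issues.push(`bad date format: ${date}`);
  }
  const tags = fields.get("tags");
  if (tags && tags !== tags.toLowerCase()) {
    issues.push(`tags not lowercase: ${tags}`);
  }
  return issues;
}
```

A task of this kind could hand the agent a file with a known set of issues and count it as solved once the linter returns an empty list for the edited file.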

Code Changes

Adding new pages to an existing Next.js app
Modifying components without breaking existing functionality
Following established patterns and conventions

Project Hygiene

Commit message quality
Scope discipline (not over-engineering)
Respecting existing code style

Why This Matters

If you're building with AI agents, you need to know what they get right and where they fail. Synthetic benchmarks don't tell you whether an agent can add a blog post to your site without breaking the build. Golden Corpus does.

Built With AI

The evaluation framework itself was built with Claude Code: testing the tool with the tool, and iterating on task definitions until they reliably distinguish good agent behavior from bad.