Status: Active

Golden Corpus

A curated evaluation dataset for testing AI agent reliability on real-world coding tasks

TypeScript, Claude Code, YAML, MDX

A structured collection of real-world coding tasks used to evaluate how reliably AI agents handle common software engineering work. Each task is atomic, well-defined, and drawn from actual project history.

The Problem

AI coding agents are increasingly capable, but measuring their reliability on everyday tasks is hard. Most benchmarks focus on algorithmic puzzles or isolated functions, and they rarely test the messy reality of working in a real codebase: editing MDX frontmatter, fixing broken builds, adding pages to an existing site.

The Approach

Golden Corpus takes a different path:

  • Real tasks from real projects — every evaluation item comes from actual work done on this website or similar codebases
  • Atomic scope — each task is small enough to complete in a single agent session
  • Deterministic verification — success criteria are concrete (build passes, file exists, content matches)
  • Progressive difficulty — from simple file edits to multi-file feature additions
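
To make "deterministic verification" concrete, here is a minimal sketch of how a task and its success criteria could be modeled. It is an illustration under stated assumptions, not the corpus's actual schema: the `Task`, `Check`, and `RunResult` types, the `verify` function, and the example file path are all hypothetical names introduced here.

```typescript
// Hypothetical shapes for illustration; none of these names come from Golden Corpus itself.
type Check =
  | { kind: "buildPasses" }
  | { kind: "fileExists"; path: string }
  | { kind: "contentMatches"; path: string; pattern: RegExp };

interface Task {
  id: string;
  difficulty: 1 | 2 | 3; // assumed scale: 1 = simple file edit, 3 = multi-file feature
  prompt: string;
  checks: Check[];
}

// Snapshot of the workspace after the agent has finished its session.
interface RunResult {
  buildExitCode: number;
  files: Map<string, string>; // path -> file contents
}

// Evaluate one check. Each branch is a pure yes/no test, so repeated runs agree.
function runCheck(check: Check, result: RunResult): boolean {
  switch (check.kind) {
    case "buildPasses":
      return result.buildExitCode === 0;
    case "fileExists":
      return result.files.has(check.path);
    case "contentMatches": {
      const content = result.files.get(check.path);
      return content !== undefined && check.pattern.test(content);
    }
  }
}

// A task passes only if every check passes; failed checks are returned for reporting.
function verify(task: Task, result: RunResult): { passed: boolean; failed: Check[] } {
  const failed = task.checks.filter((check) => !runCheck(check, result));
  return { passed: failed.length === 0, failed };
}

// Usage: an assumed "add a blog post" task and a workspace snapshot that satisfies it.
const task: Task = {
  id: "add-blog-post",
  difficulty: 1,
  prompt: "Add a blog post titled 'Hello' under content/blog/.",
  checks: [
    { kind: "buildPasses" },
    { kind: "fileExists", path: "content/blog/hello.mdx" },
    { kind: "contentMatches", path: "content/blog/hello.mdx", pattern: /^title:\s*"?Hello"?\s*$/m },
  ],
};

const result: RunResult = {
  buildExitCode: 0,
  files: new Map([
    ["content/blog/hello.mdx", "---\ntitle: Hello\ndate: 2024-03-03\n---\n\nHi.\n"],
  ]),
};

const outcome = verify(task, result);
console.log(outcome.passed, outcome.failed.length);
```

Modeling each criterion as a tagged variant keeps the checks declarative, so a task definition could plausibly live in a YAML file and be loaded into this shape.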

What It Tests

Content Operations

  • Creating and editing MDX files with correct frontmatter
  • Fixing common content issues (date formats, tag casing, missing fields)
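
The content issues above lend themselves to a small linter. The sketch below assumes the frontmatter has already been parsed into a plain object, that `title`, `date`, and `tags` are the required fields, that dates should be `YYYY-MM-DD`, and that tags should be lowercase. Those conventions and the `lintFrontmatter` name are assumptions for illustration; the site's real rules may differ.

```typescript
// Parsed frontmatter; parsing the YAML itself is out of scope for this sketch.
type Frontmatter = Record<string, unknown>;

// Assumed conventions, not confirmed by the project.
const REQUIRED_FIELDS = ["title", "date", "tags"] as const;
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Returns a list of human-readable issues; an empty list means the frontmatter is clean.
function lintFrontmatter(fm: Frontmatter): string[] {
  const issues: string[] = [];

  // Missing fields: absent, null, or empty string all count as missing.
  for (const field of REQUIRED_FIELDS) {
    const value = fm[field];
    if (value === undefined || value === null || value === "") {
      issues.push(`missing field: ${field}`);
    }
  }

  // Date format: only the shape is checked here, not calendar validity.
  const date = fm["date"];
  if (typeof date === "string" && !ISO_DATE.test(date)) {
    issues.push(`date not in YYYY-MM-DD format: ${date}`);
  }

  // Tag casing: flag any tag that contains uppercase characters.
  const tags = fm["tags"];
  if (Array.isArray(tags)) {
    for (const tag of tags) {
      if (typeof tag === "string" && tag !== tag.toLowerCase()) {
        issues.push(`tag not lowercase: ${tag}`);
      }
    }
  }

  return issues;
}

// Usage: one badly formatted date and one miscased tag.
const issues = lintFrontmatter({
  title: "Hello",
  date: "March 3, 2024",
  tags: ["TypeScript", "mdx"],
});
console.log(issues);
```

A check like this can serve both as the seed for a task (introduce one of these defects) and as its verifier (the linter reports nothing once the agent is done).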

Code Changes

  • Adding new pages to an existing Next.js app
  • Modifying components without breaking existing functionality
  • Following established patterns and conventions

Project Hygiene

  • Commit message quality
  • Scope discipline (not over-engineering)
  • Respecting existing code style
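
Parts of project hygiene can be checked mechanically. The following is a rough sketch, assuming the list of changed files is available (for example from `git diff --name-only`) and that each task declares which path prefixes the agent is allowed to touch. The `checkHygiene` function, the `HygieneReport` shape, and the 72-character subject limit are assumptions for illustration; commit message quality in particular needs more than a length check.

```typescript
// Hypothetical report shape for illustration.
interface HygieneReport {
  outOfScope: string[]; // changed files outside the task's allowed paths
  subjectOk: boolean;   // commit subject is non-empty and within the length limit
}

function checkHygiene(
  changedFiles: string[],
  allowedPrefixes: string[],
  commitMessage: string,
  maxSubjectLength = 72, // assumed convention, not a rule from the project
): HygieneReport {
  // Scope discipline: every changed file must sit under an allowed prefix.
  const outOfScope = changedFiles.filter(
    (file) => !allowedPrefixes.some((prefix) => file.startsWith(prefix)),
  );

  // Commit message: only the first line (the subject) is inspected here.
  const subject = (commitMessage.split("\n")[0] ?? "").trim();
  const subjectOk = subject.length > 0 && subject.length <= maxSubjectLength;

  return { outOfScope, subjectOk };
}

// Usage: the agent was asked to add a post but also edited package.json.
const report = checkHygiene(
  ["content/blog/hello.mdx", "package.json"],
  ["content/blog/"],
  "Add hello post\n\nCreates the first blog entry.",
);
console.log(report);
```

Flagging `package.json` here does not prove the change was wrong; it only surfaces edits outside the expected scope for closer review.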

Why This Matters

If you're building with AI agents, you need to know what they get right and where they fail. Synthetic benchmarks don't tell you whether an agent can add a blog post to your site without breaking the build. Golden Corpus does.

Built With AI

The evaluation framework itself was built with Claude Code: the tool was used to test the tool, and the task definitions were iterated on until they reliably distinguished good agent behavior from bad.