Benchmark Comparison Template

Harness A:

  • completion rate
  • average retries
  • bugs caught before human review

Harness B:

  • completion rate
  • average retries
  • bugs caught before human review

Interpretation:

  • Which harness changed the result?
  • Which harness changed the cost of getting the result?
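One way to fill in the template is to compute the three metrics from per-task run records and then subtract Harness A from Harness B. The sketch below is illustrative only: the `Run` record, its field names, and the `summarize` and `compare` helpers are hypothetical, and it assumes completion rate and bugs caught describe the result while average retries stands in for the cost of getting it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """One task attempt under a harness (hypothetical record layout)."""
    completed: bool   # did the task finish successfully?
    retries: int      # retries used on this task
    bugs_caught: int  # bugs caught before human review


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Compute the three template metrics for one harness."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "average_retries": sum(r.retries for r in runs) / n,
        "bugs_caught_before_review": float(sum(r.bugs_caught for r in runs)),
    }


def compare(a: List[Run], b: List[Run]) -> Dict[str, float]:
    """Per-metric difference, Harness B minus Harness A."""
    summary_a, summary_b = summarize(a), summarize(b)
    return {name: summary_b[name] - summary_a[name] for name in summary_a}


if __name__ == "__main__":
    # Made-up records for illustration only, not measured results.
    harness_a = [Run(True, 1, 2), Run(False, 3, 0)]
    harness_b = [Run(True, 0, 1), Run(True, 2, 1)]
    for name, delta in compare(harness_a, harness_b).items():
        print(f"{name}: {delta:+.2f}")
```

Read this way, a nonzero difference in completion rate or bugs caught suggests the harness changed the result, while a difference only in average retries suggests it changed the cost of reaching the same result. Both harnesses should be run on the same task set for the differences to be comparable.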