Benchmark Comparison Template
Harness A:
- completion rate
- average retries
- bugs caught before human review
Harness B:
- completion rate
- average retries
- bugs caught before human review
Interpretation:
- Which harness changed the result?
- Which harness changed the cost of getting the result?
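
One way to fill in the template above, as a minimal sketch. It assumes each harness logs one record per task with a completion flag, a retry count, and a count of bugs caught before human review. The names `TaskRun`, `summarize`, and `render` are hypothetical and not part of any harness. Reporting "bugs caught before human review" as a total across tasks is also an assumption, and the sample data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One task attempted under a harness (hypothetical record shape)."""
    completed: bool   # did the task finish successfully?
    retries: int      # retries needed for this task
    bugs_caught: int  # bugs caught before human review


def summarize(runs):
    """Compute the three template metrics for one harness."""
    if not runs:
        raise ValueError("need at least one task run")
    total = len(runs)
    return {
        "completion rate": sum(r.completed for r in runs) / total,
        "average retries": sum(r.retries for r in runs) / total,
        # Assumption: reported as a total across all tasks.
        "bugs caught before human review": sum(r.bugs_caught for r in runs),
    }


def render(name, summary):
    """Format one harness block in the template's layout."""
    lines = [f"{name}:"]
    for metric, value in summary.items():
        shown = f"{value:.2f}" if isinstance(value, float) else str(value)
        lines.append(f"- {metric}: {shown}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    harness_a = [TaskRun(True, 0, 1), TaskRun(True, 2, 0),
                 TaskRun(False, 3, 0), TaskRun(True, 1, 1)]
    harness_b = [TaskRun(True, 1, 0), TaskRun(True, 1, 2),
                 TaskRun(True, 4, 1), TaskRun(False, 5, 0)]
    print(render("Harness A", summarize(harness_a)))
    print(render("Harness B", summarize(harness_b)))
```

When answering the interpretation questions, one possible reading is that completion rate and bugs caught describe the result, while average retries describes the cost of getting it. The template itself does not fix that mapping, and both harnesses should be run on the same task set for the comparison to mean anything.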