
Code examples: code/ · Practice project: Project 04 (runtime feedback and scope control)

Lecture 07. Draw Clear Task Boundaries for Agents

What Problem Does This Lecture Solve?

You tell Claude Code to "add user authentication to this project," and it starts modifying the database schema, writing routes, changing frontend components, and — while it's at it — refactoring the error-handling middleware. Two hours later you check: 12 files modified, 800 lines of new code, and not a single feature works end-to-end. This is the most classic failure mode of AI coding agents: overreach and under-finish happening simultaneously.

Anthropic's "Effective harnesses for long-running agents" engineering blog post states clearly: when prompts are too broad, agents tend to "start multiple things at once" rather than "finish one thing first." OpenAI's Codex engineering practices found the same — tasks without explicit scope controls see completion rates plummet. This is not a model problem — it's a harness problem. You didn't draw the boundary.

Core Concepts

  • Overreach: The agent activates more tasks in a single session than it can carry to completion. It's quantifiable — starting 5 features with 0 passing end-to-end is overreach.
  • Under-finish: The fraction of activated tasks that pass end-to-end verification falls below a threshold. Code written but tests not passing is under-finish.
  • WIP Limit (Work-in-Progress Limit): From Kanban methodology. Core idea: limit how many tasks are in-flight at once. For agents, WIP=1 is the safest default — finish one before starting the next.
  • Completion Evidence: The verifiable condition a task must satisfy to move from "in progress" to "done." Without this, agents substitute "the code looks fine" for "the behavior passes tests."
  • Scope Surface: A DAG structure where each node is a work unit and edges are dependencies. States are limited to four: not_started, active, blocked, passing.
  • Completion Pressure: The constraining force the harness exerts through WIP limits and completion evidence requirements, forcing the agent to finish the current task before starting a new one.
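The scope surface described above can be sketched as a small data structure. A minimal Python sketch, where the task IDs, behaviors, and the `WorkUnit`/`can_activate` names are all illustrative assumptions, not part of any tool:

```python
from dataclasses import dataclass, field

STATES = {"not_started", "active", "blocked", "passing"}

@dataclass
class WorkUnit:
    id: str
    behavior: str                               # single-sentence behavior description
    verify: str                                 # executable verification command
    state: str = "not_started"
    deps: list = field(default_factory=list)    # ids of units this one depends on

def can_activate(unit, units):
    """A unit may go active only if every dependency is already passing."""
    return unit.state == "not_started" and all(
        units[d].state == "passing" for d in unit.deps
    )

units = {
    "F01": WorkUnit("F01", "user can register", "pytest tests/test_register.py"),
    "F02": WorkUnit("F02", "user can log in", "pytest tests/test_login.py",
                    deps=["F01"]),
}
print(can_activate(units["F02"], units))  # F01 not passing yet -> False
```

The DAG edges (`deps`) are what let the harness compute which single task is eligible next, instead of leaving that judgment to the model.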

Why This Happens

Agents Are Born Wanting to "Do a Little Extra"

LLM training data is full of code patterns where "one PR changes multiple things." A typical GitHub PR might include a feature implementation, a refactoring, and a documentation update all at once. The default behavior the model learns from this distribution is "do all the related things you see."

This is a problem in human projects too — Steve McConnell documented in Rapid Development that scope creep is the leading cause of project failure. But humans at least have the intuition of "I've done enough." Agents have none. Generating the next idea costs the model almost nothing in extra tokens — writing "let me fix this too while I'm here" barely registers — but every additional modification dilutes the agent's attention.

Claude Code's real behavior is telling. Ask it to "add user registration" and it might:

  1. Create a User model
  2. Write the registration route
  3. Notice it needs email verification, so add a mail service
  4. See that passwords need hashing, so bring in bcrypt
  5. Notice the error handling is inconsistent, so refactor the global error middleware
  6. See the test file structure is messy, so reorganize the directory

Six steps later, every one is half-done. No end-to-end verification, complex coupling between the half-baked code, and the next session to pick up the pieces will be completely lost.

The Math of Attention Dilution

This isn't a metaphor — it's math. Assume the agent's context capacity is C and it activates k tasks simultaneously. Each task gets an average of C/k reasoning resources. When C/k drops below the minimum threshold needed to complete a single task, none of them get finished.
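A toy calculation makes the threshold effect concrete. The numbers here (`C`, `T_MIN`) are purely illustrative assumptions, not measured values:

```python
C = 100_000      # assumed context budget available for reasoning (tokens)
T_MIN = 30_000   # assumed minimum budget needed to finish a single task

for k in range(1, 6):
    per_task = C // k                 # each of k active tasks gets C/k
    ok = per_task >= T_MIN
    print(f"k={k}: {per_task:>7} tokens/task -> "
          f"{'can finish' if ok else 'none finish'}")
```

Note the cliff: the budget degrades gracefully until `C/k` crosses `T_MIN`, at which point completion doesn't degrade — it collapses to zero for every task at once.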

Anthropic's experimental data directly supports this: agents using a "small next step" strategy (equivalent to WIP=1) show a 37% higher task completion rate than agents using broad prompts. More interestingly, the number of lines of code generated by agents is weakly negatively correlated with actual feature completion — more code written, fewer features completed.

Overreach and Under-finish Are Symbiotic

These two problems aren't independent — they amplify each other. Overreach dilutes attention, diluted attention causes under-finish, and the half-finished code left behind increases system complexity, which further drives overreach in the next task.

In Kanban terms: Little's Law tells us L = lambda * W. If work-in-progress L is too high (doing too many things at once), the lead time W for each task inevitably increases. For agents, this means each feature takes longer from start to verified completion, and the probability of failure grows.
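Little's Law can be checked with a one-line rearrangement. Assuming a hypothetical throughput of 2 verified tasks per session:

```python
throughput = 2.0   # assumed lambda: tasks verified per session

for wip in (1, 3, 5):
    lead_time = wip / throughput   # Little's Law rearranged: W = L / lambda
    print(f"WIP={wip}: average lead time {lead_time:.1f} sessions")
```

Holding throughput fixed, every extra in-flight task adds half a session of lead time — and for an agent, longer lead time means more sessions in which a half-done task can be abandoned.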

How to Do It Right

1. Enforce WIP=1

This is the most direct and effective method. In your harness, tell the agent explicitly: only one task is allowed in "active" status at any time. In Claude Code's CLAUDE.md or Codex's AGENTS.md, write:

## Work Rules
- Work on one feature at a time
- Only start the next feature after the current one passes end-to-end verification
- Don't "also refactor" feature B while implementing feature A
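A harness can enforce this rule mechanically rather than relying on the prompt alone. A minimal sketch, assuming tasks are stored in a dict shaped like the scope surface (the `active_tasks` helper and field names are hypothetical):

```python
def active_tasks(scope, limit=1):
    """Gate check run before the agent starts work: enforce the WIP limit."""
    active = [t["id"] for t in scope["tasks"] if t["state"] == "active"]
    if len(active) > limit:
        raise RuntimeError(f"WIP limit exceeded: {active}")
    return active

scope = {"tasks": [
    {"id": "F01", "state": "passing"},
    {"id": "F02", "state": "active"},
    {"id": "F03", "state": "not_started"},
]}
print(active_tasks(scope))  # ['F02']
```

Running a check like this at the start of every agent turn turns "please work on one thing" from a request into an invariant.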

2. Define Explicit Completion Evidence for Every Task

Done is not "code is written" — it's "behavior verification passes." In your feature list, every entry needs a verification command:

F01: User Registration
  Verification: curl -s -X POST http://localhost:3000/api/register -H 'Content-Type: application/json' -d '{"email":"test@example.com","password":"123456"}' | jq -e '.status == 201'
  State: passing
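Completion evidence only counts if the harness actually executes it. A sketch of a verifier that shells out and trusts nothing but the exit code (the `verify` function and task shape are hypothetical; `true` stands in for a real command like the curl check above):

```python
import subprocess

def verify(task):
    """Run the task's verification command; only exit code 0 moves it to passing."""
    result = subprocess.run(task["verify"], shell=True)
    return "passing" if result.returncode == 0 else "active"

task = {"id": "F01", "verify": "true"}   # 'true' always exits 0 (POSIX)
print(verify(task))  # passing
```

Keeping the state transition tied to an exit code is the point: the agent cannot argue its way to "done" — it can only make the command pass.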

3. Externalize the Scope Surface

Use a machine-readable file (JSON or Markdown) to record all task states. Any new session can read this file and immediately know: which task is active? What behavior counts as done? What verifications have passed?
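A sketch of what such a file, and a fresh session's first read of it, might look like. The file shape and field names are assumptions for illustration, not a standard format:

```python
import json, os, tempfile

# Hypothetical scope-surface file: every session reads this before doing anything.
scope = {"tasks": [
    {"id": "F01", "behavior": "user can register",
     "verify": "pytest tests/test_register.py", "state": "passing", "deps": []},
    {"id": "F02", "behavior": "user can log in",
     "verify": "pytest tests/test_login.py", "state": "active", "deps": ["F01"]},
]}

path = os.path.join(tempfile.mkdtemp(), "scope.json")
with open(path, "w") as f:
    json.dump(scope, f, indent=2)

# A fresh session orients itself from the file alone:
loaded = json.load(open(path))
active = next(t for t in loaded["tasks"] if t["state"] == "active")
print(active["id"], "->", active["verify"])
```

Because the state lives in the repo rather than in the conversation, a crashed or restarted session loses no scope information.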

4. Monitor Verified Completion Rate

The harness should continuously track VCR (Verified Completion Rate) = verified tasks / activated tasks. Block new task activations when VCR < 1.0.
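VCR is a one-line computation over the scope file. A sketch, assuming the four-state task shape from earlier in the lecture (`vcr` is a hypothetical helper):

```python
def vcr(tasks):
    """Verified Completion Rate: passing tasks / all activated tasks."""
    activated = [t for t in tasks if t["state"] != "not_started"]
    passing = [t for t in activated if t["state"] == "passing"]
    return len(passing) / len(activated) if activated else 1.0

tasks = [{"state": "passing"}, {"state": "active"}, {"state": "not_started"}]
rate = vcr(tasks)
print(rate, "-> may activate" if rate >= 1.0 else "-> finish current work first")
```

Under WIP=1 this gate is equivalent to "the one active task must pass before anything new starts," but computing VCR explicitly also catches blocked tasks that would otherwise accumulate silently.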

Real-World Case

A REST API project with 8 features, two strategies compared:

Unconstrained mode: Agent activates 5 features simultaneously in session 1. Produces ~800 lines across 12 files. End-to-end test pass rate: 20% — only user registration works. The other 4 features: database schema created but missing validation logic, routes defined but returning wrong response formats. By end of session 3, only 3 of 8 features complete.

WIP=1 mode: Agent works on user registration only in session 1. Produces ~200 lines across 4 files. End-to-end tests: 100% passing. Commits a clean, verified implementation. By end of session 4, 7 of 8 features complete (the 8th blocked by an external dependency).

Result: the WIP=1 run produced less total code (~800 vs ~1,200 lines) but far more of it worked. Completion rate: 87.5% vs 37.5%.

Key Takeaways

  • WIP=1 is the default safe setting for agent harnesses — finish one, then start the next; don't try to parallelize.
  • Completion evidence must be executable — "the code looks fine" doesn't count; "curl returns 201" does.
  • The scope surface must be externalized as a file — not just mentioned in conversation, but recorded in a machine-readable format in the repo.
  • Overreach and under-finish are symbiotic — solving one solves the other.
  • "Do less but finish" always beats "do more but leave half-done" — agent code lines and feature completion rate are negatively correlated.

Exercises

  1. Task Atomization: Pick a broad requirement (e.g., "implement a user management system") and break it into at least 5 atomic work units. For each unit, specify: (a) a single behavior description, (b) an executable verification command, (c) dependencies. Check whether the decomposition satisfies the WIP=1 constraint.

  2. Comparison Experiment: Run the same project twice — once without constraints, once with enforced WIP=1. Compare: verified completion rate, total lines of code, effective code ratio.

  3. Completion Evidence Audit: Review a recent agent run's output, classifying each code change as "completed behavior," "incomplete behavior," or "scaffolding." Add missing verification commands for each incomplete behavior.