SOP: Observability Feedback Loop

Use this SOP when debugging is slow, agents keep claiming success without evidence, or runtime behavior is harder to inspect than the code itself.

Goal

Give the agent a local feedback loop over logs, metrics, traces, and runnable workloads so it can reason from execution, not only from code inspection.

Minimum Stack

application emits structured logs
application emits metrics and traces when feasible
local fan-out or collection layer that receives those signals
query interfaces for logs, metrics, and traces
repeatable workload or user journey to rerun after each change
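
As a minimal sketch of the first and fourth items above, the following uses only the Python standard library to emit one JSON object per log line and then query those lines locally. The logger name `app.demo`, the `fields` key passed through `extra`, and the `journey` and `step` field names are illustrative assumptions, not part of this SOP. An in-memory buffer stands in for whatever file or collector the local stack actually writes to.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}} (assumed convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("app.demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def query(lines, **match):
    """Return parsed log entries whose fields equal every key in `match`."""
    entries = [json.loads(line) for line in lines if line.strip()]
    return [e for e in entries if all(e.get(k) == v for k, v in match.items())]


buffer = io.StringIO()  # stands in for a local log file or collector
log = make_logger(buffer)
log.info("startup", extra={"fields": {"journey": "checkout", "step": "boot"}})
log.error("payment_failed", extra={"fields": {"journey": "checkout", "step": "pay"}})

# Query the signal instead of reading the code: which step failed?
failures = query(buffer.getvalue().splitlines(), level="ERROR", journey="checkout")
```

Keeping every entry on one line with stable keys is what makes the logs filterable by simple tools, which is the property the query interface depends on.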

Execution SOP

Define the golden runtime journeys that matter most.
Add structured logs to startup and the critical path.
Add metrics for latency, failure counts, or queue depth where useful.
Add traces or timing markers for slow or multi-step flows.
Make the signals queryable from the local dev environment.
Give the agent one repeatable workload or scenario to rerun.
Require the loop: query -> correlate -> reason -> implement -> restart -> rerun -> verify.
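
The rerun and verify half of the loop can be sketched as follows, assuming an in-process handler and a fixed list of inputs as the repeatable workload. The names `timed`, `run_workload`, `buggy_handler`, `fixed_handler`, and `WORKLOAD` are hypothetical; a real setup would restart the application and drive it through its actual entry point.

```python
import time
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # success and failure counts per step
timings = {}         # step name -> list of durations in seconds


@contextmanager
def timed(step):
    """Record a duration and a success or failure count for one step."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        metrics[f"{step}.failure"] += 1
        raise
    else:
        metrics[f"{step}.success"] += 1
    finally:
        timings.setdefault(step, []).append(time.perf_counter() - start)


def run_workload(handler, requests):
    """Replay the same requests through `handler` and return the evidence."""
    metrics.clear()
    timings.clear()
    for request in requests:
        try:
            with timed("handle"):
                handler(request)
        except ValueError:
            pass  # already counted as a failure by `timed`
    return {
        "failures": metrics["handle.failure"],
        "successes": metrics["handle.success"],
        "samples": len(timings.get("handle", [])),
    }


def buggy_handler(quantity):
    if quantity < 0:
        raise ValueError("negative quantity")
    return quantity * 2


def fixed_handler(quantity):
    return abs(quantity) * 2


WORKLOAD = [1, -2, 3]  # the same scenario is replayed before and after the fix

before = run_workload(buggy_handler, WORKLOAD)
after = run_workload(fixed_handler, WORKLOAD)
```

Comparing `before` and `after` on an identical workload gives the agent a concrete signal to cite, rather than a claim that the change "should" work.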

Debug Session Checklist

What failed?
Which signal proves the failure?
Which layer owns the failure?
What changed after the fix?
Did the app restart cleanly?
Did the same workload pass after rerun?
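
One way to make the checklist enforceable is to record the answers in a small structure and refuse to call the session done while any answer is missing. This is only a sketch; the class `DebugSessionRecord` and its field names are hypothetical mappings of the six questions above, and the sample answers are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class DebugSessionRecord:
    what_failed: str = ""
    proving_signal: str = ""
    owning_layer: str = ""
    change_after_fix: str = ""
    restarted_cleanly: bool = False
    workload_passed_on_rerun: bool = False

    def unanswered(self):
        """Names of checklist items that are still empty or False."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_done(self):
        return not self.unanswered()


record = DebugSessionRecord(
    what_failed="checkout returns an error for negative quantity",
    proving_signal="ERROR log entry with event=payment_failed",
    owning_layer="order validation",
)
missing = record.unanswered()

# After the fix, restart, and rerun, the remaining answers are filled in.
record.change_after_fix = "failure count dropped from 1 to 0 on the same workload"
record.restarted_cleanly = True
record.workload_passed_on_rerun = True
```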

Definition Of Done

The agent can explain a failure mode from runtime evidence.
The same workload can be rerun after each change.
Restart and rerun are part of the normal task loop.
Reliability signals are documented in docs/RELIABILITY.md.
