Partial Evidence Is Not Proof

decision-making evidence instrumentation measurement performance Jun 30, 2026

After two structural fixes had landed, the honest answer to "is this fast enough now?" was still: we don't actually know. The fixes looked right. Nothing obviously misbehaved. But "it feels better" is a feeling, not a finding. The next piece of work wasn't another fix — it was figuring out whether a fix was even warranted, and what it should target.

That reframing is the whole lesson. The most useful thing you can do after a round of fixes is sometimes to build nothing and measure instead. And a measurement task has its own discipline, separate from the discipline of building things.

The problem

Once you've changed a system a couple of times, you accumulate a comfortable but unearned confidence. The code reads better. The obvious bottleneck is gone. So the temptation is to either declare victory or jump straight into the next optimization — adding an index, widening a cache, tightening a query — on the strength of suspicion.

The problem is that suspicion and relief are both narratives, not evidence. You can't tell whether a remaining problem exists, how large it is, or where it lives, until you watch the system actually run. Optimizing on a hunch is how you spend real effort hardening a path that was never the bottleneck.

What actually happened

We treated the confidence question as a measurement task with an explicit contract: count the relevant operations, time them, attribute each one to a cause, watch for anything writing where it shouldn't, and produce a recommendation — without changing any behavior.

The results were genuinely useful and genuinely incomplete. The frequently triggered path turned out to be bounded in every scenario we measured: predictable, single operations per action, no surprise cascades. That narrowed our uncertainty in a real way.

But several scenarios were never exercised, and the deeper cost analysis — the evidence you'd actually need to justify a database index — was never collected. So the measurement could support one conclusion ("the common path looks bounded in these cases") and could not support another ("the system is fully settled" or "we should add an index"). We wrote down both halves.

The quietly important outcome was that the temporary measurement code came out cleanly. We confirmed the system's behavior matched its pre-measurement baseline exactly: the timing-sensitive constants unchanged, the core logic unchanged, and no trace of the probe left in the running code.

The lesson

A measurement task can succeed even when its evidence is partial — but only if two things hold. The gaps have to be explicit, and the conclusions can never be upgraded past what was actually observed.

The failure mode is subtle. Nobody decides to lie. What happens is that "partial evidence" quietly becomes "complete proof" as a result travels through summaries, status updates, and someone else's memory of the conversation. Each retelling rounds up. By the time a decision is made, "the common case looked fine in the cases we checked" has become "the system is fine," and the missing scenarios have vanished from the story.

The fix is to make the limits a first-class part of the deliverable, written in the same place as the findings, in language strong enough to survive being summarized: this is what we measured, this is what we did not, and this is the specific conclusion the evidence is and is not allowed to support.

The broader principle

Three ideas generalize well beyond any one investigation.

Confidence is a measurement task, not an implementation task. "Is it good enough now?" deserves its own deliberate work with its own contract, separate from the work of building the fix. Deciding up front whether partial measurement is acceptable — and what it would and wouldn't let you conclude — is part of scoping it.

Expensive decisions need their own evidence. Adding an index, restructuring storage, or changing a hot path is costly and hard to reverse. Don't make that call from static suspicion or from a single convenient timing sample. Collect the evidence that actually speaks to the cost, or explicitly defer the decision into its own piece of work. Measuring one layer (how often something runs) is not the same as measuring another (what each run truly costs).

Temporary instrumentation is a boundary risk you have to retire on purpose. Code you add to observe a system can touch the very surfaces you care most about. Treat it as temporary by construction: off by default, behavior-neutral, named so you can find every instance, removed in a deliberate order, and — this is the part people skip — proven gone by comparing the system against its pre-measurement baseline. "I think I took it all out" is not removal.

There's a fourth, quieter principle about environments. Evidence is only as good as where you gathered it. If the approved place to measure is one environment but the convenient place is another, the mismatch doesn't void the result — but it does downgrade it, and you have to say so. Confirm the environment, the deployed instrumentation, the access, and the redaction rules before you start, or you'll discover afterward that your "production-grade" evidence was really a local smoke test.

How to apply it

  • After a round of fixes, ask whether the next item should be a fix at all. If you can't point to evidence of a remaining problem, the next item is measurement.
  • Give the measurement its own contract: what you'll measure, how you'll attribute each result, which environment is authoritative, and — explicitly — what conclusions the results may and may not support.
  • Decide before you start whether partial evidence is acceptable, and write down what partial results would and wouldn't let you conclude.
  • Keep the limits next to the findings, phrased strongly enough to survive summarization. Assume the caveat will be dropped unless it's hard to drop.
  • Don't authorize costly, hard-to-reverse changes from suspicion or a single timing sample. Require evidence aimed at the actual cost, or defer the decision.
  • Make removable instrumentation removable by construction, and prove it's gone with a baseline comparison and a search for its name — don't trust your memory of cleaning up.
  • Confirm the measurement environment, deployed probes, access, and redaction rules before the run, not after.

None of this proves a system is perfect. It does something more durable: it stops premature fixes, keeps boundaries intact, preserves the risks you haven't retired, and forces honest conclusions. Evidence-led work is slower to claim victory and much harder to mislead.