Put the Heartbeat at the Seam You Own
Jun 30, 2026
A long-running automated workflow that goes silent looks identical to one that has hung. The fix sounds trivial — "show more progress" — but the interesting decisions are about where the liveness signal lives, how honest it is, and whether your test actually exercises the path a real user hits.
The problem
A workflow spawned worker subprocesses to do its heavy lifting. Some of those workers finished in seconds. Others could run for a long time producing no output at all. From the outside, a slow-but-healthy worker and a dead one were indistinguishable. The obvious framing — "we need more progress output" — hid the real issue: observability coverage was inconsistent. A couple of well-trodden paths already emitted start/finish events. Several other paths, including the long-running ones, could run completely silent.
So the problem was never "add more logging." It was "every long-running seam should prove it is still alive, and right now only some of them do."
What actually happened
The first instinct is to push a liveness wrapper as deep as possible, into the lowest-level function that actually launches a subprocess. Do it once, get it everywhere. But "everywhere" is exactly the problem: that primitive is also used by paths you did not intend to change, so a low-level wrapper quietly extends the new behavior into places you never reviewed.
The opposite instinct is just as wrong: wrap only the seams that already had visible progress. That is cheap and it demos well, but it misses the silent long-running workers — the precise cases that motivated the work.
The accepted answer sat in between. A single shared liveness wrapper was added at the specific await points the controller itself owns, and every in-scope spawn was routed through it. Not the lowest primitive, not only the visible seams — the layer where this code is genuinely responsible for waiting on a worker.
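As an illustration only, a wrapper of this kind might look like the sketch below. It assumes an asyncio-based controller, which the original does not state; the name `with_heartbeat`, the `emit` callback, and the event kinds "started", "alive", and "ended" are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

# Hypothetical event callback: emit(kind, label), where kind is
# "started", "alive", or "ended".
Emit = Callable[[str, str], None]


async def with_heartbeat(
    label: str,
    work: Awaitable[T],
    emit: Emit,
    interval: float = 5.0,
) -> T:
    """Await `work`, emitting an "alive" event every `interval` seconds."""
    task = asyncio.ensure_future(work)
    emit("started", label)
    try:
        while True:
            # Wake up when the work completes or the interval elapses.
            done, _ = await asyncio.wait({task}, timeout=interval)
            if done:
                return task.result()  # re-raises if the work failed
            emit("alive", label)
    finally:
        # Runs on success, failure, and cancellation alike, so the
        # heartbeat never outlives the work it describes.
        if not task.done():
            task.cancel()
        emit("ended", label)
```

The controller would call this at each await point it owns, for example `await with_heartbeat("worker", run_worker(), emit)`, where `run_worker` stands in for whatever coroutine waits on the subprocess. The low-level spawn primitive stays untouched, so paths outside the controller keep their existing behavior.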
Two smaller decisions mattered just as much:
- Truthful failure semantics. The existing result contract could not actually tell a timeout apart from a signal, an output-limit kill, or a generic non-zero exit. The temptation was to render richer terminal messages. The correct move was to surface the existing failure code verbatim and not invent distinctions the data model cannot support.
- Silence is not suppression. A seam that emits nothing because it produces no output is a different thing from a sink that is configured to swallow output. Conflating them leads to "skip the wrapper when there's no output," which is wrong. The wrapper should still run and still emit; whether anything is shown stays a concern of the sink at the boundary.
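Both decisions can be sketched together. The code below is a hedged illustration, not the original implementation: `WorkerResult`, its `failure_code` field, `describe`, and `Sink` are hypothetical names, and the single opaque failure code is an assumption about what the result contract looked like.

```python
from dataclasses import dataclass
from typing import Optional, TextIO


@dataclass(frozen=True)
class WorkerResult:
    # Hypothetical contract: one opaque failure code and no cause field.
    ok: bool
    failure_code: Optional[str] = None


def describe(label: str, result: WorkerResult) -> str:
    """Render a terminal line using only what the contract can express."""
    if result.ok:
        return f"{label}: ok"
    # Pass the code through verbatim; do not guess timeout vs. signal vs. kill.
    return f"{label}: failed ({result.failure_code})"


class Sink:
    """The boundary that decides whether emitted lines are shown."""

    def __init__(self, stream: TextIO, quiet: bool = False) -> None:
        self.stream = stream
        self.quiet = quiet
        self.received = 0  # counts every emit, shown or not

    def emit(self, line: str) -> None:
        self.received += 1
        if not self.quiet:
            print(line, file=self.stream)
```

With this shape the wrapper always calls `emit`, and a quiet sink simply prints nothing. The wrapper never needs to ask whether output is suppressed, which keeps "no output" and "suppressed output" from collapsing into one condition.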
There was also an event-contract decision. The new lifecycle events overlapped with older start/finish events on a couple of seams. Letting both fire would have produced duplicate terminal lines. So the new events explicitly superseded the old ones at those seams, while the old event types stayed in the model for compatibility. You have to decide — out loud — whether new events replace, coexist with, or preserve the old ones.
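One way to make the "replace" decision explicit in code is sketched below. The event names and the seam names "build" and "verify" are invented for illustration; the original does not name its events or seams.

```python
from enum import Enum
from typing import List


class EventType(Enum):
    # Older events, kept in the model for compatibility.
    LEGACY_START = "legacy_start"
    LEGACY_FINISH = "legacy_finish"
    # Newer lifecycle events emitted by the liveness wrapper.
    LIFECYCLE_STARTED = "lifecycle_started"
    LIFECYCLE_ENDED = "lifecycle_ended"


# Hypothetical seams where the new events supersede the old ones.
SUPERSEDED_SEAMS = frozenset({"build", "verify"})


def start_events(seam: str) -> List[EventType]:
    """Return exactly one event for the moment a seam starts."""
    if seam in SUPERSEDED_SEAMS:
        return [EventType.LIFECYCLE_STARTED]
    return [EventType.LEGACY_START]
```

The point of the sketch is that the supersession lives in one named place. A reviewer can see which seams switched over, and no seam can emit both event families for the same moment.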
The lesson
The change that finally passed review was bounded and honest: one shared wrapper, placed at the seams the controller owns, emitting explicit lifecycle events, kept truthful to the failure data that actually existed, and routed through every in-scope path rather than only the visible ones.
The test lesson was sharper. An early version of the test was "green" but vacuous: it checked the rendering layer without ever reaching a real wrapped seam. A passing test that never drives the real path proves nothing. The version that earned trust drove the actual command path to a real long-running seam, proved the heartbeat appeared on the error stream, and proved the data stream stayed clean. The single most valuable test was the one that would have failed against the old silent behavior.
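The shape of such a test might look like the sketch below. One caveat matters: to keep the example runnable on its own, `FAKE_COMMAND` is an inline placeholder script. A real test would invoke the actual command entry point instead, since substituting a stub is exactly the mistake described above. The heartbeat text and the JSON payload are also invented.

```python
import subprocess
import sys

# Placeholder for the real command, used only so this sketch runs anywhere.
# It writes heartbeats to stderr while it "works", then its result to stdout.
FAKE_COMMAND = r"""
import sys, time
for _ in range(3):
    print("still running: worker", file=sys.stderr, flush=True)
    time.sleep(0.05)
print('{"status": "ok"}')
"""


def run_command() -> subprocess.CompletedProcess:
    """Run the command as a user would, capturing both streams separately."""
    return subprocess.run(
        [sys.executable, "-c", FAKE_COMMAND],
        capture_output=True,
        text=True,
        timeout=30,
    )


def test_heartbeat_on_stderr_and_clean_stdout() -> None:
    proc = run_command()
    assert proc.returncode == 0
    # This assertion would fail against a silent command: no heartbeat text.
    assert "still running" in proc.stderr
    # The data stream carries only the result, with no progress noise.
    assert proc.stdout.strip() == '{"status": "ok"}'
```

Capturing stdout and stderr separately is what lets one test prove both halves: the heartbeat exists, and it did not leak into the stream that downstream tools parse.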
The broader principle
Observability belongs at the seam you own — not pushed down into shared primitives where it changes paths you never reviewed, and not stapled onto only the seams that were already visible. Keep the signal truthful to the data you actually have, and prove it against the real path, not a stub.
There is one more boundary worth naming. Making a missing dependency visible is not the same as preventing it. Once workers report their own liveness, a missing tool surfaces as a visible failure instead of an apparent hang. That is a real improvement, but it is a different problem from checking for the tool before the run starts. Don't let "now we can see it fail" masquerade as "now it can't fail."
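For contrast, the prevention layer is a separate and much smaller piece of code. A minimal preflight sketch, assuming the dependencies are executables looked up on PATH (the function name `missing_tools` is hypothetical):

```python
import shutil
from typing import Iterable, List


def missing_tools(required: Iterable[str]) -> List[str]:
    """Return the required executables that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]
```

A caller would run this before spawning any worker and stop early if the list is non-empty. The heartbeat still matters afterwards, because a tool that exists can still hang.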
How to apply it
- Enumerate every affected seam before writing anything — start from the full set of paths, not the visible entry point.
- Decide explicitly whether new progress or lifecycle events replace, coexist with, or preserve the old ones. Don't let two of them describe the same moment.
- Keep failure messages truthful to the data model. Don't render distinctions (timeout vs. signal vs. kill) the contract can't actually express.
- Keep suppression a concern of the sink. The wrapper should run and emit even when the sink shows nothing; "no output" and "suppressed output" are not the same condition.
- Test the full matrix: success, returned failure, thrown failure, timer or resource cleanup, suppression, stream separation, and real-path reachability.
- Add one test that would fail against the old silent behavior. That is the test that proves you changed anything.
- Treat "the failure is now visible" as separate work from "the failure is now prevented." Visibility during the run and a preflight check before it are two different layers.