Model Output Failures Are Contract Problems, Not Formatting Bugs
Jul 05, 2026
When a worker occasionally returns malformed Markdown, the visible bug looks like a formatting mistake. It usually is not. The durable fix is to align the worker prompt, canonical grammar, parser, writer, diagnostics, and documentation as one contract family—and to classify every rejection with a deterministic, operator-readable reason.
The problem
The symptom was a runtime refusal: a structured worker output was rejected as invalid, and the operator could not tell whether the cause was bad JSON, a missing section, extra commentary, an invalid route token, or some other structural defect. Every failure collapsed into the same generic message.
The first instinct is to tighten the parser or add a normalization pass. That treats the bug as a local question of formatting tolerance. In this case, the deeper failure was contractual ambiguity: the worker prompt described one strict canonical document shape, while the parser tolerated some noncanonical shapes, including leading prose and extra heading levels. The normalization documentation still described an older field structure that no longer matched what the system actually emitted and parsed.
The item looked like a narrow parser patch. A baseline audit showed it was a protocol-hardening change spanning prompt wording, canonical grammar, parser and writer behavior, adapter failure handling, progress rendering, documentation mirrors, and tests at multiple layers.
What actually happened
The lifecycle scope looked small from the outside but hid real complexity. The important work was deterministic failure classification. Once the parser could return any of several rejection reasons, the specification had to define an exact first-failure order, so that an input with more than one defect always reports the same reason. On the first build pass, mixed-error cases did not always follow the accepted precedence, and the implementation needed rework before it could be accepted.
The eventual fix, within accepted scope, established:
- one exact-shape Markdown contract
- a closed rejection-reason set with stable identifiers
- deterministic evaluation order for overlapping defects
- case-sensitive field labels
- diagnostic previews bounded to a fixed Unicode length
- adapter-owned persistence with fail-closed refusal on invalid output
- close-to-real regression coverage on the canonical controller path, not only helper-level parser tests
- stale documentation mirrors aligned to the live contract
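As a rough illustration of the first few items above, here is a minimal sketch of a closed rejection-reason set evaluated in one fixed order, plus a bounded diagnostic preview. Everything concrete in it is hypothetical and not taken from the real system: the `# Result` heading, the `Route` and `Summary` fields, the route tokens, the reason names, the precedence order, the 80-character cap, and the choice to measure the cap in Unicode code points.

```python
from enum import Enum
from typing import Optional

# All names below are assumptions made for illustration only.
HEADING = "# Result"
FIELDS = ("Route", "Summary")      # case-sensitive labels, in canonical order
ROUTES = {"proceed", "blocked"}
FENCE = "`" * 3                     # Markdown code-fence marker
PREVIEW_CAP = 80                    # cap counted in Unicode code points


class Rejection(Enum):
    """Closed reason set; the string values act as stable identifiers."""
    CODE_FENCE = "code_fence"
    MISSING_HEADING = "missing_heading"   # e.g. leading prose before the heading
    EXTRA_CONTENT = "extra_content"
    DUPLICATE_FIELD = "duplicate_field"
    MISSING_FIELD = "missing_field"
    MISORDERED_FIELDS = "misordered_fields"
    INVALID_ROUTE = "invalid_route"


def preview(text: str, cap: int = PREVIEW_CAP) -> str:
    """Return at most `cap` code points, marking truncation with an ellipsis."""
    return text if len(text) <= cap else text[: cap - 1] + "…"


def classify(text: str) -> Optional[Rejection]:
    """Return the first failing reason in a fixed order, or None if valid."""
    lines = text.rstrip("\n").split("\n")   # assumption: trailing newlines are tolerated
    body = lines[1:]
    pairs = [line.split(": ", 1) for line in body if ": " in line]
    labels = [label for label, _ in pairs]
    route = dict(pairs).get("Route")
    # The list order IS the first-failure precedence for mixed-error inputs.
    checks = [
        (Rejection.CODE_FENCE, FENCE in text),
        (Rejection.MISSING_HEADING, lines[0] != HEADING),
        (Rejection.EXTRA_CONTENT,
         len(pairs) != len(body) or any(label not in FIELDS for label in labels)),
        (Rejection.DUPLICATE_FIELD, len(labels) != len(set(labels))),
        (Rejection.MISSING_FIELD, any(name not in labels for name in FIELDS)),
        (Rejection.MISORDERED_FIELDS, tuple(labels) != FIELDS),
        (Rejection.INVALID_ROUTE, route not in ROUTES),
    ]
    for reason, failed in checks:
        if failed:
            return reason
    return None
```

Keeping the precedence in a single ordered list means a mixed-error input, such as leading prose combined with an invalid route token, always reports the earlier reason, and a precedence test only has to assert against that one list.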
The live acceptance run mattered. The original failure occurred on the real interactive operator path, so unit-level parser proof alone was insufficient. At verified head, the run completed worker recommendation and normalization, then reached the next human gate without reproducing the prior canonical-output rejection. That confirmed the fix addressed the actual operator path, not the parser in isolation.
Residual risks were bounded: the exact stdout from the original failed run was unavailable, so regression used synthetic malformed fixtures; live model output can still violate the contract, but such violations are now intentionally rejected with a concrete reason rather than accepted or reported generically.
The lesson
A visible formatting failure may indicate a contract-family problem, not a parser bug. Fix the whole contract before you harden one layer.
When structured worker output fails validation:
- Ask what the sole authoritative result format is. Prompt wording and parser behavior must be treated as one product contract. "Strict Markdown" is not sufficient unless headings, field order, casing, whitespace, extra content, and duplicate handling are fully specified.
- Name the persistence owner. The adapter should remain the sole writer of durable artifacts. The worker should emit stdout only. The correct fix is contract alignment and better diagnostics, not accepting arbitrary output to unblock the flow.
- Define what invalid forms must be rejected. Specify stable rejection categories and add tests for leading prose, trailing prose, code fences, extra headings, duplicate fields, misordered fields, invalid tokens, truncation, and mixed errors.
- Set deterministic first-failure precedence. When more than one validation rule can fail, mixed-error inputs must classify consistently. The build audit should test mixed-failure precedence, not only isolated valid and invalid cases.
- Prove the canonical path. When legacy or compatibility paths exist, include one close-to-real controller test that proves no artifact is written and no stage advance occurs on invalid output. Preserve a real operator-path smoke test when the original defect was observed through CLI use.
- Treat diagnostic reason strings as compatibility surface. Operator-facing rejection identifiers may become part of the long-lived interface and therefore need specification and tests.
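The persistence-owner and canonical-path points above can be sketched as a small fail-closed adapter check. This is only an illustration under assumptions: `persist_if_valid`, `toy_classify`, the artifact file name, and the `missing_heading` identifier are hypothetical, and the classifier is a stand-in for whatever real parser the system uses.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def toy_classify(stdout: str) -> Optional[str]:
    """Stand-in validator (hypothetical): only checks the first line."""
    return None if stdout.startswith("# Result\n") else "missing_heading"


def persist_if_valid(
    stdout: str,
    artifact: Path,
    classify: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Adapter-owned write: return None on success, else the rejection reason.

    On rejection nothing is written, so the caller has no artifact to advance on.
    """
    reason = classify(stdout)
    if reason is not None:
        return reason                      # fail closed: no file is created
    artifact.write_text(stdout, encoding="utf-8")
    return None


# Negative and positive checks against a real temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "result.md"
    rejected = persist_if_valid("Here you go!\n# Result\n", artifact, toy_classify)
    rejected_left_no_file = not artifact.exists()
    accepted = persist_if_valid("# Result\nRoute: proceed\n", artifact, toy_classify)
    accepted_wrote_file = artifact.exists()
```

The negative assertion that matters is the absence of the file after a rejection. A test that only inspects the returned reason would still pass if the adapter wrote the artifact before validating.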
A canonical blocked route must be distinguished from informal prose saying that the worker is blocked. Real model-output paths should be tested in addition to parser helpers.
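One way to picture the blocked-route distinction is a helper that recognizes only the exact canonical line and ignores any prose mentioning the word. The `Route: ` prefix and the `blocked` token are assumptions for this sketch, not the real contract.

```python
from typing import Optional

ROUTE_PREFIX = "Route: "   # hypothetical, case-sensitive field label


def route_of(text: str) -> Optional[str]:
    """Return the token from exactly one 'Route: <token>' line, else None."""
    tokens = [
        line[len(ROUTE_PREFIX):]
        for line in text.split("\n")
        if line.startswith(ROUTE_PREFIX)
    ]
    return tokens[0] if len(tokens) == 1 else None


def is_canonical_blocked(text: str) -> bool:
    """True only for the canonical route line, never for informal prose."""
    return route_of(text) == "blocked"
```

Under these assumptions, `is_canonical_blocked("# Result\nRoute: blocked\nSummary: waiting")` is true, while a sentence such as "I'm blocked on missing credentials." and the miscased `Route: Blocked` are both false.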
The broader principle
Model-output hardening is rarely a shallow patch. It is a cross-layer contract migration touching prompts, parsers, persistence, diagnostics, documentation, and proof obligations.
A governed delivery workflow adds value here by turning an apparently local runtime failure into a contract analysis. Baseline audit can reveal prompt-parser disagreement, stale documentation mirrors, generic refusal diagnostics, missing negative tests, and shadow-path risk if implementation targets a legacy broker path instead of the canonical controller path.
Specification stages force decisions on exact headings, ordering, whitespace tolerance, blocked-route representation, rejection reason strings, diagnostic caps, and first-failure order. Plan stages preserve implementation order and architectural boundaries: parser and reason semantics come before adapter and progress changes. Build audit can reject a superficially complete implementation when mixed-error precedence or regression gates remain defective.
Verification should be layered: focused parser and adapter tests, full suite runs, direct parser probes, acceptance criteria mapped to named observations, close-to-real controller-path proof, and manual acceptance on the same operator path where the original failure occurred.
Separate subprocess completion, semantic validity, durable persistence, and lifecycle authorization. A controlled human-gate pause is a successful lifecycle outcome, not an unknown delivery failure. Facade-level result handling must reflect that distinction.
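A result type that keeps these four concerns in separate fields might look like the following sketch. The class, field, and enum names are hypothetical. The point is only that a human-gate pause is modeled as an explicit successful outcome and is not inferred from an exit code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Lifecycle(Enum):
    """Hypothetical lifecycle outcomes."""
    ADVANCED = "advanced"
    PAUSED_AT_HUMAN_GATE = "paused_at_human_gate"
    REFUSED = "refused"


@dataclass(frozen=True)
class RunResult:
    exit_code: int                 # subprocess completion
    rejection: Optional[str]       # semantic validity (None means valid)
    persisted: bool                # durable persistence
    lifecycle: Lifecycle           # lifecycle authorization

    @property
    def succeeded(self) -> bool:
        """A controlled pause at a human gate counts as success."""
        return (
            self.exit_code == 0
            and self.rejection is None
            and self.persisted
            and self.lifecycle in (Lifecycle.ADVANCED, Lifecycle.PAUSED_AT_HUMAN_GATE)
        )


paused = RunResult(0, None, True, Lifecycle.PAUSED_AT_HUMAN_GATE)
refused = RunResult(0, "invalid_route", False, Lifecycle.REFUSED)
```

Here `refused` has a clean exit code but is still a failure, which is the case a facade that checks only subprocess completion would report incorrectly.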
How to apply it
Before implementation on similar items, answer four questions:
- What is the sole authoritative result format?
- Which component owns persistence?
- What exact invalid forms must be rejected?
- In what deterministic order are multiple defects classified?
Without those answers, a "strict" worker contract can still be internally ambiguous.
Practical checklist:
- Define the canonical output grammar before implementation, including exact ordering and tolerated whitespace.
- Audit prompt, parser, writer, normalization instructions, protocol documentation, and fixtures as one contract family.
- Add a writer-to-parser round-trip test.
- Require parser specifications to define mixed-error precedence whenever more than one validation rule can fail.
- Keep negative tests as explicit as positive ones.
- Do not combine unrelated baseline-audit, normalization, or audit-ownership changes into the same item unless the audit proves they share one implementation boundary.
- Preserve manual operator-path acceptance when the original defect was observed through interactive CLI use.
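The writer-to-parser round-trip item from the checklist can be sketched in a few lines. The grammar here (a `# Result` heading followed by single-line `Route` and `Summary` fields, each terminated by a newline) is an assumed toy contract, and `write` and `parse` are hypothetical names.

```python
from typing import Dict

HEADING = "# Result"               # hypothetical canonical heading
FIELDS = ("Route", "Summary")      # hypothetical fields, in canonical order


def write(doc: Dict[str, str]) -> str:
    """Emit the canonical form: heading, then each field on its own line."""
    return HEADING + "\n" + "".join(f"{name}: {doc[name]}\n" for name in FIELDS)


def parse(text: str) -> Dict[str, str]:
    """Accept only the exact shape that write() produces (single-line values)."""
    lines = text.split("\n")
    # Heading + one line per field + the empty string after the final newline.
    if len(lines) != len(FIELDS) + 2 or lines[0] != HEADING or lines[-1] != "":
        raise ValueError("unexpected document shape")
    doc: Dict[str, str] = {}
    for name, line in zip(FIELDS, lines[1:-1]):
        prefix = name + ": "
        if not line.startswith(prefix):
            raise ValueError(f"expected field {name!r}")
        doc[name] = line[len(prefix):]
    return doc


def test_round_trip() -> None:
    doc = {"Route": "blocked", "Summary": "waiting: on review"}
    text = write(doc)
    assert parse(text) == doc          # the writer's output is accepted unchanged
    assert write(parse(text)) == text  # and re-serializes byte for byte


test_round_trip()
```

Checking both directions catches the drift described earlier: a writer that emits something the parser rejects, and a parser that accepts something the writer would never produce in that form.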
When output looks malformed, stop patching tolerance. Find where the system decides what valid output means—and align every surface that interprets it before anything downstream treats the result as durable truth.