Validation of non-deterministic systems: passing FDA inspection.

A traditional computerised system returns the same output for the same input. A machine-learned system does not, by design. The validation literature is fifty years old; the systems are five. The reconciliation is operational, not philosophical, and the FDA inspectors who walk in with the 2018 Part 11 mindset are now also reading the 2025 AI/ML guidance.

GAMP 5 second edition (2022) added AI/ML-specific guidance while retaining the lifecycle V-model that pharma validation functions have assumed for thirty years. The lifecycle still works. The system at the bottom of the V is what has changed. A model that achieves 99.4% accuracy on its reference test set on Tuesday may achieve 99.1% on Friday because the model has been retrained, because the input distribution has shifted, or because a probabilistic component has rolled the dice differently. The validation question is not whether the system is identical between Tuesday and Friday. It is whether the system is, with documented confidence, still operating within its qualified envelope.

/ 01 What stays the same.

The lifecycle stages are recognisable. URS captures intended use, regulatory context, performance requirements. FRS / DDS document functional and design specifications. IQ verifies that infrastructure is installed correctly — compute, networking, storage, version control of the model artefact, the data pipeline. OQ verifies the operational envelope — the system processes input correctly, integrates with surrounding systems correctly, generates the audit trail correctly. PQ demonstrates sustained performance under representative load. The shape is conserved. The substance changes.

/ 02 What changes.

The locked reference test set.

In a deterministic system, a test case has a single expected output. In a non-deterministic system, a test case has an expected distribution of outputs — or, more practically, a population of test cases has expected aggregate metrics with confidence bounds. The locked reference test set is the curated, version-controlled, change-controlled corpus against which every release is evaluated. The test set is itself a regulated record. ALCOA+ applies: attributable (who curated each example), legible, contemporaneous to the curation, original (provenance to source), accurate (verified labels), complete (no silent dropouts), consistent, enduring, available. The reference test set deserves the same care as the GMP batch records.
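
One way to make "locked" checkable is to fingerprint the test set and record the digest in the change-control record for each release. The following is a minimal sketch, not a prescribed method: the record fields (`id`, `label`, `curated_by`, `source`) and the function name `fingerprint_test_set` are hypothetical, and a real corpus would hash the underlying files as well as the metadata.

```python
import hashlib
import json


def fingerprint_test_set(records):
    """Return a SHA-256 digest over the records, independent of their order."""
    # Serialise each record canonically (sorted keys, fixed separators),
    # then sort the serialised lines so that ordering cannot change the digest.
    canonical = sorted(
        json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical records; the field names are illustrative only.
records = [
    {"id": "ex-001", "label": "positive", "curated_by": "analyst_a", "source": "study-1"},
    {"id": "ex-002", "label": "negative", "curated_by": "analyst_b", "source": "study-1"},
]
locked_digest = fingerprint_test_set(records)
```

Any addition, removal, or relabelling changes the digest, so a release evaluated against an unapproved variant of the test set is detectable by comparing digests.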

Acceptance criteria as bounds.

"The system shall return the correct answer" is a deterministic acceptance criterion. The non-deterministic equivalent is "the system shall achieve sensitivity ≥ X and specificity ≥ Y on the reference test set, with a 95% confidence lower bound ≥ Z, with no subgroup falling below W." The criteria are statistical. The bounds are pre-specified. The confidence interval is documented. Submissions and validation reports that omit the confidence bound are taking a position the FDA reviewer will challenge — particularly under the December 2024 PCCP guidance, which is explicit that performance criteria need bounds, not point estimates.

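A bound-style criterion of this kind can be sketched with the Wilson score interval for a binomial proportion. This is one common choice of interval, not one the guidance names. The counts and thresholds below are hypothetical, and `z=1.96` assumes the lower limit of a two-sided 95% interval; a protocol that pre-specifies a one-sided bound would use a different `z`.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower limit of the Wilson score interval for a binomial proportion.

    z=1.96 gives the lower limit of a two-sided 95% interval; z=1.645
    would give a one-sided 95% lower bound instead.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


def meets_criterion(successes, n, point_min, lower_min):
    """True only if both the point estimate and its lower bound pass."""
    return (successes / n >= point_min
            and wilson_lower_bound(successes, n) >= lower_min)


# Hypothetical sensitivity counts: 190 true positives, 10 false negatives.
tp, fn = 190, 10
sensitivity_lb = wilson_lower_bound(tp, tp + fn)
passed = meets_criterion(tp, tp + fn, point_min=0.93, lower_min=0.90)

# The same 95% point estimate from only 20 cases fails the lower bound.
small_sample_passed = meets_criterion(19, 20, point_min=0.93, lower_min=0.90)
```

The second call illustrates why a point estimate alone is not enough: 19 of 20 and 190 of 200 both give 95% sensitivity, but only the larger sample supports a lower bound above 0.90. The same function can be applied per subgroup to check the "no subgroup falling below W" clause.
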
Drift detection as PQ continuation.

Performance Qualification has historically been "demonstrate sustained performance during qualification runs". For an AI system PQ does not end. The deployed system is monitored continuously against drift indicators — input distribution shift, output distribution shift, performance metric trajectories — with thresholds that trigger investigation. The monitoring is the qualification, ongoing. Validation reports that treat PQ as a one-time milestone are missing a discipline that the system requires.
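
As an illustration of an input-drift indicator, the sketch below computes a population stability index (PSI) between a reference sample and a routine-use sample of one feature. PSI is one of several possible indicators and is chosen here only as an example. The bin edges, the sample data, and the `PSI_THRESHOLD` value are hypothetical placeholders that a real protocol would pre-specify.

```python
import math


def population_stability_index(reference, current, edges):
    """PSI between two samples over the bins defined by ascending edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of edges at or below the value.
            counts[sum(1 for e in edges if v >= e)] += 1
        total = len(values)
        # Floor each proportion so an empty bin cannot produce log(0).
        return [max(c / total, 1e-6) for c in counts]

    ref = proportions(reference)
    cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


PSI_THRESHOLD = 0.2  # hypothetical investigation trigger, not a regulatory value

reference = [i / 100 for i in range(100)]         # qualification-time inputs
shifted = [0.3 + 0.7 * i / 100 for i in range(100)]  # routine-use inputs
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(reference, reference, edges)
psi_shifted = population_stability_index(reference, shifted, edges)
needs_investigation = psi_shifted > PSI_THRESHOLD
```

In operation, each crossing of the threshold would be logged together with the investigation it triggered, which is the evidence an inspector is likely to ask for.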

Reproducibility through seeding.

For inference-time stochasticity, the FDA reviewers in 2025 are increasingly asking how reproducibility is demonstrated. The answer has four parts: seeded inference, deterministic inference modes, version-pinned dependencies, and infrastructure reproducibility. A validation report that asserts "the model is deterministic" without specifying the seeding and version-pinning protocol is asserting something it cannot demonstrate.
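
A minimal sketch of seeded inference plus a run manifest follows, using only the Python standard library. `stochastic_inference` is a toy stand-in for a model with a sampling step, and the manifest fields and the `model_version` string are hypothetical. Real ML frameworks have their own deterministic-mode settings, which this sketch does not cover.

```python
import platform
import random


def stochastic_inference(features, seed):
    """Toy stand-in for a model whose output includes a sampling step."""
    # A local generator avoids hidden dependence on global random state.
    rng = random.Random(seed)
    return [round(x + rng.gauss(0.0, 0.01), 6) for x in features]


def run_manifest(seed, model_version):
    """Record what is needed to repeat the run; the fields are illustrative."""
    return {
        "seed": seed,
        "model_version": model_version,
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


features = [0.12, 0.57, 0.93]
first = stochastic_inference(features, seed=20250101)
second = stochastic_inference(features, seed=20250101)
other = stochastic_inference(features, seed=7)
manifest = run_manifest(seed=20250101, model_version="hypothetical-1.4.2")
```

Here `first` and `second` are identical because the seed is fixed, while `other` differs. The manifest is what turns "the model is deterministic" from an assertion into something a second party can re-run.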

The system is not the model. The system is the model plus the data plus the infrastructure plus the operating procedure plus the human. Validate the system.

/ 03 The 2025 inspection pattern.

Field experience from FDA inspections through 2025, observed at sponsor sites that have already hosted them, shows three recurring lines of inquiry on AI-touched systems:

  • "Show me the test set used to validate the current production version. Show me how it was assembled. Show me the change-control history of additions and removals from it."
  • "Show me the metric thresholds for triggering corrective action. Show me an example where a threshold was crossed and what the corrective action was."
  • "Walk me through what happens when the model is retrained. Who approves the retraining decision? Who approves the deployment of the retrained model? What is preserved of the prior model?"

The sites that pass these conversations have the documentation in operational form. The sites that fail have the documentation as concept; assembling specifics under inspection time pressure produces 483 observations even where the underlying practice is sound.

/ 04 The bioanalytical case.

For AI-assisted peak integration in bioanalytical methods (the narrow scope already in ICH M10 v2 drafting), the validation question is concrete. The reference test set is a curated chromatogram library. The acceptance criterion is integration-area agreement with the reference labels, typically ±5% per peak, with no peak falling outside ±10%. Drift detection is comparison of routine integration outputs against the reference distribution. Retraining triggers a partial revalidation under M10 §6, with its scope defined by the change involved: column chemistry, extraction method, or instrument generation. The framework already exists; the AI sits inside it, not parallel to it.
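
The agreement check can be sketched as a per-peak percentage deviation against the reference areas. The three-way classification below is an assumption about how the ±5% and ±10% figures might be operationalised: within ±5% passes, between ±5% and ±10% is flagged for review, and beyond ±10% fails. The peak identifiers and areas are hypothetical.

```python
def integration_agreement(reference_areas, model_areas,
                          typical_pct=5.0, hard_pct=10.0):
    """Classify each peak by its percentage deviation from the reference area."""
    results = {}
    for peak_id, ref_area in reference_areas.items():
        deviation = 100.0 * (model_areas[peak_id] - ref_area) / ref_area
        if abs(deviation) > hard_pct:
            status = "fail"      # outside the hard limit
        elif abs(deviation) > typical_pct:
            status = "review"    # inside the hard limit, outside the typical band
        else:
            status = "pass"
        results[peak_id] = (deviation, status)
    return results


# Hypothetical peak areas from a reference library and from the AI integrator.
reference_areas = {"peak_a": 1000.0, "peak_b": 2000.0, "peak_c": 500.0}
model_areas = {"peak_a": 1030.0, "peak_b": 2140.0, "peak_c": 440.0}

report = integration_agreement(reference_areas, model_areas)
release_blocked = any(status == "fail" for _, status in report.values())
```

With these numbers `peak_a` deviates by 3%, `peak_b` by 7%, and `peak_c` by -12%, so the run is blocked by `peak_c` alone.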

/ 05 The generative case.

For LLM-augmented document drafting — adverse-event narrative generation, regulatory submission section drafting, eTMF document classification — non-determinism is more substantial. The output is text, not a number. Validation moves to evaluation rubrics, golden-document comparison, human-review gates, and structured prompts treated as specifications under change control. The current FDA-PMDA-EMA position is that human-in-the-loop is mandatory for any output entering a regulated document; the system is validated, not the unsupervised output. Sponsors who skip the human-review gate are taking on liability that no validation report neutralises.
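
The two controls that translate most directly into code are the prompt treated as a specification under change control and the human-review gate. The sketch below is an assumption-laden illustration: the `Draft` type and the `make_draft`, `approve`, and `release` helpers are hypothetical names, and a real system would record reviewer identity through its own authentication and audit-trail mechanisms.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Draft:
    text: str
    prompt_version: str
    prompt_sha256: str            # ties the output to the exact prompt used
    reviewed_by: Optional[str] = None


def make_draft(generated_text, prompt_text, prompt_version):
    """Wrap generated text with the identity of the prompt that produced it."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return Draft(generated_text, prompt_version, digest)


def approve(draft, reviewer):
    """Return a copy of the draft with the human reviewer recorded."""
    return replace(draft, reviewed_by=reviewer)


def release(draft):
    """Refuse to release any draft that has no recorded human review."""
    if draft.reviewed_by is None:
        raise PermissionError("draft has no recorded human review")
    return draft.text


draft = make_draft("Narrative text.", "Summarise the case.", "v1")
approved = approve(draft, "reviewer_a")
released_text = release(approved)
```

Calling `release(draft)` on the unreviewed draft raises `PermissionError`, which is the point: the gate is enforced by the system and does not depend on anyone remembering the procedure.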

Validation under non-determinism is not the abandonment of rigour. It is the application of rigour to a system whose behaviour is statistical rather than mechanical. The operational shift, once made, is unremarkable. The shift not made is the audit finding.

Filed under: validation · GAMP 5 · non-deterministic · AI/ML