The Ops Zoo

Operational Thresholds and Training-Run Profiles

The Ops Zoo provides reference anchors for fleet operations: Population Stability Index (PSI) drift thresholds, the Kolmogorov-Smirnov (KS) test coefficient, a memory bit-error rate (BER), and reusable training-run goodput-loss profiles used in the MLOps, robust-AI, and distributed-training chapters.

Threshold           Value
PSI warn            0.1
PSI review          0.2
PSI critical        0.25
KS coefficient      1.36
Memory BER / bit    1e-17
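
As a rough illustration of how the PSI and KS values above might be applied, the sketch below computes a PSI over pre-binned proportions, maps it to a severity level, and derives a two-sample KS critical distance. It rests on several assumptions that the table does not state: thresholds are treated as inclusive lower bounds, the KS coefficient is taken to be the usual large-sample c(α) ≈ 1.36 for α = 0.05, and the helper names (`psi`, `psi_level`, `ks_critical`) are hypothetical and not part of `mlsysim`.

```python
import math

# Values copied from the threshold table above.
PSI_WARN, PSI_REVIEW, PSI_CRITICAL = 0.1, 0.2, 0.25
KS_COEFFICIENT = 1.36  # assumed to be c(alpha) for alpha = 0.05


def psi(expected, actual, eps=1e-6):
    """Population Stability Index over two lists of bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Clamp empty bins so the logarithm stays finite.
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def psi_level(value):
    """Map a PSI value to a level (inclusive bounds are an assumption)."""
    if value >= PSI_CRITICAL:
        return "critical"
    if value >= PSI_REVIEW:
        return "review"
    if value >= PSI_WARN:
        return "warn"
    return "ok"


def ks_critical(n, m, coefficient=KS_COEFFICIENT):
    """Large-sample critical distance for a two-sample KS test."""
    return coefficient * math.sqrt((n + m) / (n * m))


reference = [0.25, 0.25, 0.25, 0.25]
live = [0.40, 0.30, 0.20, 0.10]
print(psi_level(psi(reference, live)))
print(ks_critical(1000, 1000))
```

Under these assumptions, a KS statistic above `ks_critical(n, m)` would suggest that the two samples differ at roughly the 5% level.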

Training Run Overheads

Overhead                     Fraction    Description
Checkpoint overhead          0.03        Asynchronous checkpointing overhead.
Failure recovery overhead    0.10        Failure and restart overhead at 10k+ GPU scale.
Maintenance overhead         0.05        Rolling-upgrade and maintenance-window overhead.
Pipeline bubble overhead     0.05        Pipeline-parallel bubble overhead for a well-tuned training run.
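
One way to turn these fractions into a single goodput estimate is sketched below. The page does not say how the overheads combine, so the sketch shows two plausible conventions and labels both as assumptions: an additive model that subtracts the fractions from 1, and a multiplicative model that treats each overhead as an independent loss applied to the remaining time. The dictionary keys and function names are hypothetical.

```python
# Fractions copied from the overhead table above.
OVERHEADS = {
    "checkpoint": 0.03,
    "failure_recovery": 0.10,
    "maintenance": 0.05,
    "pipeline_bubble": 0.05,
}


def goodput_additive(overheads):
    """Assumption: overheads are disjoint shares of wall-clock time."""
    return 1.0 - sum(overheads.values())


def goodput_multiplicative(overheads):
    """Assumption: each overhead independently scales the remaining time."""
    goodput = 1.0
    for fraction in overheads.values():
        goodput *= 1.0 - fraction
    return goodput


print(f"additive:       {goodput_additive(OVERHEADS):.4f}")
print(f"multiplicative: {goodput_multiplicative(OVERHEADS):.4f}")
```

With the table's values the two conventions give about 0.77 and 0.79 respectively, so the choice of model matters less than the fractions themselves at this scale.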

Python Access

import mlsysim

psi_warn = mlsysim.Ops.Monitoring.PsiWarnThreshold
psi_critical = mlsysim.Ops.Monitoring.PsiCriticalThreshold
checkpoint_overhead = mlsysim.Ops.TrainingRunOverheads.Checkpoint

These are assumption tables for teaching and appendix lineage, not live alerting defaults for production systems.
