The Ops Zoo
Operational Thresholds and Training-Run Profiles
The Ops Zoo provides operational anchors for fleet operations: Population Stability Index (PSI) drift thresholds, the Kolmogorov-Smirnov (KS) test coefficient, a memory bit-error rate (BER), and reusable training-run goodput-loss profiles used in the MLOps, robust-AI, and distributed-training chapters.
| Threshold | Value |
|---|---|
| PSI warn | 0.1 |
| PSI review | 0.2 |
| PSI critical | 0.25 |
| KS coefficient | 1.36 |
| Memory BER / bit | 1e-17 |
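As a rough illustration of how the thresholds above might be applied, the sketch below computes a PSI value from two binned distributions, maps it onto the warn/review/critical bands, and turns the KS coefficient into a critical distance for two samples. Several things here are assumptions rather than anything the table states: the helper names (`psi`, `classify_psi`, `ks_critical_distance`) are hypothetical, the bands are treated as inclusive lower bounds, and 1.36 is read as the commonly tabulated two-sample KS coefficient at the 5% significance level.

```python
import math

# Values copied from the threshold table.
PSI_WARN, PSI_REVIEW, PSI_CRITICAL = 0.1, 0.2, 0.25
KS_COEFFICIENT = 1.36


def psi(expected, actual, eps=1e-6):
    """Population Stability Index over two lists of bin proportions.

    Each list should sum to roughly 1. Proportions are floored at `eps`
    (an assumed smoothing choice) so empty bins do not produce log(0).
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def classify_psi(value):
    """Map a PSI value onto the table's bands (assumed inclusive)."""
    if value >= PSI_CRITICAL:
        return "critical"
    if value >= PSI_REVIEW:
        return "review"
    if value >= PSI_WARN:
        return "warn"
    return "ok"


def ks_critical_distance(n, m, coefficient=KS_COEFFICIENT):
    """Two-sample KS critical distance: c * sqrt((n + m) / (n * m))."""
    return coefficient * math.sqrt((n + m) / (n * m))


if __name__ == "__main__":
    baseline = [0.25, 0.25, 0.25, 0.25]
    current = [0.40, 0.30, 0.20, 0.10]
    value = psi(baseline, current)
    print(f"PSI = {value:.4f} -> {classify_psi(value)}")
    print(f"KS critical distance (n=m=100): {ks_critical_distance(100, 100):.4f}")
```

A KS statistic larger than the critical distance would suggest that the two samples differ at the assumed significance level. Both checks are meant for worked examples, in line with the teaching purpose of these tables.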
Training Run Overheads
| Overhead | Fraction | Description |
|---|---|---|
| Checkpoint overhead | 0.03 | Asynchronous checkpointing overhead fraction. |
| Failure recovery overhead | 0.10 | Failure and restart overhead fraction at 10k+ GPU scale. |
| Maintenance overhead | 0.05 | Rolling upgrade and maintenance-window overhead fraction. |
| Pipeline bubble overhead | 0.05 | Pipeline-parallel bubble overhead fraction for a well-tuned training run. |
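The table does not say how the overhead fractions combine into a single goodput figure, so the sketch below shows two plausible readings side by side: an additive one (the fractions are subtracted from 1) and a multiplicative one (each overhead removes its fraction of whatever time remains). The dictionary keys and function names are hypothetical and chosen only for illustration; the fractions are copied from the table.

```python
import math

# Overhead fractions copied from the table; the key names are illustrative.
OVERHEADS = {
    "checkpoint": 0.03,
    "failure_recovery": 0.10,
    "maintenance": 0.05,
    "pipeline_bubble": 0.05,
}


def goodput_additive(overheads):
    """Goodput if the overheads are disjoint slices of wall-clock time."""
    return 1.0 - sum(overheads.values())


def goodput_multiplicative(overheads):
    """Goodput if each overhead applies to the time left after the others."""
    return math.prod(1.0 - fraction for fraction in overheads.values())


if __name__ == "__main__":
    print(f"additive goodput:       {goodput_additive(OVERHEADS):.4f}")
    print(f"multiplicative goodput: {goodput_multiplicative(OVERHEADS):.4f}")
```

With these fractions the two readings land close together (about 0.77 and 0.79), so the choice matters less than the size of the individual overheads, the failure-recovery term in particular.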
Python Access
import mlsysim
psi_warn = mlsysim.Ops.Monitoring.PsiWarnThreshold
psi_critical = mlsysim.Ops.Monitoring.PsiCriticalThreshold
checkpoint_overhead = mlsysim.Ops.TrainingRunOverheads.Checkpoint

These are assumption tables for teaching and appendix lineage, not live alerting defaults for production systems.