Data Foundations

Purpose

What makes “data” a first-class systems constraint, and how do we measure when it is silently breaking our models?

In ML systems, data is not an abstract dataset but physical volume that must move through disks, networks, CPUs, and accelerator memory. Many expensive training runs are limited not by FLOP/s, but by I/O bandwidth, serialization overhead, and avoidable scans of irrelevant bytes. In production, the more dangerous failure mode is quieter: distributions drift, tails dominate user experience, and accuracy degrades long before average metrics look suspicious. This appendix collects the reference calculations and statistical tools for reasoning about the data path as a systems engineer: data gravity napkin math, format and serialization costs, the algebraic primitives that create pipeline blowups, and drift metrics that compare full distributions rather than means. In D·A·M terms, it isolates the data axis and shows how volume, movement, and distribution constrain algorithm behavior and machine utilization.

How to Use This Appendix

This appendix is designed as a reference. Reach for it when debugging “slow training,” “low accelerator utilization,” or “production accuracy drift” calls for the quickest path from symptom to measurement.

Conventions used here follow the book-wide notation (for example, we reserve \(B\) for batch size and use \(\text{BW}\) for bandwidth).

  • When data will not move: Start with table 1 and the transfer-time equation in section 1.1.
  • When the accelerator is starving: Use table 2 and the layout discussion in section 1.1.3.
  • When pipelines explode in cost: Use the primitives in section 1.1.4, especially join-induced shuffles.
  • When “average looks fine” but users complain: Use section 1.2.1 and session-level tail probability.
  • When accuracy drifts silently: Use section 1.2.2 to compare full distributions.

With those reference points in place, the appendix begins with the physical constraints of data engineering and then moves to the statistical monitoring that keeps ML systems healthy. From storage formats that determine I/O throughput to drift metrics that detect silent failures, these foundations connect directly to the data pipelines in Data Engineering and the operational monitoring in ML Operations.

Data Engineering Foundations

Understanding hardware constraints is only half the battle; we must also shape our data to fit them. Data engineering applies the principles of the memory hierarchy to storage formats and pipeline design, ensuring that the accelerator never starves. This process begins with recognizing that data is physical—it has volume, it takes time to move, and it requires energy to parse.

Napkin math: The physics of data gravity

Data gravity1 is not a metaphor; it is a calculation of transfer time. Unlike compute, which gets faster every year, the speed of light is fixed and network bandwidth is a finite resource. When datasets grow large enough, they effectively become stationary—moving them costs more time and energy than moving the computation to where the data already lives.

1 Data Gravity: Coined by Dave McCrory in 2010 to describe how large datasets attract services and applications toward them, much as massive bodies attract smaller ones in physics (McCrory 2010). The analogy is apt: the “escape velocity” required to move a petabyte-scale dataset is often measured in weeks.

McCrory, Dave. 2010. Data Gravity – in the Clouds.

The transfer time for moving a dataset is simply \(T = D_{\text{vol}} / \text{BW}\) (for the large volumes here, latency is negligible; the full equation appears in Bandwidth vs. latency). Table 1 illustrates the sobering reality:

Table 1: The Cost of Inertia: Why we “ship compute to data” rather than “ship data to compute.” The truck entry is an illustrative elapsed-time model for 1 PB: 8 hours to load, 32 hours in transit, and 8 hours to unload, excluding scheduling and provisioning overhead. At 1 PB, the network is often not a viable option.
Data Volume 1 Gbps (Standard WAN) 10 Gbps (High-End WAN) 100 Gbps (Direct Connect) Truck (Illustrative)
1 TB 2.2 hours 13.3 minutes 80 seconds N/A
100 TB 9 days 22.2 hours 2.2 hours N/A
1 PB 3 months 9 days 22.2 hours 2 days

The cost of serialization

Even after data arrives at the machine, we face one final hurdle: the serialization tax.2 As table 2 shows, many engineers meticulously optimize their accelerator kernels while ignoring the CPU overhead of decoding data. Parsing text-based formats like JavaScript Object Notation (JSON) or CSV is extremely CPU-intensive, often leaving the accelerator idling while the CPU struggles to convert strings into floating-point numbers.

2 Serialization: From Latin serialis (forming a series). The process of converting in-memory data structures into a byte stream for storage or transmission, and the reverse (deserialization). In ML pipelines, the choice of serialization format can dominate end-to-end training time; Data Engineering covers pipeline design strategies that minimize this overhead.

Table 2: Serialization Overhead: Zero-copy formats like Arrow are at least 10\(\times\) faster than row-based text formats in this representative comparison because they align directly with internal memory structures.
Format Decoding Speed (MB/s) Relative CPU Decode Cost Suitability
CSV/JSON ~100 MB/s High Debugging only
Protobuf ~300 MB/s Medium RPC/Messages
Parquet/Arrow > 1,000 MB/s Low High-Scale Training

Row vs. columnar formats

The choice of file format determines the “physics” of how the data is read. In row-oriented formats such as CSV and JSON, data is stored record-by-record, so reading just the age column requires scanning every byte of every row: efficient for writing an appended log, but inefficient for analytics that train on specific features. By contrast, column-oriented formats such as Parquet and Arrow store data column-by-column, so reading age seeks to that column’s block and reads it sequentially. This enables projection pushdown (reading only the bytes required) and vectorized processing (single instruction, multiple data (SIMD) operations on columns). Compare the two arrangements side by side in figure 1 to see why columnar access avoids scanning unnecessary bytes.

Figure 1: Storage Layouts: Row-oriented formats pack data together by record (good for transactions). Column-oriented formats pack data by feature (good for analytics).

This layout difference becomes a systems issue whenever accelerator utilization depends on data loading speed.

Systems Perspective 1.1: The accelerator starvation problem
The choice of file format determines whether a system is I/O bound or compute bound. As table 2 shows, the serialization tax compounds with storage layout: row-oriented formats force full-row scans while columnar formats enable projection pushdown, reading only the bytes the model needs. The result is that the “Data Movement” term in the iron law can silently become the bottleneck that leaves expensive accelerators idling.

The algebra of data

Feature engineering turns raw records into the columns a model can learn from, and that transformation is usually a dataflow problem before it is a modeling problem. A pipeline first narrows the population, then chooses the features, then attaches context from other tables. Those three moves correspond to three Structured Query Language (SQL) primitives, and their computational cost determines whether the feature job stays local, scans unnecessary bytes, or turns into a network shuffle.

  1. Selection (\(\sigma\)): Filtering rows (for example, WHERE age > 30).
    • Cost: Cheap (\(\mathcal{O}(\log N)\)) if indexed; expensive (\(\mathcal{O}(N)\)) if full scan.
  2. Projection (\(\pi\)): Selecting columns (for example, SELECT age).
    • Cost: Free in columnar formats (only read relevant blocks). In row formats, the entire 1 KB row must be read just to extract the 4-byte integer, wasting 99.6 percent of I/O bandwidth.
  3. Join (\(\bowtie\)): Combining tables. The most expensive operation.
    • Shuffle Join: Both tables are partitioned by key and exchanged over the network.
      • Cost: Massive network traffic. Joining two 1 TB tables requires moving ~2 TB over the network.
    • Broadcast Join: One small table is sent to all workers.
      • Cost: Minimal network, but the small table must fit in RAM.

Understanding data formats, serialization costs, and algebraic primitives tells us how to move data efficiently. Even a perfectly engineered pipeline, however, can silently fail if the data it carries changes character over time. Detecting that change—and quantifying how much it matters—requires a different set of tools: probability and statistics.

Probability and Statistics

Once data is flowing through our pipelines, we need mathematical tools to ensure its quality and consistency. Probability and statistics provide the language for monitoring system health, detecting the silent failures of data drift, and managing uncertainty.

Systems Perspective 1.2: Why statistics matters for systems
Average latency can look fine while users still experience failures because systems live in the “long tail.” Statistics gives us the tools to measure uncertainty, detect drift, and handle numerical stability—three capabilities that separate robust production systems from fragile ones.

Distributions and the long tail

In systems, the mean is often misleading. Latency distributions are often long-tailed, and user-visible services are commonly governed by high-percentile behavior rather than mean response time (Dean and Barroso 2013). A “P99” (99th percentile) latency of 500 ms means 1 percent of requests experience that tail latency; with many requests per session, the fraction of users who see at least one slow request can be much higher. At scale (1M users), even a 1 percent affected-user rate would affect 10,000 users.

Napkin Math 1.1: The median experience
If a user session involves \(N_{\text{session}} = 100\) requests (common for a web page load or chat session), the probability of experiencing at least one P99 latency spike is \(\Pr(\text{Slow}) = 1 - (0.99)^{100} \approx 1 - 0.366 = 63.4\%\).

Computed: \(\Pr(\text{Slow})\) = 63.4 percent. For a heavy user (\(N_{\text{session}}\) is large), the “tail” latency becomes their median experience. This is why large production systems optimize for high-percentile latency such as P99.9 or P99.99 (Dean and Barroso 2013; DeCandia et al. 2007).

DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. “Dynamo: Amazon’s Highly Available Key-Value Store.” Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, 205–20. https://doi.org/10.1145/1294261.1294281.
Dean, Jeffrey, and Luiz André Barroso. 2013. “The Tail at Scale.” Communications of the ACM 56 (2): 74–80. https://doi.org/10.1145/2408776.2408794.

Measuring drift (divergence)

Because distributions have long tails where the most dangerous failures hide, simple metrics like “mean shift” are insufficient. We need tools that compare the entire shape of the distribution. Detecting drift between the serving distribution (\(P_t\)) and training distribution (\(P_0\)) requires measuring the “distance” between distributions.

KL divergence3 measures how much information is lost if we approximate \(P_t\) with \(P_0\) (Kullback and Leibler 1951): \[ \mathcal{D}_{\text{KL}}(P_t \lVert P_0) = \sum_x P_t(x) \log \frac{P_t(x)}{P_0(x)} \]

3 KL Divergence: Named after Solomon Kullback and Richard Leibler, who introduced it in 1951. Also called relative entropy, it quantifies the expected extra information needed to encode samples from one distribution using a code optimized for another; for drift monitoring in this volume, the canonical direction is \(\mathcal{D}_{\text{KL}}(P_t \lVert P_0)\). With natural logarithms, as in the example above, the unit is nats; with base-2 logarithms, the unit is bits. In production ML, KL divergence is the theoretical backbone of many drift-detection metrics.

Kullback, Solomon, and Richard A. Leibler. 1951. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22 (1): 79–86. https://doi.org/10.1214/aoms/1177729694.

Napkin Math 1.2: Worked example: KL divergence for drift detection
Scenario: A sentiment classifier was trained on data where 60 percent of reviews were positive, 30 percent negative, and 10 percent neutral. After deployment, the serving distribution shifts to 45 percent positive, 40 percent negative, and 15 percent neutral.

Training distribution \(P_0\): [0.60, 0.30, 0.10]. Serving distribution \(P_t\): [0.45, 0.40, 0.15].

\[\begin{gather*} \mathcal{D}_{\text{KL}}(P_t \lVert P_0) = 0.45 \log\frac{0.45}{0.60} + 0.40 \log\frac{0.40}{0.30} + 0.15 \log\frac{0.15}{0.10}\end{gather*}\] \[\begin{gather*} = 0.45 \times (-0.2877) + 0.40 \times (0.2877) + 0.15 \times (0.4055) \\[-1pt] = -0.1295 + 0.1151 + 0.0608 = 0.0464\text{ nats}\end{gather*}\] \[\begin{gather*} \mathcal{D}_{\text{KL}}(P_0 \lVert P_t) = 0.60 \log\frac{0.60}{0.45} + 0.30 \log\frac{0.30}{0.40} + 0.10 \log\frac{0.10}{0.15}\end{gather*}\] \[\begin{gather*} = 0.60 \times 0.2877 + 0.30 \times (-0.2877) + 0.10 \times (-0.4055) \\[-1pt] = 0.1726 + (-0.0863) + (-0.0405) = 0.0458\text{ nats}\end{gather*}\]

Notice the asymmetry: \(\mathcal{D}_{\text{KL}}(P_0 \lVert P_t) \neq \mathcal{D}_{\text{KL}}(P_t \lVert P_0)\). The Population Stability Index (PSI) symmetrizes this:

\[\begin{gather*} \text{PSI} = \sum (P_{0,i} - P_{t,i}) \log\frac{P_{0,i}}{P_{t,i}} = (0.15)(0.2877) + (-0.10)(-0.2877) + (-0.05)(-0.4055) \\[-1pt] = 0.0432 + 0.0288 + 0.0203 = 0.0922 \end{gather*}\]

Since PSI = 0.0922 < 0.2, this drift is noticeable but does not yet trigger a retraining alert. However, monitoring should increase in frequency.

Because KL divergence is asymmetric (\(\mathcal{D}_{\text{KL}}(P_0 \lVert P_t) \neq \mathcal{D}_{\text{KL}}(P_t \lVert P_0)\)), practitioners often use PSI, a symmetric metric derived from KL divergence that is easier to threshold. A PSI > 0.2 typically triggers an alert for retraining.

Both KL divergence and PSI are grounded in a deeper framework—information theory—which provides the units and bounds that make these metrics principled rather than ad hoc.

Information theory for systems

A training run can consume more examples, run longer, and still stop improving if the added data carries too little useful signal. At that point, the bottleneck is not only compute or bandwidth; it is the amount of information the data pipeline delivers to the learner. Information roofline (the destination) treats that data quality limit as a physical constraint, and information theory provides the units for reasoning about it.

Entropy (\(H\))4 is the average information content (uncertainty) in a distribution, defined as \(H(X) = -\sum p(x) \log p(x)\). The log base sets the unit: natural logs give nats, while \(\log_2\) gives bits. A uniform distribution has maximum entropy (maximum uncertainty). This connects directly to KL divergence above: \(\mathcal{D}_{\text{KL}}\) measures excess information needed when using the wrong distribution.

4 Entropy: From Greek entropia (transformation). Shannon borrowed the term from thermodynamics in 1948 to quantify information content (Shannon 1948). In systems terms, entropy quantifies the theoretical minimum code length needed to encode a message from a source, measured in bits when using \(\log_2\) and nats when using natural logs, directly relevant to compression ratios and data pipeline sizing.

Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27 (3): 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.

Information Density is the amount of useful signal per unit of storage. High-quality data has high information density; noisy data has low density.

Signal-to-Noise Ratio (SNR) is the ratio of useful information to irrelevant variance. In ML, training on low-SNR data is like trying to learn a pattern from static—the system hits the information roofline where adding more compute yields no improvement.

Logits and numerical stability

Neural networks output logits5 (unnormalized scores), not probabilities. We convert them using Softmax:

5 Logit: From log + unit, coined by Joseph Berkson in 1944. The logit function is the inverse of the logistic (sigmoid) function: \(\text{logit}(p) = \log(p/(1-p))\). In deep learning, “logits” refers more loosely to the raw, unnormalized output of the final linear layer before any activation function is applied. \[ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum e^{z_j}} \]

6 Log-Sum-Exp: Implemented as torch.logsumexp in PyTorch and scipy.special.logsumexp in SciPy. It relies on the identity \(\log\left(\sum e^{x_i}\right) = a + \log\left(\sum e^{x_i - a}\right)\), where \(a = \max(x_i)\). Shifted values \(x_i - a\) are \(\le 0\), ensuring exponentials never overflow.

The problem is that if \(z_i\) is large (for example, 100), the exponential \(e^{z_i}\) overflows common training and inference formats such as FP32, BF16, and FP16. FP64 can represent this specific value, but production ML kernels rarely use FP64 for softmax. The solution is to compute in log-space: the “Log-Sum-Exp” trick allows us to compute \(\log\left(\sum e^{z_j}\right)\) without ever calculating the massive exponentials directly, preserving numerical precision.6

A small softmax calculation shows why this trick matters in practice, even for large but realistic logit values:

Napkin Math 1.3: Worked example: Log-sum-exp in action
The setup: A three-class classifier outputs logits \(z =\) \(\lbrack 100, 101, 102 \rbrack\).

Without the trick (naive softmax): \[ \text{$\exp(100) \approx 2.7 \times 10^{43}$},\quad \text{$\exp(101) \approx 7.3 \times 10^{43}$}, \quad \text{$\exp(102) \approx 2.0 \times 10^{44}$} \]

These numbers are representable in FP64 but overflow FP32 (max \(\approx 3.4 \times 10^{38}\)). With FP16 (max \(\approx 65{,}504\)), even \(e^{12}\) overflows. In practice, logits of magnitude 100 are not unusual in deep networks, and FP16/BF16 is the standard training precision—so naive softmax fails routinely.

With the trick: Subtract \(a = \max(z) =\) 102: \[\begin{gather*} \text{$z - a =$ $\lbrack -2, -1, 0 \rbrack$} \\ \text{$\exp(-2) \approx 0.135$,}\quad \text{$\exp(-1) \approx 0.368$,}\quad \text{$\exp(0) = 1.0$} \end{gather*}\]

Sum = 1.503. LogSumExp = 102 \(+ \log(1.503) = 102.408\).

Softmax: \(\lbrack 0.135/1.503,\; 0.368/1.503,\; 1.0/1.503 \rbrack = \lbrack 0.090,\; 0.245,\; 0.665 \rbrack\)

The exponentiated shifted values and the resulting softmax probabilities are in \([0, 1]\)—no overflow risk, even in FP16.

These worked examples now give us enough machinery to check the full chain, from physical data movement to distribution shift and numerical stability.

Checkpoint 1.1: Check your understanding
  1. A training pipeline reads 500 GB of CSV data over a 10 Gbps link. Estimate the transfer time. Now estimate how long it would take if the data were stored as Parquet and only 20 percent of columns were needed—what changes and why?

  2. Your model’s average latency is 50 ms, but P99 is 800 ms. If a typical user session involves fifty requests, what is the probability that a user experiences at least one P99 spike? Does “average latency” adequately describe user experience?

  3. Explain why KL divergence is asymmetric and why this matters when choosing a drift metric for production monitoring. When would you prefer PSI over raw KL divergence?

The tools in this section—tail-aware metrics, drift divergences, information-theoretic bounds, and numerical stability tricks—give us the vocabulary to diagnose data-related failures quantitatively rather than anecdotally.

Summary

Key Takeaways: Data as a physical constraint
  • Data has physical inertia: Transfer time scales linearly with volume and inversely with bandwidth, making petabyte-scale datasets effectively immovable. Design pipelines around data locality rather than data movement.
  • Serialization format is a first-order decision: Columnar binary formats (Parquet, Arrow) can be at least 10\(\times\) faster to decode than text formats (CSV, JSON) in representative comparisons, directly determining whether accelerators starve or stay fed.
  • Algebraic primitives shape I/O cost: Selection, projection, and join have radically different I/O costs. Joins in particular can require moving both input tables across the network; choosing between shuffle and broadcast joins depends on relative table sizes.
  • Average metrics hide tails: Long-tailed distributions mean that the P99 latency becomes the typical user experience at scale, and monitoring only the mean creates dangerous blind spots.
  • Drift detection compares distributions: KL divergence and symmetric population-stability measures provide principled ways to detect distribution shift before accuracy metrics react.
  • Numerical stability is mandatory: The log-sum-exp trick and log-space computation prevent overflow in softmax and loss calculations, making them essential building blocks of any training or inference pipeline.
Back to top