D·A·M Foundations

Purpose

When a single node underperforms, which axis binds first: the data path, the algorithm, or the machine?

In production, “it is slow” and “it is wrong” are rarely informative symptoms. A serving stack can miss its latency objective because the accelerator is idle (data starvation), because the model is doing unnecessary work (algorithmic overhead), or because the accelerator is genuinely saturated (machine-bound). Without understanding D·A·M, teams often optimize the wrong thing: they buy faster accelerators to fix a slow input pipeline, or rewrite kernels when the model is simply too large for the latency budget. This appendix provides a compact diagnostic framework built on three axes (Data, Algorithm, and Machine) and maps single-machine symptoms and measurements to the dominant term of the iron law. The C³ taxonomy extends these diagnostics to the distributed fleet, but it relies on a firm foundation of single-node performance: this refresher is the node-level base that fleet-scale compute, communication, and coordination build on.

Learning Objectives

Classify single-machine bottlenecks by dominant Data, Algorithm, or Machine axis while recognizing mixed causes
Map optimization techniques to their Data-Algorithm-Machine intersection zone to understand which axes they span
Apply the iron law equation to quantitatively diagnose performance problems
Distinguish between memory-bound and compute-bound workloads using Arithmetic Intensity
Select appropriate profiling tools and optimization strategies for each Data-Algorithm-Machine axis
Evaluate system health using Data-Algorithm-Machine scorecard metrics (I/O Overhead, Compression/Sparsity Ratio, MFU)

How to Use This Appendix

This appendix is designed as a reference for single-node performance. Start with the scorecard-style metrics, form a hypothesis about which axis dominates, and then pick the tool that can confirm (or falsify) that hypothesis.

When training is slow on a single GPU, check utilization, data wait time, and MFU, then map each to its Data, Algorithm, or Machine axis. When serving misses a latency target, identify whether the regime is latency-bound (fixed overhead), memory-bound (weight/KV movement), or compute-bound. When cost is exploding, use the D·A·M rubric to ensure that effort targets the dominant term rather than a term that is not the bottleneck.

The Data · Algorithm · Machine (D·A·M) taxonomy is the primary diagnostic framework for ML systems engineering. It formalizes the interdependence between information flow, mathematical logic, and physical execution. When performance stalls or behavior degrades, ask: where is the flow blocked? The taxonomy helps practitioners decompose bottlenecks across three diagnostic axes in the spirit of a MECE decomposition¹, while recognizing that real systems can involve mixed causes or interactions between axes.

¹ MECE (Mutually Exclusive, Collectively Exhaustive): A classification principle from management consulting (popularized by McKinsey) requiring that categories do not overlap and together cover every possibility. Applied to systems engineering, it is useful as an idealized decomposition, but D·A·M bottlenecks often overlap in practice: a kernel can be simultaneously memory-bound, synchronization-heavy, and poorly matched to available hardware.

Diagnostic Summary

The taxonomy maps directly to the iron law of ML systems, introduced in The fleet law. Table 1 summarizes the role, primary physical constraint, and core optimization pathway for each axis.

Table 1: D·A·M Axis Reference: Each axis maps to a distinct physical constraint and a high-leverage optimization strategy. Start diagnosis here: identify which constraint is binding, then follow the optimization lever.

Axis	Role	Physical Constraint	High-Leverage Optimization
Data (D)	Information (The Fuel)	Bandwidth (\(\text{BW}\))	I/O Pipeline Optimization
Algorithm (A)	Logic (The Blueprint)	Operations (\(O\))	Model Compression
Machine (M)	Physics (The Engine)	Throughput (\(R_{\text{peak}}\))	Hardware Acceleration

Iron Law Mapping

With the three axes named, the next step is to connect them to time. The performance of any ML task is governed by the distribution of work across the D·A·M axes, and the iron law mapping reveals which component’s variables dominate execution:

\[ T = \underbrace{ \frac{D_{\text{vol}}}{\text{BW}} }_{\text{Data (D)}} + \underbrace{ \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} }_{\text{Algorithm (A)/Machine (M)}} + \underbrace{ L_{\text{lat}} }_{\text{Overhead}} \]

Algorithm and Machine share the compute term and are separated by which variable the engineer controls. Reducing the total operations (\(O\)) is an Algorithm lever, while improving the hardware’s peak throughput (\(R_{\text{peak}}\)) or utilization (\(\eta_{\text{hw}}\)) is a Machine lever.
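The iron law can be evaluated term by term to see which one dominates. The following is a minimal sketch, not a definitive implementation: the names `IronLawTerms` and `iron_law` are hypothetical, and the numbers in the example are illustrative values rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class IronLawTerms:
    """Per-term time in seconds for one unit of work (for example, one step)."""
    data_s: float      # D_vol / BW
    compute_s: float   # O / (R_peak * eta_hw)
    overhead_s: float  # L_lat

    @property
    def total_s(self) -> float:
        # Additive (sequential) form of the iron law.
        return self.data_s + self.compute_s + self.overhead_s

    def dominant(self) -> str:
        # Name of the largest term; ties resolve to the first listed.
        terms = {
            "data": self.data_s,
            "compute": self.compute_s,
            "overhead": self.overhead_s,
        }
        return max(terms, key=terms.get)


def iron_law(d_vol_bytes, bw_bytes_per_s, ops, r_peak_flops, eta_hw, l_lat_s):
    """Evaluate T = D_vol/BW + O/(R_peak * eta_hw) + L_lat term by term."""
    return IronLawTerms(
        data_s=d_vol_bytes / bw_bytes_per_s,
        compute_s=ops / (r_peak_flops * eta_hw),
        overhead_s=l_lat_s,
    )


# Illustrative step (assumed values): 2 GB moved at 1 TB/s, 1e12 FLOPs on a
# 100 TFLOP/s device at 50% utilization, and 1 ms of fixed overhead.
terms = iron_law(2e9, 1e12, 1e12, 100e12, 0.5, 1e-3)
print(terms.dominant(), f"{terms.total_s * 1e3:.1f} ms")
```

In this example the compute term (20 ms) outweighs the data term (2 ms) and the overhead (1 ms), so the Algorithm and Machine levers are the ones worth examining first.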

D·A·M coordination: From sum to max

The additive iron law represents sequential execution—the worst case where Data, Algorithm, and Machine take turns. Skilled systems engineering transforms the sum into a max:

\[ T_{\text{sequential}} = \frac{D_{\text{vol}}}{\text{BW}} + \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} + L_{\text{lat}} \quad \xrightarrow{\text{overlap}} \quad T_{\text{pipelined}} = \max\left(\frac{D_{\text{vol}}}{\text{BW}},\; \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}}\right) + L_{\text{lat}} \]
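A small worked instance makes the sum-to-max transformation concrete. The sketch below uses hypothetical per-step times; the function names are illustrative and not drawn from any library.

```python
def t_sequential(data_s, compute_s, overhead_s):
    """Data, compute, and overhead take turns: the terms add."""
    return data_s + compute_s + overhead_s


def t_pipelined(data_s, compute_s, overhead_s):
    """Data movement overlaps compute: only the longer of the two is paid."""
    return max(data_s, compute_s) + overhead_s


# Assumed per-step times in seconds (illustrative, not measured).
data_s, compute_s, overhead_s = 0.008, 0.010, 0.001

seq = t_sequential(data_s, compute_s, overhead_s)
pipe = t_pipelined(data_s, compute_s, overhead_s)
speedup = seq / pipe

print(f"sequential={seq * 1e3:.1f} ms  pipelined={pipe * 1e3:.1f} ms  "
      f"speedup={speedup:.2f}x")
```

Because only two terms overlap, the gain from pipelining is bounded by 2x, reached when the data and compute terms are equal and the overhead is negligible. When one term already dwarfs the other, overlap hides little and effort is better spent shrinking the larger term.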

The systems engineer’s job is to make these components run in parallel, not in series. Table 2 summarizes the key overlap techniques:

Table 2: D·A·M Overlap Techniques: Each technique allows one D·A·M axis to execute while another is in flight, converting the iron law’s additive terms into overlapped terms.

Technique	D·A·M Axes Overlapped	Implementation
Prefetching	D overlaps M	DataLoader with `prefetch_factor`, `pin_memory=True`
CUDA Streams	D overlaps M	Separate streams for H2D transfer and compute
Async Gradient Sync	M (communication) overlaps A	Overlap gradient AllReduce with remaining backward computation as gradient buckets become ready
Double Buffering	D overlaps M	Fill buffer \(i+1\) while computing on buffer \(i\)
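The double-buffering row can be sketched with the Python standard library alone. This is an illustration of the pattern under simplifying assumptions, not how any particular framework implements it: `prefetch`, `load`, and `compute` are hypothetical names, and `time.sleep` stands in for real I/O and real kernels.

```python
import queue
import threading
import time


def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread loads ahead.

    With depth=1 this approximates double buffering: the consumer works on
    one batch while the producer prepares the next.
    """
    buf = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            # Always signal completion so the consumer never waits forever.
            # A producer error therefore ends the stream early instead of
            # being re-raised in the consumer.
            buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # blocks until the next batch is ready
        if item is done:
            return
        yield item


def load(i):
    time.sleep(0.01)  # stand-in for a disk read or host-to-device copy
    return i


def compute(x):
    time.sleep(0.01)  # stand-in for the forward/backward pass
    return x * x


# The generator expression is consumed inside the producer thread, so each
# load(i) call overlaps with compute() on the previous batch.
results = [compute(b) for b in prefetch(load(i) for i in range(5))]
print(results)
```

The bounded queue is the design choice that matters: it lets the producer run ahead by a fixed number of batches without letting memory grow unbounded when the consumer is the slower side.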

Arithmetic Intensity Boundary

This overlap view still leaves one practical question: whether a workload is waiting on bytes or on FLOPs. The boundary between Data (memory bound) and Machine (compute bound) is not arbitrary; it is set by where the workload’s Arithmetic Intensity² (\(I\)) falls relative to the hardware’s ridge point.

² Arithmetic Intensity: The ratio of floating-point operations to bytes transferred (FLOP/byte). It determines whether a workload is memory-bound or compute-bound by comparison against the hardware’s ridge point (\(R_{\text{peak}}/\text{BW}\)).

The symptoms below are screening signals, not diagnoses. A workload whose arithmetic intensity sits left of \(R_{\text{peak}}/\text{BW}\) cannot keep the accelerator fed with enough useful work per byte, but raw utilization must be read alongside memory-bandwidth utilization, data-loader wait, kernel occupancy, and trace gaps.

Low GPU utilization (\(<\) 80 percent): Screen for data, CPU, or launch starvation. Confirm with data-loader wait, host CPU saturation, memory-bandwidth counters, and timeline gaps.
High GPU utilization (\(>\) 95 percent): Treat this as machine bound only if compute units are busy and memory bandwidth is not saturated. If memory bandwidth is saturated, the bottleneck is still on the Data/Machine boundary.
If batch size is 1: Treat this as a warning that latency or launch overhead may matter. For LLM decode, batch-1 execution is often memory-bandwidth bound unless traces show dispatch gaps.
If arithmetic intensity is below the ridge point (about 295 FLOP/byte on an H100 at dense 16-bit Tensor Core peak): The workload is likely memory bound (Data/Machine boundary).
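The ridge-point comparison in the last item can be computed directly. The sketch below assumes an H100-class peak of 989 TFLOP/s (dense 16-bit) and 3.35 TB/s of memory bandwidth, values chosen because they reproduce the ridge point of roughly 295 FLOP/byte cited above; substitute the figures for the actual hardware. The function names are illustrative.

```python
def ridge_point(r_peak_flops, bw_bytes_per_s):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return r_peak_flops / bw_bytes_per_s


def attainable_flops(intensity, r_peak_flops, bw_bytes_per_s):
    """Roofline bound: performance is capped by whichever roof is lower."""
    return min(r_peak_flops, intensity * bw_bytes_per_s)


def classify(intensity, r_peak_flops, bw_bytes_per_s):
    if intensity < ridge_point(r_peak_flops, bw_bytes_per_s):
        return "memory-bound"
    return "compute-bound"


# Assumed hardware figures (see the note above).
R_PEAK = 989e12  # FLOP/s
BW = 3.35e12     # bytes/s

# Batch-1 decode is dominated by matrix-vector products. With 16-bit weights,
# each weight costs about 2 bytes to read and contributes about 2 FLOPs
# (one multiply, one add), so intensity is roughly 1 FLOP/byte.
decode_intensity = 2 / 2

print(f"ridge point: {ridge_point(R_PEAK, BW):.1f} FLOP/byte")
print(f"batch-1 decode: {classify(decode_intensity, R_PEAK, BW)}")
print(f"attainable fraction of peak: "
      f"{attainable_flops(decode_intensity, R_PEAK, BW) / R_PEAK:.2%}")
```

Under these assumptions, batch-1 decode sits more than two orders of magnitude left of the ridge point, which is why it is usually treated as memory-bandwidth bound unless traces show dispatch gaps.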

Bottleneck diagnostic

Once the bottleneck is identified, table 3 shows which optimizations help and which ones are wasted:

Table 3: What Works vs. What Is Wasted: Optimizing the wrong term yields limited improvement and poor cost efficiency. A memory-bound model will not speed up from more peak FLOP/s alone; it needs higher arithmetic intensity, less data movement, or a faster memory subsystem.

If the workload is…	Dominant Term	Optimization That Works	Optimization That Is Wasted
Memory-Bound	\(D_{\text{vol}}/\text{BW}\)	Quantization, pruning, batching, kernel fusion, higher memory bandwidth	More peak FLOP/s alone
Compute-Bound	\(O/(R_{\text{peak}} \cdot \eta_{\text{hw}})\)	Better kernels, Tensor Cores, faster GPU, lower precision	More memory bandwidth (not the binding resource)
Latency-Bound	\(L_{\text{lat}}\)	Batching requests, kernel fusion, async dispatch	More peak FLOP/s or memory bandwidth (overhead dominates)
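The table can be used mechanically once per-term times are available. The sketch below picks the largest iron-law term and looks up the matching row. The name `diagnose` and the 50 percent `mixed_threshold` are assumptions made for illustration; the threshold is only a reminder that a regime with no clear majority term should be treated as a mixed cause.

```python
# Table 3 as a lookup, keyed by regime.
TABLE_3 = {
    "memory-bound": {
        "works": ["quantization", "pruning", "batching", "kernel fusion",
                  "higher memory bandwidth"],
        "wasted": ["more peak FLOP/s alone"],
    },
    "compute-bound": {
        "works": ["better kernels", "Tensor Cores", "faster GPU",
                  "lower precision"],
        "wasted": ["more memory bandwidth"],
    },
    "latency-bound": {
        "works": ["batching requests", "kernel fusion", "async dispatch"],
        "wasted": ["more peak FLOP/s", "more memory bandwidth"],
    },
}


def diagnose(data_s, compute_s, overhead_s, mixed_threshold=0.5):
    """Map measured per-term times to a regime and its Table 3 row."""
    terms = {
        "memory-bound": data_s,
        "compute-bound": compute_s,
        "latency-bound": overhead_s,
    }
    total = sum(terms.values())
    shares = {name: t / total for name, t in terms.items()}
    regime = max(shares, key=shares.get)
    return {
        "regime": regime,
        "share": shares[regime],
        # Flag cases where no single term accounts for most of the time.
        "mixed": shares[regime] < mixed_threshold,
        **TABLE_3[regime],
    }


# Illustrative per-step times in seconds (assumed, not measured).
report = diagnose(data_s=0.012, compute_s=0.003, overhead_s=0.001)
print(report["regime"], f"{report['share']:.0%}", report["works"])
```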

Tooling Map

After the action table narrows the likely fix, the tooling map identifies the measurements needed to verify it. Use table 4 to select the right profiling tool when diagnosing a bottleneck along a particular D·A·M axis:

Table 4: D·A·M Tooling Map: Profiling utilities for diagnosing bottlenecks along each D·A·M axis.

Axis	Key Metric	Primary Tool	Secondary Tool
Data	Batch Load Time	`tqdm` (iterations/sec)	`iotop`, `dstat` (Disk I/O)
Algorithm	FLOPs, Model Depth	PyTorch Profiler	DeepSpeed Flops Profiler
Machine	GPU Utilization, SM Occupancy	`nvidia-smi`	Nsight Compute, Nsight Systems
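Before reaching for a full profiler, the Data row's key metric (batch load time) can be approximated with a thin timing wrapper around the batch iterator. The sketch below is one hedged way to do it; `timed_batches` is a hypothetical helper, and the `clock` parameter exists only so the timer can be swapped out.

```python
import time


def timed_batches(batches, clock=time.perf_counter):
    """Yield (batch, wait_seconds) for each batch.

    wait_seconds is the time the caller spent blocked while the iterator
    produced the batch, which approximates data wait time for that step.
    """
    it = iter(batches)
    while True:
        start = clock()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, clock() - start


def slow_source(n, delay_s):
    """Stand-in for a data loader that takes delay_s to produce each batch."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


data_wait_s = 0.0
loop_start = time.perf_counter()
for batch, wait_s in timed_batches(slow_source(3, 0.01)):
    data_wait_s += wait_s
    time.sleep(0.02)  # stand-in for the training or inference step
total_s = time.perf_counter() - loop_start

print(f"data wait fraction: {data_wait_s / total_s:.1%}")
```

One caveat when adapting this to a GPU workload: kernel launches are typically asynchronous, so host-side wall-clock timing can attribute device time to the wrong phase unless the device is synchronized at the measurement points. A timeline tool such as Nsight Systems avoids that ambiguity.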

D·A·M Scorecard

The efficiency ratios in table 5 grade a system’s performance against its theoretical limit. This “Report Card” is anchored by MFU³—the ratio of achieved model FLOP/s to the hardware’s theoretical peak FLOP/s.

³ MFU (Model FLOPs Utilization): Measures only useful model computation, excluding overhead like gradient synchronization and memory management.

Table 5: The D·A·M Efficiency Rubric: Use these three numbers to characterize single-machine maturity.

Axis	Metric	Definition	Failing Grade	Passing Grade
Data	I/O Overhead	\(\frac{\text{Data Wait Time}}{\text{Total Step Time}}\)	\(>\) 10%	\(<\) 1%
Algorithm	Compression/Sparsity Ratio	\(\frac{\text{Effective or active parameters}}{\text{Dense baseline parameters}}\)	Workload-dependent	Workload-dependent
Machine	MFU	\(\frac{\text{achieved model FLOP/s}}{\text{peak FLOP/s}}\)	\(<\) 30%	\(>\) 50%
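The Data and Machine rows of the rubric reduce to two ratios and two threshold checks. The sketch below applies the thresholds from the table; the function names and the "marginal" label for values that fall between the failing and passing grades are assumptions added for illustration, and the example measurements are invented.

```python
def io_overhead(data_wait_s, step_s):
    """Fraction of the step spent waiting on data."""
    return data_wait_s / step_s


def mfu(model_flops_per_step, step_s, peak_flops):
    """Achieved model FLOP/s divided by the hardware's peak FLOP/s."""
    return (model_flops_per_step / step_s) / peak_flops


def grade_io(overhead):
    if overhead > 0.10:
        return "fail"
    if overhead < 0.01:
        return "pass"
    return "marginal"  # between the rubric's two thresholds


def grade_mfu(value):
    if value < 0.30:
        return "fail"
    if value > 0.50:
        return "pass"
    return "marginal"  # between the rubric's two thresholds


# Illustrative measurements (assumed): a 0.5 s step with 20 ms of data wait,
# 2e14 model FLOPs per step, on hardware with a 989 TFLOP/s peak.
overhead = io_overhead(data_wait_s=0.020, step_s=0.5)
utilization = mfu(model_flops_per_step=2e14, step_s=0.5, peak_flops=989e12)

print(f"I/O overhead: {overhead:.1%} -> {grade_io(overhead)}")
print(f"MFU: {utilization:.1%} -> {grade_mfu(utilization)}")
```

The Algorithm row is omitted because its grades are workload-dependent and cannot be thresholded generically.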

Scaling the Taxonomy: From Node to Fleet

These single-node grades become most useful when they are carried forward into distributed design. The D·A·M taxonomy is the diagnostic baseline for a single machine, but as we move from a single node to the Machine Learning Fleet, each axis undergoes a qualitative transformation. Understanding these “tie-ins” is essential for transitioning from local optimization to fleet-scale engineering.

The evolution of constraints

Table 6 maps each axis from its node-level focus to its fleet-scale transformation:

Table 6: D·A·M Constraints from Node to Fleet: How each axis of the D·A·M taxonomy shifts when scaling from a single node to a multi-node fleet. The node-level focus (left) gives way to a qualitatively different fleet-scale constraint (right)—the bottleneck migrates from local resource to global coordination cost.

Axis	Node-Level Focus (D·A·M)	Fleet-Scale Transformation
Data (D)	I/O Bandwidth (Disk/PCIe)	The communication wall: The bottleneck shifts from local storage to the Bisection Bandwidth of the network fabric.
Algorithm (A)	Model Depth/Ops Count	The parallelism strategy: The logic now includes how we partition the math across \(N\) devices (3D Parallelism).
Machine (M)	Peak TFLOP/s, HBM	The power and reliability wall: The constraint is no longer just silicon speed, but Watts per rack and cluster-wide MTBF.

Bridging to \(C^3\)

While D·A·M diagnoses the components of a single node, the \(C^3\) taxonomy (see The \(C^3\) Taxonomy) diagnoses the interactions of the fleet.

Compute (\(C_1\)) inherits the Algorithm and Machine axes, but adds the loss of scaling efficiency.
Communication (\(C_2\)) inherits the Data axis, but is governed by the speed of light and network topology rather than just local I/O.
Coordination (\(C_3\)) is the “at scale” tie-in that has no single-node equivalent. It represents the coordination tax—the time spent on synchronization, checkpoints, and failure recovery that only emerges at fleet scale.

This progression ensures that single-node efficiency (high MFU) is not bought at the expense of fleet-scale efficiency (high scaling efficiency). We optimize the node to serve the fleet.

Summary

The D·A·M taxonomy provides the diagnostic baseline for every ML systems analysis. By isolating the bottleneck to Data, Algorithm, or Machine, practitioners ensure that optimization efforts target the binding constraint. This single-node discipline is the prerequisite for the fleet-scale engineering addressed throughout this volume.

Key Takeaways: Single-machine diagnostic heuristics

Identify the dominant axis: Decide whether Data, Algorithm, or Machine is the binding constraint before proposing any optimization.
Profile arithmetic intensity: Arithmetic intensity quantitatively distinguishes between Data-bound and Machine-bound regimes.
Use the iron law: The iron law transforms vague symptoms into specific term-based bottlenecks.
Grade with the scorecard: The Data-Algorithm-Machine scorecard standardizes “good” performance before moving to fleet scale.