The D·A·M Taxonomy

Purpose

When an ML system fails, where should you look first: the data path, the algorithm, or the machine?

In production, “it is slow” and “it is wrong” are rarely informative symptoms. A serving stack can miss its latency objective because the accelerator is idle (data starvation), because the model is doing unnecessary work (algorithmic overhead), or because the accelerator is genuinely saturated (machine-bound). Without a taxonomy, teams often optimize the wrong thing, buying faster accelerators to fix a slow input pipeline or rewriting kernels when the model is simply too large for the latency budget. This appendix provides a compact diagnostic framework, Data, Algorithm, and Machine, and shows how to map symptoms and measurements to the term of the iron law that dominates. D·A·M is the first-response checklist before committing to deeper optimization.

How to Use This Appendix

This appendix is designed as a reference. Start with the scorecard-style metrics, form a hypothesis about which axis dominates, and then pick the tool that can confirm (or falsify) that hypothesis. Conventions used here follow the book-wide notation (for example, we reserve \(B\) for batch size and use \(\text{BW}\) for bandwidth).

When training is slow, check accelerator utilization, data wait time, and Model FLOPs Utilization (MFU), then map each to its Data, Algorithm, or Machine axis. When serving misses a Service Level Objective (SLO), identify whether the regime is latency-bound (overhead), memory-bound (weight/KV movement), or compute-bound. When cost is exploding, use the D·A·M rubric to ensure that effort targets the dominant term rather than one that is not the bottleneck.

The Data · Algorithm · Machine (D·A·M) taxonomy is the primary diagnostic framework for ML systems engineering. It formalizes the interdependence between information flow, mathematical logic, and physical execution. When performance stalls or behavior degrades, the diagnostic task is to identify where the flow is blocked. This taxonomy helps practitioners isolate the dominant bottleneck among three collectively exhaustive axes[1], while recognizing that real systems often involve boundary cases where two or more axes interact.

1 MECE (Mutually Exclusive, Collectively Exhaustive): A classification principle from management consulting (popularized by McKinsey) requiring that categories do not overlap and together cover every possibility. D·A·M uses the exhaustive part as a first-pass diagnostic: every bottleneck should be explainable through Data, Algorithm, Machine, or their interactions, even when the cleanest diagnosis names a dominant axis plus a boundary effect.

Diagnostic Summary

The taxonomy maps directly to the iron law of ML systems established in Iron Law of ML Systems. Table 1 summarizes the role, primary physical constraint, and core optimization pathway for each axis.

Table 1: D·A·M Axis Reference: Each axis maps to a distinct physical constraint and a high-leverage optimization strategy. Start diagnosis here: identify which constraint is binding, then follow the optimization pointer to the relevant chapter.
Axis | Role | Physical Constraint | High-Leverage Optimization
Data (D) | Information (The Fuel) | Bandwidth (\(\text{BW}\)) | Data Selection (Data Selection)
Algorithm (A) | Logic (The Blueprint) | Operations (\(O\)) | Model Compression (Model Compression)
Machine (M) | Physics (The Engine) | Throughput (\(R_{\text{peak}}\)) | Hardware Acceleration (Hardware Acceleration)

This clean separation is useful as a first diagnostic step, but production systems rarely suffer from a single pure-axis bottleneck. More often, the problem sits at the boundary between two axes—a data format choice that determines whether the GPU can be saturated, or a pruning strategy that changes the memory access pattern. To handle these cases, we need to map the intersections.

Intersection Landscape

Real systems engineering lives at the boundaries between axes. Figure 1 maps the intersection landscape: what concepts and techniques emerge when two or three axes overlap.

Figure 1: The D·A·M Intersection Landscape: Each circle represents a pure domain: Data (information), Algorithm (logic), and Machine (physics). The pairwise intersections capture techniques that require reasoning about two domains simultaneously. The center—where all three converge—is ML Systems Engineering itself: the discipline of balancing data flow, algorithmic complexity, and hardware constraints within a single system.

Table 2 provides a scannable reference for each zone.

In the D·A·M acronym, Data, Algorithm, and Machine are taxonomy axes. They are not mathematical variables; formal quantities still follow the notation chapter, where \(D\) denotes dataset size or training tokens.

Table 2: D·A·M Intersection Reference: Each zone maps specific techniques to the axes they span and the chapters that cover them. The pairwise intersections require reasoning about two domains simultaneously; the center requires all three.
Zone | Name | Key Techniques | Book Coverage
Data (pure) | Information | Storage formats, data quality, distributions | Data Engineering
Algorithm (pure) | Logic | Loss functions, architectures, gradients | Neural Computation, Network Architectures
\(\mathsf{Data} \cap \mathsf{Algorithm}\) | What to Learn From | Data selection, curriculum learning, compute-optimal scaling | Data Selection, Model Training
\(\mathsf{Data} \cap \mathsf{Machine}\) | How to Move Information | I/O bandwidth, prefetching, data formats | Data Engineering, Hardware Acceleration
\(\mathsf{Algorithm} \cap \mathsf{Machine}\) | How to Execute Efficiently | Quantization, pruning, kernel fusion, mixed precision | ML Frameworks, Model Compression
Machine (pure) | Physics | Silicon, memory hierarchy, peak FLOP/s | Hardware Acceleration
\(\mathsf{Data} \cap \mathsf{Algorithm} \cap \mathsf{Machine}\) | ML Systems Engineering | Iron law, Roofline, training loops, serving | Model Training, Model Serving, Benchmarking

The pure zones contain concepts that belong entirely to one axis: storage formats and distribution properties are purely Data concerns, loss functions and gradient computations are purely Algorithm, and silicon physics and peak FLOP/s are purely Machine. These are the topics where single-domain expertise suffices.

The pairwise intersections are where systems thinking begins. \(\mathsf{Data} \cap \mathsf{Algorithm}\) (What to Learn From) encompasses data selection, curriculum learning, active learning, and scaling laws like Chinchilla (\(D \approx 20P\))—all requiring joint reasoning about information content and algorithmic capacity. Adding data without considering whether the model can learn from it wastes compute; choosing architectures without considering data availability wastes engineering time. \(\mathsf{Data} \cap \mathsf{Machine}\) (How to Move Information) covers I/O bandwidth, prefetching strategies, data formats, and the energy-movement invariant. This intersection is where data gravity manifests: the physical cost of moving bytes through the memory hierarchy determines whether the machine can be fed fast enough. \(\mathsf{Algorithm} \cap \mathsf{Machine}\) (How to Execute Efficiently) spans quantization, pruning, kernel fusion, mixed precision, and computational graph optimization. A pruning strategy that reduces FLOPs but destroys memory access patterns can slow down execution on real hardware.

The center—\(\mathsf{Data} \cap \mathsf{Algorithm} \cap \mathsf{Machine}\)—is where all three axes converge. The iron law, the Roofline Model, end-to-end training loops, serving pipelines, and holistic benchmarking all require simultaneous reasoning about data flow, algorithmic complexity, and hardware utilization. This center is not a single technique; it is the discipline itself.

Understanding the landscape reveals where a technique lives. The next step is quantifying which axis dominates for a given workload—and for that, we need the iron law.

Iron Law Mapping

The performance of any ML task is governed by the distribution of work across the D·A·M axes. The iron law mapping reveals which component’s variables dominate the execution time: \[ T = \underbrace{ \frac{D_{\text{vol}}}{\text{BW}} }_{\text{Data (D)}} + \underbrace{ \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} }_{\text{Algorithm (A) / Machine (M)}} + \underbrace{ L_{\text{lat}} }_{\text{Overhead}} \]

Algorithm and Machine share the compute term, separated by which variable the engineer controls. Reducing the total operations (\(O\)) is an Algorithm lever, while improving the hardware’s peak throughput (\(R_{\text{peak}}\)) or utilization (\(\eta_{\text{hw}}\)) is a Machine lever.

This equation transforms performance debugging from a qualitative guessing game into a quantitative engineering problem. Every bottleneck hides in one of these terms. A slow system is one that is moving too much data (\(D_{\text{vol}}\)), lacking bandwidth (\(\text{BW}\)), executing too many operations (\(O\)), or failing to use the hardware’s peak capability (\(\eta_{\text{hw}}\)). The levers below map specific optimizations to the variable they improve.

Component levers

  • Data Lever: Reducing the volume of data (\(D_{\text{vol}}\)) through deduplication or curriculum learning, or increasing I/O bandwidth (\(\text{BW}\)).
  • Algorithm Lever: Reducing total arithmetic operations (\(O\)) through pruning, quantization, or architectural refinement.
  • Machine Lever: Increasing the denominator of the compute term by improving peak throughput (\(R_{\text{peak}}\)) or increasing the utilization factor (\(\eta_{\text{hw}}\)) via kernel fusion.
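The iron law can be evaluated directly once the six quantities are estimated. The following is a minimal sketch, not a profiler: the `Workload` class, the function names, and the example numbers are all illustrative assumptions, chosen only to show how the three terms are computed and compared.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Iron law inputs (all values here are hypothetical estimates)."""
    data_bytes: float   # D_vol: bytes that must be moved
    bandwidth: float    # BW: bytes per second
    ops: float          # O: floating-point operations
    peak_flops: float   # R_peak: FLOP/s
    utilization: float  # eta_hw: fraction of peak actually achieved, in (0, 1]
    overhead_s: float   # L_lat: fixed overhead in seconds


def iron_law_terms(w: Workload) -> dict:
    """Split T into its Data, compute (Algorithm/Machine), and overhead terms."""
    return {
        "data": w.data_bytes / w.bandwidth,
        "compute": w.ops / (w.peak_flops * w.utilization),
        "overhead": w.overhead_s,
    }


def dominant_term(w: Workload) -> str:
    """Name of the largest term, which is where optimization effort should go."""
    terms = iron_law_terms(w)
    return max(terms, key=terms.get)


# Illustrative numbers only: 2 GB moved at 1 TB/s, 1 TFLOP of work on a
# 100 TFLOP/s device running at 50 percent utilization, 100 microseconds overhead.
example = Workload(
    data_bytes=2e9, bandwidth=1e12,
    ops=1e12, peak_flops=1e14, utilization=0.5,
    overhead_s=1e-4,
)
terms = iron_law_terms(example)
total = sum(terms.values())
for name, seconds in terms.items():
    print(f"{name:>8}: {seconds * 1e3:7.3f} ms ({seconds / total:5.1%} of T)")
print("dominant:", dominant_term(example))
```

In this example the compute term dominates, so the Algorithm lever (fewer operations) or the Machine lever (higher \(R_{\text{peak}}\) or \(\eta_{\text{hw}}\)) would be the place to start; the Data lever would change the total only slightly.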

D·A·M coordination: From sum to max

The additive iron law represents sequential execution—the worst case where Data, Algorithm, and Machine take turns. Skilled systems engineering, however, transforms the sum into a max: \[ T_{\text{sequential}} = \frac{D_{\text{vol}}}{\text{BW}} + \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}} + L_{\text{lat}} \quad \xrightarrow{\text{overlap}} \quad T_{\text{pipelined}} = \max\left(\frac{D_{\text{vol}}}{\text{BW}}, \frac{O}{R_{\text{peak}} \cdot \eta_{\text{hw}}}\right) + L_{\text{lat}} \]

The systems engineer’s job is to make these components run in parallel, not in series. Table 3 summarizes key D·A·M Coordination techniques:

Table 3: D·A·M Overlap Techniques: Each technique allows one D·A·M axis to execute while another is in flight, converting the iron law’s additive terms into overlapped terms. The payoff is transforming \(T = a + b\) into \(T = \max(a, b)\), which can cut latency nearly in half when the terms are balanced.
Technique | D·A·M Axes Overlapped | Implementation
Prefetching | D overlaps M | DataLoader with prefetch_factor, pin_memory=True
CUDA Streams | D overlaps M | Separate streams for H2D transfer and compute
Async Gradient Sync | M (communication) overlaps A | Overlap bucketed AllReduce with remaining backward computation
Double Buffering | D overlaps M | Fill buffer N+1 while computing on buffer N
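The prefetching and double-buffering rows share one mechanism: a producer fills the next buffer while the consumer works on the current one. The sketch below shows that mechanism with only the Python standard library. The `prefetch` and `load_batches` names are hypothetical, and this is not how any particular framework's data loader is implemented.

```python
import queue
import threading
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def prefetch(source: Iterable, depth: int = 1) -> Iterator:
    """Yield items from `source` while a background thread loads ahead.

    With depth=1 the queue holds one ready batch while the consumer works on
    the previous one, which is the double-buffering pattern. Error handling is
    omitted: an exception in the producer would leave the consumer waiting.
    """
    buffer: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        for item in source:
            buffer.put(item)  # blocks when the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()   # blocks only if the data path has fallen behind
        if item is _DONE:
            return
        yield item


def load_batches(count: int):
    """Hypothetical stand-in for a slow data source."""
    for i in range(count):
        yield [i] * 4


results = [sum(batch) for batch in prefetch(load_batches(5), depth=1)]
print(results)
```

Because of Python's global interpreter lock, a thread-based prefetcher like this one overlaps I/O waits rather than CPU-heavy decoding; CPU-bound preprocessing generally needs separate worker processes, which is the role of data loader workers.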

Overlap only helps when the D·A·M axes are reasonably balanced. If one term dominates (for example, severely memory bound), overlapping the smaller term with the larger yields negligible gain—the max is still dominated by the same bottleneck. Overlap provides the greatest benefit when \(D_{\text{vol}}/\text{BW} \approx O/(R_{\text{peak}} \cdot \eta_{\text{hw}})\). The latency term is the important exception.

Systems Perspective 1.1: The overhead that cannot hide
The latency term \(L_{\text{lat}}\) (kernel launch, synchronization barriers, Python dispatch) typically cannot be overlapped: it represents serialization points where all components must wait. This is why kernel fusion is so powerful: it shrinks \(L_{\text{lat}}\) directly by combining operations into fewer launches, rather than by speeding up any single component.
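A few lines of arithmetic make the sum-to-max transformation and the unhidden latency term concrete. This is a sketch under stated assumptions: the function names are hypothetical, the millisecond values are invented for illustration, and the 10-microsecond per-launch cost is a placeholder rather than a measured figure.

```python
def sequential_time(data_s: float, compute_s: float, overhead_s: float = 0.0) -> float:
    """Additive iron law: the terms take turns."""
    return data_s + compute_s + overhead_s


def pipelined_time(data_s: float, compute_s: float, overhead_s: float = 0.0) -> float:
    """Overlapped execution: data and compute hide each other, overhead does not hide."""
    return max(data_s, compute_s) + overhead_s


def overlap_speedup(data_s: float, compute_s: float, overhead_s: float = 0.0) -> float:
    return sequential_time(data_s, compute_s, overhead_s) / pipelined_time(
        data_s, compute_s, overhead_s
    )


def launch_overhead(num_kernels: int, per_launch_s: float) -> float:
    """Serialized launch cost; fusion lowers num_kernels."""
    return num_kernels * per_launch_s


balanced = overlap_speedup(10e-3, 10e-3)             # equal terms
skewed = overlap_speedup(1e-3, 19e-3)                # one term dominates
with_overhead = overlap_speedup(10e-3, 10e-3, 5e-3)  # overhead caps the gain

# Assumed 10 microseconds per launch, purely for illustration.
unfused = launch_overhead(200, 10e-6)
fused = launch_overhead(20, 10e-6)

print(f"balanced speedup:      {balanced:.2f}x")
print(f"skewed speedup:        {skewed:.2f}x")
print(f"with 5 ms overhead:    {with_overhead:.2f}x")
print(f"launch overhead saved: {(unfused - fused) * 1e3:.2f} ms")
```

The balanced case approaches the factor-of-two ceiling, the skewed case gains almost nothing, and the fixed overhead lowers the achievable speedup no matter how well the other two terms overlap.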

The iron law tells the engineer how much time each axis consumes. One critical question remains: when the bottleneck sits at the boundary between Data and Machine, which side is binding? The answer lies in a single ratio.

Arithmetic Intensity Boundary

The boundary between Data (memory bound) and Machine (compute bound) is not arbitrary; it is defined mathematically by the arithmetic intensity[2] (\(I\)) of the workload.

2 Arithmetic Intensity: The ratio of floating-point operations to bytes transferred (FLOP/byte), introduced by Williams et al. (2009) as the key parameter in the Roofline Model. It determines whether a workload is memory bound or compute bound by comparison against the hardware’s ridge point (\(R_{\text{peak}}/\text{BW}\)).

Williams, Samuel, Andrew Waterman, and David Patterson. 2009. “Roofline: An Insightful Visual Performance Model for Multicore Architectures.” Communications of the ACM 52 (4): 65–76. https://doi.org/10.1145/1498765.1498785.

The Roofline Model discussion gives rigorous definitions of arithmetic intensity and the ridge point. Use that model to distinguish quantitatively between Data and Machine bottlenecks before applying the optimizations below.
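The comparison against the ridge point \(R_{\text{peak}}/\text{BW}\) takes only a few lines. The sketch below assumes a hypothetical accelerator with 300 TFLOP/s peak throughput and 2 TB/s memory bandwidth. These are round placeholder numbers, not the specification of any product, and the byte counts are idealized in that every operand is assumed to move exactly once.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOP/byte at which the memory and compute roofs meet."""
    return peak_flops / bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)


def classify(intensity: float, peak_flops: float, bandwidth: float) -> str:
    if intensity < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical accelerator (placeholder numbers).
PEAK = 300e12  # FLOP/s
BW = 2e12      # bytes/s

# Matrix-vector product with 16-bit weights: 2 FLOPs and 2 bytes per weight.
m, n = 4096, 4096
gemv_intensity = arithmetic_intensity(2 * m * n, 2 * m * n)

# Square matrix-matrix product with 16-bit operands: 2n^3 FLOPs over three
# n-by-n matrices of 2 bytes per element.
gemm_intensity = arithmetic_intensity(2 * n**3, 3 * n * n * 2)

for name, intensity in [("GEMV", gemv_intensity), ("GEMM", gemm_intensity)]:
    bound = attainable_flops(intensity, PEAK, BW)
    print(f"{name}: I = {intensity:8.1f} FLOP/byte, {classify(intensity, PEAK, BW)}, "
          f"at most {bound / PEAK:.1%} of peak")
```

The matrix-vector case sits far to the left of the ridge point, which is why batch-size-one inference tends to be limited by weight movement rather than arithmetic.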

The Roofline Model provides exact answers when there is time to profile. In the middle of a production incident, however, a faster heuristic is needed—a set of quick thresholds that points to the right axis within seconds.

Rules of Thumb

In the heat of a production outage, there is rarely time to solve the full iron law equation. Veteran systems engineers instead rely on these quantitative heuristics to quickly narrow the search space; the thresholds below serve as a first line of defense.

  • Low accelerator utilization (\(<\) 80 percent): The workload is likely data bound (or CPU bound). The accelerator is starving.
  • High accelerator utilization (\(>\) 95 percent): The workload is likely machine bound. The accelerator is fully saturated.
  • If batch size is one: The workload is likely latency bound (algorithm overhead dominates).
  • Low arithmetic intensity (\(<\) 100 FLOP/byte): The workload is likely memory bound (Data/Machine boundary). This threshold is approximate for current-generation accelerators; compute the hardware’s specific ridge point (\(R_{\text{peak}}/\text{BW}\)) for a precise boundary.
  • If the system works in dev but fails in prod: Suspect data drift (Data component).

Common industry labels map to D·A·M components as follows: memory bound typically indicates a Data bottleneck (information cannot reach the accelerator fast enough), compute bound indicates a Machine bottleneck (the accelerator is fully saturated), and latency bound indicates an Algorithm bottleneck (serial operation depth or overhead dominates).
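The thresholds above can be collected into a small triage helper. This is a sketch of the heuristics exactly as listed, not a replacement for profiling: the `triage` function and its labels are hypothetical, and the default ridge value of 100 FLOP/byte is the approximate threshold from the list, which should be replaced by the hardware's own ridge point when it is known.

```python
from typing import List, Optional


def triage(
    utilization: float,
    batch_size: int,
    intensity: Optional[float] = None,
    ridge: float = 100.0,
) -> List[str]:
    """Return first-pass suspects; an empty list means the heuristics are silent.

    utilization is accelerator utilization as a fraction in [0, 1].
    intensity is arithmetic intensity in FLOP/byte, if it has been estimated.
    """
    suspects: List[str] = []
    if batch_size == 1:
        suspects.append("latency")  # Algorithm: overhead or serial depth dominates
    if utilization < 0.80:
        suspects.append("data")     # Data (or CPU): the accelerator is starving
    elif utilization > 0.95:
        suspects.append("machine")  # Machine: the accelerator is saturated
    if intensity is not None and intensity < ridge:
        suspects.append("memory")   # Data/Machine boundary
    return suspects


print(triage(utilization=0.30, batch_size=1))                   # serving at batch size one
print(triage(utilization=0.99, batch_size=64, intensity=500))   # saturated training job
print(triage(utilization=0.90, batch_size=32))                  # heuristics are silent
```

When more than one label comes back, the heuristics disagree about the dominant axis, and that is the signal to reach for a profiler rather than to act on either label alone.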

Bottleneck diagnostic

Once the bottleneck is identified, Table 4 shows which optimizations help and which ones are wasted:

Table 4: What Works vs. What Is Wasted: Optimizing a non-binding term yields little or no improvement. A memory-bound large language model (LLM) will not speed up from a faster accelerator; the accelerator will simply idle faster while waiting for memory.
If the workload is… | Dominant Term | Optimization That Works | Optimization That Is Wasted
Memory-Bound | \(D_{\text{vol}}/\text{BW}\) | Quantization, pruning, batching, kernel fusion | Faster accelerator (more FLOP/s will not help)
Compute-Bound | \(O/(R_{\text{peak}} \cdot \eta_{\text{hw}})\) | Better kernels, Tensor Cores, faster accelerator, lower precision | More memory bandwidth (not the binding term)
Latency-Bound | \(L_{\text{lat}}\) | Batching requests, kernel fusion, async dispatch | More compute or more bandwidth (overhead dominates)

Knowing what works also means recognizing what does not. In practice, teams under deadline pressure repeatedly fall into the same traps—optimizing the wrong axis with confidence. These failure modes are common enough to deserve their own names.

Anti-Patterns

Diagnosing systems is often a process of elimination. Before committing to complex kernel optimizations, watch for these common traps that waste engineering cycles.

  • The hardware crutch: Buying faster accelerators (Machine) to fix a slow Python data loader (Data). The new hardware will just idle faster.
  • The model twiddle: Changing neural architectures (Algorithm) when the bottleneck is actually network bandwidth or disk I/O.
  • The premature optimizer: Writing custom CUDA kernels (Machine) before verifying if the Algorithm is simply doing too many unnecessary operations.

Each anti-pattern follows the same root cause: acting before diagnosing. The following case studies show what proper diagnosis looks like—starting from a confusing symptom and systematically narrowing to the dominant D·A·M axis.

D·A·M Case Studies

Theoretical constraints often manifest as confusing symptoms in production. These real-world scenarios illustrate how to apply the taxonomy. Each case follows the same three diagnostic moves: symptom, diagnosis, and fix.

Case 1: The starving accelerator (Data)

A team provisions a large A100 GPU instance to speed up training, but training time hardly improves and nvidia-smi shows GPU utilization fluctuating between 10 percent and 40 percent. The accelerator is not the binding resource: the data path cannot supply the machine fast enough, so the workload is I/O bound even though the visible symptom is slow training. The useful intervention is upstream of the model: optimize the extract, transform, load (ETL) path by moving from raw JPEGs with heavy CPU decoding to TFRecords or WebDataset with sequential reads, increasing data loader parallelism, and prefetching batches into accelerator memory.

Case 2: The latency cliff (Algorithm)

A real-time recommendation service fails to meet a 20 ms latency service-level agreement (SLA), while accelerator utilization is low and the batch size is one. That signal does not point to a saturated chip. It points to an algorithm that is too deep for the sequential deadline. Adding more hardware does not remove the serial layer path that dominates latency, so the effective fixes change the algorithmic work itself: quantization reduces memory fetch time by moving to INT8 where accuracy allows, pruning removes redundant heads or channels, and knowledge distillation trains a smaller student model to approximate the larger model’s behavior.

Case 3: The compute wall (Machine)

Accelerator utilization is pinned at 99 percent, memory bandwidth remains unsaturated, and training is stable but takes three weeks. Here the data path is doing its job: the system has kept the accelerator fed, and the workload is compute bound. The next step has to change the machine term in the iron law. The team can scale up from an A100 to an H100-class accelerator, scale out by distributing training across multiple accelerators through data parallelism, or lower precision from FP32/TF32 to BF16 where numerically safe. On NVIDIA Tensor Core paths, BF16 peak throughput is typically about 2\(\times\) TF32 peak, with realized speedup depending on kernels and bottlenecks.

Taken together, the three cases show why the same utilization number can imply different next steps depending on loss behavior, batch size, and memory pressure.

Checkpoint 1.1: D·A·M diagnosis check
  1. A training job shows 95 percent accelerator utilization but loss has plateaued for two epochs. Which D·A·M axis should you investigate, and why?
  2. Your colleague suggests adding more data loader workers to a job where nvidia-smi shows 98 percent GPU utilization. Using the iron law, explain why this will not help.
  3. An inference server meets its latency SLO at batch size 1 but fails at batch size 16. Which term in the iron law changed, and what does this tell you about the bottleneck regime?

These three cases illustrate clean, single-axis bottlenecks. Production incidents are rarely so tidy—symptoms often overlap, and the dominant axis can shift during debugging. The next section provides a systematic troubleshooting matrix for the messier scenarios encountered in practice.

Production Troubleshooting

Identifying the root cause of performance bottlenecks requires systematic elimination. Table 5 provides a diagnostic matrix for common failure modes observed in production deployments.

Table 5: D·A·M Diagnostic Matrix: Root cause identification and remediation strategies for common ML systems failures. Each row connects a user-visible symptom to the D·A·M axis most likely responsible, reducing the search space before a profiler is needed.
Symptom | Likely D·A·M Culprit | Diagnostic Question | Recommended Action
Low Accelerator Utilization | Data | Is the data loader keeping up with the accelerator? | Implement prefetching and use binary formats.
High Latency (P99) | Algorithm | Is the model depth or width exceeding the latency budget? | Apply quantization (INT8) or structured pruning.
High Training Cost | Machine | Is the hardware utilization (\(\eta_{\text{hw}}\)) below 30%? | Optimize CUDA kernels or use spot instances.
Silent Accuracy Drift | Data | Has the statistical distribution (\(P_t\)) shifted from \(P_0\)? | Trigger retraining and update active learning filters.
Out-of-Memory (OOM) | Algorithm/Machine | Does the model state fit in available VRAM? | Use gradient checkpointing or reduce batch size.

The diagnostic matrix indicates what to suspect. The next question is how to confirm that suspicion with evidence—which requires the right profiling tools.

Tooling Map

Once a hypothesis exists (for example, “the workload appears Machine-bound”), evidence is needed to confirm it. Abstract concepts must be measured with concrete utilities. Table 6 connects the theoretical components to the specific Linux and Python profiling tools that confirm them.

Table 6: D·A·M Tooling Map: Profiling utilities for diagnosing bottlenecks along each D·A·M axis. Start with the primary tool for quick triage; use secondary tools for deep-dive analysis when the primary tool’s output is inconclusive.
Axis | Key Metric | Primary Tool | Secondary Tool
Data | Batch Load Time | tqdm (iterations/sec) | iotop, dstat (Disk I/O)
Algorithm | FLOPs, Model Depth | PyTorch Profiler | DeepSpeed Flops Profiler
Machine | Accelerator Utilization, SM Occupancy | nvidia-smi | Nsight Compute, Nsight Systems
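Before reaching for a full profiler, batch load time can be separated from step time with a plain timing loop. The sketch below is one possible harness, with hypothetical names (`measure_io_overhead`, `slow_loader`, `train_step`) and sleep calls standing in for real work. On an accelerator, kernel launches are asynchronous, so the step function must block until the device has finished, or the compute time will be undercounted.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_io_overhead(
    batches: Iterable,
    step_fn: Callable[[object], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, float]:
    """Return (data_wait_s, compute_s, data_wait / total) for one pass."""
    data_wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t_start = clock()
        try:
            batch = next(iterator)  # time spent blocked on the data path
        except StopIteration:
            break
        t_loaded = clock()
        step_fn(batch)              # must not return before the step completes
        t_done = clock()
        data_wait += t_loaded - t_start
        compute += t_done - t_loaded
    total = data_wait + compute
    fraction = data_wait / total if total > 0 else 0.0
    return data_wait, compute, fraction


def slow_loader(count: int):
    """Hypothetical loader that takes 20 ms per batch."""
    for i in range(count):
        time.sleep(0.02)
        yield i


def train_step(batch) -> None:
    """Hypothetical step that takes 10 ms."""
    time.sleep(0.01)


wait_s, compute_s, io_fraction = measure_io_overhead(slow_loader(3), train_step)
print(f"data wait {wait_s * 1e3:.0f} ms, compute {compute_s * 1e3:.0f} ms, "
      f"I/O overhead {io_fraction:.0%}")
```

The injectable `clock` argument is a design choice that keeps the harness testable with a fake clock; in normal use the default `time.perf_counter` is sufficient.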

Profiling tools generate raw numbers—utilization percentages, FLOP counts, bandwidth measurements. Raw numbers, however, only become actionable when compared against a standard. The D·A·M Scorecard provides that standard: a set of efficiency thresholds that distinguish healthy systems from those that need intervention.

D·A·M Scorecard

To move beyond qualitative guessing, the efficiency ratios in Table 7 grade a system’s performance against its theoretical limit. This report-card view standardizes what “good” looks like, anchored by MFU[3], the single most important metric for large-scale training.

3 MFU (Model FLOPs Utilization): The ratio of achieved model FLOP/s to the hardware’s theoretical peak FLOP/s, introduced in the PaLM paper (Chowdhery et al. 2022). Unlike raw accelerator utilization (which counts any work the accelerator performs), MFU measures model computation rate relative to peak hardware throughput, so non-model work and system overhead are not counted as achieved model FLOPs. Benchmarking covers MFU in depth.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv Preprint arXiv:2204.02311.

Table 7: The D·A·M Efficiency Rubric: These three numbers characterize any ML system’s maturity. A system that passes all three thresholds has exhausted its easy optimizations; further gains require architectural changes or hardware upgrades.
Axis | Metric | Definition | Failing Grade | Passing Grade
Data | I/O Overhead | \(\frac{\text{Data Wait Time}}{\text{Total Step Time}}\) | \(>\) 10% | \(<\) 1%
Algorithm | Active Params | \(\frac{\text{Nonzero Params}}{\text{Total Params}}\) | 100% (Dense) | \(<\) 50% (Sparse)
Machine | MFU | \(\frac{\text{Achieved model FLOP/s}}{\text{Peak FLOP/s}}\) | \(<\) 30% | \(>\) 50%
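The rubric translates directly into a grading function. In the sketch below the function names are hypothetical, and the "marginal" label for values that fall between the failing and passing thresholds is an assumption added for illustration; the rubric itself names only the two ends.

```python
def mfu(achieved_model_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over peak FLOP/s."""
    return achieved_model_flops_per_s / peak_flops_per_s


def grade_io_overhead(fraction: float) -> str:
    if fraction < 0.01:
        return "pass"
    if fraction > 0.10:
        return "fail"
    return "marginal"  # assumed label for the gap between thresholds


def grade_active_params(fraction: float) -> str:
    if fraction < 0.50:
        return "pass"
    if fraction >= 1.0:
        return "fail"  # fully dense
    return "marginal"


def grade_mfu(value: float) -> str:
    if value > 0.50:
        return "pass"
    if value < 0.30:
        return "fail"
    return "marginal"


def scorecard(io_overhead: float, active_params: float, mfu_value: float) -> dict:
    return {
        "Data": grade_io_overhead(io_overhead),
        "Algorithm": grade_active_params(active_params),
        "Machine": grade_mfu(mfu_value),
    }


# Illustrative inputs: 25 percent of step time spent waiting on data, a dense
# model, and 60 TFLOP/s achieved on a device with a 300 TFLOP/s peak.
report = scorecard(
    io_overhead=0.25,
    active_params=1.0,
    mfu_value=mfu(60e12, 300e12),
)
print(report)
```

A report in which Data fails suggests fixing the input path first, since a starved accelerator will also depress MFU and make the Machine grade look worse than the hardware deserves.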

The Scorecard and the Roofline Model both answer efficiency questions, but at different scales. The Scorecard grades the current system against known thresholds. Scaling laws and the information roofline address a more strategic question: what happens as the system scales beyond its current size?

Scaling Laws vs. Roofline

Systems engineering requires distinguishing between growth trajectories and fundamental limits.

Scaling laws (the journey)

Scaling laws[4] are empirical power laws that predict how fast model performance improves as we increase resources. The two landmark results are Kaplan Scaling (Kaplan et al. 2020), which showed that performance improves predictably with parameter count (\(P\)), data (\(D\)), and total operations (\(O\)), and Chinchilla Scaling (Hoffmann et al. 2022), which refined this insight by defining the optimal ratio of these resources (for example, \(D \approx 20P\), or roughly 20 training tokens per parameter).

4 Scaling Laws: Empirical relationships, typically power laws of the form \(\mathcal{L}(x) \propto x^{-\alpha}\), that predict model loss as a function of dataset size, parameter count, or compute budget. Kaplan et al. (2020) studied these relationships for neural language models at OpenAI; Hoffmann et al. (2022) later refined the compute-optimal training trade-off with the Chinchilla result. Model Training discusses scaling laws in detail.

Kaplan, J., S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. 2020. “Scaling Laws for Neural Language Models.” arXiv Preprint arXiv:2001.08361.
Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” Advances in Neural Information Processing Systems 35: 30016–30. https://doi.org/10.52202/068431-2176.

As economic guides, these laws summarize the trade-off this way: “If I double my compute budget, my error rate should drop by \(X\) percent.” They assume the information is there to be learned.
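The Chinchilla ratio turns a parameter count into a token budget, and back again from a compute budget. The sketch below uses \(D \approx 20P\) from the text together with the widely used approximation \(O \approx 6PD\) for dense transformer training. That approximation is an assumption brought in for illustration and is not stated in this appendix; the function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20        # D ~ 20P, the Chinchilla rule of thumb
FLOPS_PER_PARAM_TOKEN = 6    # assumed O ~ 6 * P * D approximation


def compute_optimal_tokens(params: float) -> float:
    """Training tokens D for a model with P parameters."""
    return TOKENS_PER_PARAM * params


def training_flops(params: float, tokens: float) -> float:
    """Approximate total training operations O."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def split_compute_budget(total_flops: float) -> tuple:
    """Solve O = 6 * P * (20 * P) for P, then derive D."""
    params = math.sqrt(total_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, compute_optimal_tokens(params)


params = 70e9
tokens = compute_optimal_tokens(params)
budget = training_flops(params, tokens)
print(f"P = {params:.2e} params -> D = {tokens:.2e} tokens, O = {budget:.2e} FLOPs")

recovered_params, recovered_tokens = split_compute_budget(budget)
print(f"budget {budget:.2e} FLOPs -> P = {recovered_params:.2e}, D = {recovered_tokens:.2e}")
```

Under these assumptions, parameters and tokens each grow with the square root of the compute budget, so quadrupling the budget doubles both.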

Information roofline (the destination)

The information roofline is the theoretical limit of what can be learned from the data, regardless of scale. Three quantities define it: the ceiling is the Bayes Error Rate[5] (the irreducible error inherent in the data); the slope is the information density, or signal-to-noise ratio, of the training distribution; and the bottleneck appears when data has low information density, as with noisy financial tickers, causing the system to hit the “Data Quality Wall” long before the compute wall.

5 Bayes Error Rate: The lowest achievable error rate for any classifier on a given data distribution, determined by the overlap between class-conditional distributions (Goodfellow et al. 2016). Named after Thomas Bayes (1701–1761). No amount of data, parameters, or compute can reduce error below this theoretical floor.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

The diagnostic lesson is this: Scaling laws predict the slope of improvement, while the information roofline predicts the ceiling. If a loss curve flattens before the scaling law prediction, the system has hit the information roofline. Adding more accelerators (Machine) or parameters (Algorithm) at this point is futile; Data quality is the only lever left. This distinction closes the loop on the D·A·M taxonomy. Whether the task is debugging a single training step (iron law), evaluating hardware utilization (Roofline), or planning a multi-million-dollar scaling campaign (scaling laws), the diagnostic question is always the same: which axis dominates, and what lever moves it?
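The gap between slope and ceiling can be illustrated with a toy loss curve. The sketch below adds a constant irreducible floor to a power law, a form similar in spirit to the parametric fits used in the scaling-law literature. The coefficients, the floor, and the function names are invented for illustration and do not describe any real model or dataset.

```python
def power_law_loss(x: float, a: float, alpha: float) -> float:
    """Reducible loss predicted by a pure power law in resource x."""
    return a * x ** (-alpha)


def floored_loss(x: float, a: float, alpha: float, floor: float) -> float:
    """Power law plus an irreducible floor (the information roofline)."""
    return floor + power_law_loss(x, a, alpha)


def relative_gain_from_doubling(x: float, a: float, alpha: float, floor: float) -> float:
    """Fractional drop in total loss when resource x is doubled."""
    before = floored_loss(x, a, alpha, floor)
    after = floored_loss(2 * x, a, alpha, floor)
    return (before - after) / before


# Hypothetical curve: a = 10, alpha = 0.5, irreducible floor of 1.0.
A, ALPHA, FLOOR = 10.0, 0.5, 1.0

for scale in (1, 100, 10_000, 1_000_000):
    loss = floored_loss(scale, A, ALPHA, FLOOR)
    gain = relative_gain_from_doubling(scale, A, ALPHA, FLOOR)
    print(f"x = {scale:>9,}: loss = {loss:7.4f}, doubling x improves loss by {gain:6.2%}")
```

Far from the floor, doubling the resource buys a large improvement; near the floor, the same doubling barely moves the total loss, which is the flattening that signals a Data quality limit rather than a Machine or Algorithm limit.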

Summary

The D·A·M taxonomy provides a systematic framework for diagnosing ML systems bottlenecks. Each axis maps to a distinct physical constraint: Data is bounded by bandwidth, Algorithm by total operations, and Machine by peak throughput. The iron law quantifies these constraints, enabling systematic diagnosis. Use arithmetic intensity to determine the Data/Machine boundary, and the D·A·M Scorecard to evaluate system maturity. In practice, this sequence turns diagnosis into a short set of first questions.

Key Takeaways: Where to look first
  • Every bottleneck lives in one of three places: Data, Algorithm, or Machine. Identify the dominant axis before optimizing.
  • Profile arithmetic intensity before optimizing: Arithmetic intensity determines whether the regime is Data-bound or Machine-bound.
  • Diagnose the binding axis first: Optimizing a non-binding term yields little or no improvement.
  • Grade with the scorecard: Check for I/O Overhead < 1 percent, Active Params < 50 percent, and MFU > 50 percent before investing in architectural changes or hardware upgrades.