Starving the GPU

Your GPU can process 5,300 images per second. Your CPU decodes 850.

data

intermediate

Discover that the data pipeline — not the GPU — is often the binding constraint in training. Use DataModel and TransformationModel to find the crossover where CPU preprocessing stalls the accelerator.

The Question

You launch ResNet-50 training on an A100 and watch nvidia-smi. GPU utilization reads 40%. You expected 95%. The model is compute-bound. The hardware is top-tier. Why is your GPU sitting idle 60% of the time?

The answer is almost never the model or the GPU. It is the invisible pipeline upstream: JPEG decoding, random cropping, color jitter, and normalization — all running on the CPU. When the CPU cannot prepare batches fast enough, the GPU starves.

Prerequisites

Complete Tutorial 0: Hello, Roofline and Tutorial 1: The Memory Wall. You should understand memory-bound vs. compute-bound regimes and how Engine.solve reports bottlenecks.

What You Will Learn

Measure the GPU’s step time in isolation using SingleNodeModel
Calculate the data pipeline’s throughput using DataModel and TransformationModel
Identify the batch size crossover where the CPU becomes the binding constraint
Predict how many CPU workers are needed to eliminate the data bottleneck

Background: The Three Stages of a Training Step

Every training step has three sequential stages. The slowest one determines your actual throughput — not the GPU alone:

Storage I/O (Wall 8) — Read raw data from disk into CPU memory
CPU Preprocessing (Wall 9) — Decode, resize, augment, normalize
Accelerator Compute (Wall 1) — Forward pass, backward pass, weight update

The GPU cannot start until stages 1 and 2 finish. If either is slower than the GPU, the accelerator utilization drops below 100%. This is the data pipeline bottleneck.

1. Setup

import mlsysim
from mlsysim.solvers import SingleNodeModel, DataModel
from mlsysim.solvers import TransformationModel

2. GPU Compute Time: The Ceiling You Think You Have

We switch from LLM serving (Tutorials 2–3) to CNN training because the data pipeline bottleneck is most visible here. LLM training on tokenized text has a tiny data footprint (~8 MB/s as we will see in Tutorial 12). Image training with JPEG decoding, resizing, and augmentation can demand 10–100× more CPU work per sample — this is where the GPU actually starves.

First, establish how fast the A100 processes a ResNet-50 training step in isolation — no data loading, no preprocessing, just pure compute:

from mlsysim.solvers import SingleNodeModel
from mlsysim.core.units import Q_
from mlsysim.show import table, info

model = mlsysim.Models.Vision.ResNet50
hardware = mlsysim.Hardware.Cloud.A100
solver = SingleNodeModel()

# Baseline: ResNet-50 on A100, batch 256, FP16
profile = solver.solve(model=model, hardware=hardware, batch_size=256, precision="fp16")

info("GPU Compute Baseline",
     Model=model.name,
     Hardware=hardware.name,
     Batch_size=256,
     Step_latency=profile.latency.to('ms'),
     Throughput=f"{profile.throughput:.0f} img/s",
     Bottleneck=profile.bottleneck)

── GPU Compute Baseline ────────────────────
Model:         ResNet-50
Hardware:      NVIDIA A100
Batch size:    256
Step latency:  13.97 ms
Throughput:    18323 / second img/s
Bottleneck:    Compute

The GPU can process this batch in tens of milliseconds. That is the ceiling. Now let’s check whether the data pipeline can keep up.

3. Storage I/O Check: Can the Disk Deliver?

ImageNet images average ~500 KB each (JPEG compressed). At batch 256, the GPU demands a burst of data every step. Can the storage subsystem supply it?

from mlsysim.solvers import DataModel

sample_size = Q_("500 KB")  # Average ImageNet JPEG
batch_size = 256

# Data demand = batch_size x sample_size / step_time
step_time_s = profile.latency.to("s").magnitude
data_per_step = (batch_size * sample_size.to("GB")).magnitude
demand_rate = Q_(data_per_step / step_time_s, "GB/s")

data_solver = DataModel()
data_result = data_solver.solve(workload_data_rate=demand_rate, hardware=hardware)

info("Storage I/O Check",
     Data_demand=f"{demand_rate:.3f}",
     Storage_supply=f"{data_result.supply_bw:.2f}",
     Utilization=f"{data_result.utilization:.1%}",
     Is_stalled=data_result.is_stalled)

── Storage I/O Check ───────────────────────
Data demand:     9.162 GB / second
Storage supply:  32.00 GB / second
Utilization:     28.6%
Is stalled:      False

Storage I/O is fine — modern NVMe SSDs can deliver multi-GB/s easily. The bottleneck is not reading the bytes. It is transforming them.

4. The Reveal: CPU Preprocessing Is the Wall

Even with fast storage, the CPU must decode JPEGs, apply random crops, color jitter, and normalization. A typical CPU worker processes ImageNet images at ~250 MB/s. With 8 workers, total CPU throughput is ~2 GB/s:

from mlsysim.solvers import TransformationModel

transform_solver = TransformationModel()
cpu_throughput = Q_("2 GB/s")  # 8 workers x 250 MB/s each

t = transform_solver.solve(
    batch_size=256,
    sample_size_bytes=sample_size,
    cpu_throughput=cpu_throughput,
    accelerator_step_time=profile.latency
)

info("CPU vs GPU Pipeline",
     CPU_transform_time=t.transform_time,
     GPU_step_time=t.accelerator_step_time,
     CPU_is_bottleneck=t.is_bottleneck,
     GPU_utilization=f"{t.accelerator_utilization:.1%}",
     Slowdown_factor=f"{t.slowdown_factor:.2f}x")

── CPU vs GPU Pipeline ─────────────────────
CPU transform time:  64 ms
GPU step time:       13.97 ms
CPU is bottleneck:   True
GPU utilization:     21.8%
Slowdown factor:     4.58x

Key Insight

The binding constraint is not silicon — it is JPEG decoding on the CPU. The data pipeline (Wall 9: Transformation) becomes the bottleneck before the GPU (Wall 1: Compute). Your GPU can process 5,300+ images per second, but your 8 CPU workers can only prepare ~850. The GPU sits idle waiting for data. This is why production training pipelines use GPU-accelerated preprocessing (NVIDIA DALI), pre-decoded datasets, or aggressive prefetching.

5. Batch Size Sweep: Finding the Crossover

Let’s sweep batch sizes to find exactly where the CPU becomes the binding constraint. At small batches, the GPU is slower and data arrives in time. At large batches, the GPU becomes more efficient but the CPU falls behind:

rows = []
for bs in [32, 64, 128, 256, 512, 1024]:
    p = solver.solve(model=model, hardware=hardware, batch_size=bs, precision="fp16")

    t = transform_solver.solve(
        batch_size=bs,
        sample_size_bytes=sample_size,
        cpu_throughput=cpu_throughput,
        accelerator_step_time=p.latency
    )

    binding = "Transformation" if t.is_bottleneck else p.bottleneck
    rows.append([
        bs,
        f"{p.latency.to('ms').magnitude:.2f} ms",
        f"{t.transform_time.to('ms').magnitude:.2f} ms",
        binding,
        f"{t.accelerator_utilization:.1%}"
    ])

table(["Batch", "GPU Step", "CPU Xform", "Binding", "GPU Util"], rows)

Batch  GPU Step  CPU Xform         Binding  GPU Util
────────────────────────────────────────────────────
32      2.20 ms    8.00 ms  Transformation     27.5%
64      3.88 ms   16.00 ms  Transformation     24.2%
128     7.24 ms   32.00 ms  Transformation     22.6%
256    13.97 ms   64.00 ms  Transformation     21.8%
512    27.43 ms  128.00 ms  Transformation     21.4%
1024   54.34 ms  256.00 ms  Transformation     21.2%

Watch the crossover: at small batch sizes the GPU is the bottleneck (100% utilization). As batch size grows, CPU preprocessing time grows linearly while GPU step time grows sub-linearly. Eventually Wall 9 becomes the binding constraint and GPU utilization drops.

6. The Fix: Adding CPU Workers

The simplest fix for a CPU bottleneck is more workers. Let’s compare 8 vs. 16 vs. 32:

rows = []
for n_workers in [8, 16, 32]:
    cpu_tp = Q_(f"{n_workers * 250} MB/s")

    p = solver.solve(model=model, hardware=hardware, batch_size=512, precision="fp16")

    t = transform_solver.solve(
        batch_size=512,
        sample_size_bytes=sample_size,
        cpu_throughput=cpu_tp,
        accelerator_step_time=p.latency
    )

    rows.append([n_workers, cpu_tp.to('GB/s'), f"{t.accelerator_utilization:.1%}"])

table(["Workers", "Throughput", "GPU Util @ bs=512"], rows)

Workers  Throughput  GPU Util @ bs=512
──────────────────────────────────────
8            2 GB/s              21.4%
16           4 GB/s              42.9%
32           8 GB/s              85.7%

Doubling workers doubles throughput — but you eventually hit either storage I/O limits (Wall 8) or PCIe bandwidth. The takeaway: always check all three stages of the pipeline.

Your Turn

Exercises

Exercise 1: Predict before you compute. At batch size 64 with 8 CPU workers (2 GB/s total), will ResNet-50 training on the A100 be GPU-bound or CPU-bound? Write your prediction, then run the code. What determines the answer? (Hint: compare transform_time vs. accelerator_step_time.)

Exercise 2: Medical imaging — larger samples. Medical imaging uses images 10x larger than ImageNet (~5 MB per sample). Change sample_size to Q_("5 MB") and re-run the batch size sweep. At what batch size does the CPU stall the GPU now? How many workers would you need to keep up at batch 256?

Exercise 3: GPU-accelerated preprocessing. If you use NVIDIA DALI to move preprocessing to the GPU, the CPU bottleneck effectively disappears. Model this by setting cpu_throughput = Q_("50 GB/s"). Run the sweep again. Does the bottleneck shift back to compute? What is the new GPU utilization at batch 512?

Self-check: If the GPU step takes 20 ms and CPU preprocessing takes 35 ms, what is the accelerator utilization? (Answer: 20/35 = 57%.)

Key Takeaways

Summary

Data pipelines have three stages: storage I/O, CPU preprocessing, and GPU compute — the slowest determines throughput
CPU preprocessing (Wall 9) is the most common bottleneck: JPEG decode, augmentation, and normalization are all CPU-bound
Batch size shifts the binding constraint: small batches are GPU-bound; large batches often become CPU-bound
Adding CPU workers helps linearly but has diminishing returns when storage I/O becomes the limit
Always check all three stages before concluding that the GPU is the bottleneck

Next Steps

Quantization: Not a Free Lunch — When reducing precision helps (and when it doesn’t)
KV-Cache: The Hidden Tax — Another hidden memory consumer: the KV-cache in LLM serving
Where to Invest — Use sensitivity analysis to decide whether more CPU workers or a faster GPU is the better investment
Silicon Zoo — Compare storage and interconnect specs across GPU platforms