GPU vs. Wafer-Scale

Cerebras eliminates the memory wall — then hits a completely different one.

analysis · advanced
Compare conventional GPU inference to Cerebras weight-streaming silicon. The binding constraint shifts from HBM bandwidth to injection bandwidth — a qualitative regime change, not just a quantitative improvement.

The Question

Can a fundamentally different architecture change which wall binds? GPUs are weight-stationary: weights live in HBM, and the bottleneck is HBM bandwidth. The Cerebras WSE-3 takes the opposite approach: it is activation-stationary, holding activations on 44 GB of on-wafer SRAM and streaming weights from external MemoryX nodes. Does this eliminate the memory wall — or just move it somewhere else?

[Note] Prerequisites

Complete Tutorial 0: Hello, Roofline, Tutorial 1: The Memory Wall, and Tutorial 9: Sensitivity Analysis. You should understand roofline analysis, binding constraints, and sensitivity-based investment decisions.

[Note] What You Will Learn
  • Compare GPU and Cerebras architectures on the same workload using different solvers
  • Identify that the binding constraint shifts from HBM bandwidth to injection bandwidth
  • Compute the optimal batch size B* where injection and compute overlap perfectly
  • Explain why this is a qualitative regime change, not just a quantitative speedup
[Tip] Background: Two Philosophies of Memory

Conventional GPUs use a two-level memory hierarchy: fast but small on-chip SRAM (registers, L1/L2 cache) and large but slower off-chip HBM. Wafer-scale computing asks the opposite question: what if the chip were large enough that SRAM alone could hold the working set? The Cerebras WSE-3 is an entire silicon wafer — 46,225 mm² vs. ~800 mm² for an H100 die — with 44 GB of on-wafer SRAM distributed across 900,000 cores.

GPU (weight-stationary): Model weights live in HBM. At each decode step, the entire model streams from HBM to the compute units. Activations are small and transient. Bottleneck: HBM bandwidth.

Cerebras WSE-3 (activation-stationary): Activations and KV-cache live on the 44 GB of on-wafer SRAM. But 44 GB cannot hold a 350 GB model, so weights must stream in layer-by-layer from external MemoryX nodes — dedicated memory boxes connected to the wafer via a high-bandwidth interconnect. Bottleneck: injection bandwidth from MemoryX.

Same model, same math, completely different performance physics.
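The injection bottleneck can be estimated on the back of an envelope. The sketch below uses the figures from this tutorial (350 GB of fp16 weights, 96 layers, ~1.2 TB/s MemoryX bandwidth) and assumes weights stream evenly across layers:

```python
# Rough per-layer weight-injection time, assuming the 350 GB fp16 model
# streams evenly across 96 layers at ~1.2 TB/s (figures from this tutorial).
model_bytes = 350e9   # GPT-3 weights at fp16
num_layers = 96
inject_bw = 1.2e12    # bytes/s, MemoryX -> wafer

t_layer_s = model_bytes / num_layers / inject_bw
print(f"{t_layer_s * 1e3:.2f} ms per layer")  # 3.04 ms per layer
```

This matches the per-layer injection time the solver reports below.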


1. Setup

import mlsysim
from mlsysim import SingleNodeModel, WeightStreamingModel, SensitivitySolver

2. GPU Baseline: H100 Inference

We use GPT-3 (175B) — a model large enough that architectural differences in how weights reach compute become the dominant factor. At batch size 1, each decode step must reload the entire model from HBM.

# solvers were imported in Setup; pull in the display helpers
from mlsysim.show import table, info, banner

model = mlsysim.Models.Language.GPT3
gpu_hw = mlsysim.Hardware.Cloud.H100

gpu_solver = SingleNodeModel()
gpu_result = gpu_solver.solve(
    model=model, hardware=gpu_hw,
    batch_size=1, precision="fp16"
)

info("GPU Baseline",
     Model=f"{model.name} ({model.parameters.to('Gparam'):.0f})",
     Hardware=gpu_hw.name,
     Bottleneck=gpu_result.bottleneck,
     Latency=gpu_result.latency.to('ms'),
     HBM_BW=gpu_hw.memory.bandwidth.to('TB/s'),
     Peak_FLOPS=gpu_hw.compute.peak_flops.to('TFLOPs/s'))
── GPU Baseline ────────────────────────────
Model:       GPT-3 (175B) (175 gigaparam)
Hardware:    NVIDIA H100
Bottleneck:  Memory
Latency:     5,478.4 ms
HBM BW:      3.35 TB/s
Peak FLOPS:  989 TFLOPs/s

At batch size 1, GPT-3 requires 2 FLOPs per parameter per token but must load all 175B parameters (350 GB at fp16) from HBM. The arithmetic intensity is approximately 1 FLOP/byte — far below the H100’s ridge point. The 3.35 TB/s HBM bandwidth, not the 989 TFLOP/s compute, determines the decode latency.
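A quick roofline check makes this concrete. This is an illustrative calculation using the headline numbers above, not the simulator's full model:

```python
# Roofline sanity check for batch-1 decode (illustrative only).
params = 175e9
flops_per_token = 2 * params           # 2 FLOPs per parameter per token
bytes_per_step = 2 * params            # fp16: 2 bytes per parameter, full reload
intensity = flops_per_token / bytes_per_step   # FLOP/byte

hbm_bw = 3.35e12                       # bytes/s
peak_flops = 989e12                    # FLOP/s
ridge = peak_flops / hbm_bw            # ridge point, FLOP/byte

print(f"intensity = {intensity:.1f} FLOP/byte, ridge = {ridge:.0f} FLOP/byte")
# intensity = 1.0 FLOP/byte, ridge = 295 FLOP/byte
```

At 1 FLOP/byte against a ridge point near 295 FLOP/byte, the workload sits deep in the memory-bound region: compute units idle while weights stream from HBM.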


3. Cerebras Path: Weight Streaming on WSE-3

Now analyze the same model on the Cerebras CS-3. Instead of loading weights from HBM, the WSE-3 streams them from MemoryX nodes over a dedicated interconnect.

ws_hw = mlsysim.Hardware.Cloud.Cerebras_CS3
ws_solver = WeightStreamingModel()

ws_result = ws_solver.solve(
    model=model, hardware=ws_hw,
    seq_len=2048, batch_size=1, precision="fp16"
)

info("Cerebras WSE-3",
     Hardware=ws_hw.name,
     Feasible=ws_result.feasible,
     Bottleneck=ws_result.bottleneck,
     Throughput=f"{ws_result.throughput_tokens_per_sec:.0f} tokens/sec",
     Layer_compute_time=ws_result.layer_compute_time.to('ms'),
     Layer_injection_time=ws_result.layer_injection_time.to('ms'),
     Optimal_batch_size=ws_result.optimal_batch_size,
     SRAM_utilization=f"{ws_result.wafer_memory_utilization:.1%}")
── Cerebras WSE-3 ──────────────────────────
Hardware:              Cerebras CS-3 (WSE-3)
Feasible:              True
Bottleneck:            Interconnect-Bandwidth-Bound
Throughput:            3 tokens/sec
Layer compute time:    1.06e-03 ms
Layer injection time:  3.04 ms
Optimal batch size:    52083
SRAM utilization:      24.2%

The WSE-3 reports two times per layer: how long the wafer takes to compute the layer’s output, and how long it takes to inject the layer’s weights from MemoryX. The bottleneck is whichever is slower.
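Using the two reported per-layer times, a short sketch recovers the totals: the slower pipeline sets the pace for every layer.

```python
# Per-layer bottleneck: the slower of injection and compute wins.
# Times taken from the solver output above.
t_inject_ms = 3.04      # weight injection from MemoryX
t_compute_ms = 1.06e-3  # on-wafer compute
num_layers = 96

t_layer_ms = max(t_inject_ms, t_compute_ms)
total_ms = t_layer_ms * num_layers
print(f"{total_ms:.1f} ms per token -> {1000 / total_ms:.1f} tokens/sec")
# 291.8 ms per token -> 3.4 tokens/sec
```

This reproduces (up to rounding) the ~3 tokens/sec throughput reported by the solver: at batch 1, the wafer spends almost all of its time waiting for weights.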


4. Side-by-Side: Where the Wall Shifts

gpu_lat_ms = gpu_result.latency.to('ms').magnitude
# Cerebras total decode: max(inject, compute) per layer * num_layers
ws_layer_time = max(
    ws_result.layer_injection_time.to('ms').magnitude,
    ws_result.layer_compute_time.to('ms').magnitude
)
ws_total_ms = ws_layer_time * model.layers
speedup = gpu_lat_ms / ws_total_ms if ws_total_ms > 0 else 0

table(
    ["Metric", "H100 (GPU)", "CS-3 (WSE)"],
    [
        ["Bottleneck", gpu_result.bottleneck, ws_result.bottleneck],
        ["Total decode time (ms)", f"{gpu_lat_ms:.2f}", f"{ws_total_ms:.2f}"],
        ["Speedup", "1.0x", f"{speedup:.1f}x"],
        ["Optimal batch B*", "N/A", ws_result.optimal_batch_size],
    ]
)
Metric                  H100 (GPU)                    CS-3 (WSE)
────────────────────────────────────────────────────────────────
Bottleneck                  Memory  Interconnect-Bandwidth-Bound
Total decode time (ms)     5478.36                        291.67
Speedup                       1.0x                         18.8x
Optimal batch B*               N/A                         52083

The GPU and WSE-3 hit fundamentally different walls:

  • GPU: Limited by HBM bandwidth (~3.35 TB/s)
  • WSE-3: Limited by MemoryX injection bandwidth (~1.2 TB/s)

This means the optimization strategies are completely different. For the GPU, you optimize by reducing bytes loaded (quantization, smaller models). For the WSE-3, you optimize by overlapping injection with compute (increasing batch size toward B*).
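The GPU lever can be quantified directly: a memory-bound step time scales with bytes loaded, so halving precision roughly halves latency. This is an idealized lower-bound calculation, not the simulator's latency:

```python
# GPU lever sketch: memory-bound decode time scales with bytes loaded,
# so quantizing fp16 -> fp8 roughly halves the ideal step time.
params = 175e9
hbm_bw = 3.35e12  # bytes/s
for name, bytes_per_param in [("fp16", 2), ("fp8", 1)]:
    t_ms = params * bytes_per_param / hbm_bw * 1e3
    print(f"{name}: {t_ms:.1f} ms per step")
# fp16: 104.5 ms per step
# fp8: 52.2 ms per step
```

No amount of quantization helps the WSE-3's injection wall in the same way; its lever is overlap, not bytes.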

[Important] Key Insight

The binding constraint is not a property of the model — it is a property of the model-architecture pair. GPUs are bound by HBM bandwidth. Cerebras WSE-3 eliminates the HBM wall entirely (weights never touch HBM) but introduces an injection bandwidth wall from MemoryX. This is a qualitative regime change: the wall shifted, it did not disappear. When evaluating any novel architecture, the question is not “is it faster?” but “which wall does it move, and what new wall does it create?”


5. The SRAM Ceiling: Finding B*

The WSE-3 has a unique optimization knob: batch size controls whether compute or injection dominates. At the optimal batch size B*, the two pipelines overlap perfectly. But activations must fit in 44 GB of on-wafer SRAM — this is the SRAM ceiling.
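The overlap condition that defines B* can be written down directly. The timings below are hypothetical round numbers for illustration; the simulator's reported B* comes from its own internal per-sample timing model:

```python
# Minimal sketch of the overlap condition. Injection time per layer is fixed;
# compute time grows linearly with batch (simplifying assumption). At B* the
# two are equal, so neither pipeline idles:
#   B* * t_compute_per_sample = t_inject_per_layer
def optimal_batch(t_inject_per_layer, t_compute_per_sample):
    return t_inject_per_layer / t_compute_per_sample

# e.g. 3 ms of injection vs 1 us of compute per sample per layer:
print(optimal_batch(3.0e-3, 1.0e-6))  # ~3000
```

Below B*, the wafer idles waiting for weights; above it, injection hides fully behind compute, if the activations still fit in SRAM.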

rows = []
for batch in [1, 2, 4, 8, 16, 32, 64, 128]:
    r = ws_solver.solve(
        model=model, hardware=ws_hw,
        seq_len=2048, batch_size=batch, precision="fp16"
    )
    rows.append([
        batch, r.bottleneck,
        f"{r.throughput_tokens_per_sec:.0f}/s",
        f"{r.wafer_memory_utilization:.1%}",
        "YES" if r.feasible else "OOM"
    ])

table(["Batch", "Bottleneck", "Throughput", "SRAM Util", "Feasible"], rows)
Batch                    Bottleneck  Throughput  SRAM Util  Feasible
────────────────────────────────────────────────────────────────────
1      Interconnect-Bandwidth-Bound         3/s      24.2%       YES
2      Interconnect-Bandwidth-Bound         7/s      48.3%       YES
4      Interconnect-Bandwidth-Bound        14/s      96.6%       YES
8      Interconnect-Bandwidth-Bound         0/s     193.3%       OOM
16     Interconnect-Bandwidth-Bound         0/s     386.5%       OOM
32     Interconnect-Bandwidth-Bound         0/s     773.1%       OOM
64     Interconnect-Bandwidth-Bound         0/s    1546.2%       OOM
128    Interconnect-Bandwidth-Bound         0/s    3092.4%       OOM

The solver reports B* = 52,083 as the batch size where injection and compute would overlap perfectly, but the sweep never gets close: the bottleneck stays injection-bound at every feasible batch size. SRAM fills up first. Beyond batch 4, activations exceed the 44 GB of on-wafer SRAM and the configuration is infeasible (OOM). For this model and sequence length, the SRAM ceiling, not B*, is the practical limit on batch size.
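A quick estimate of that ceiling from the sweep: SRAM utilization grows roughly linearly with batch, about 24.2% per sample at this sequence length, so the largest feasible batch follows directly.

```python
# SRAM ceiling estimate from the sweep above (illustrative):
# utilization grows ~linearly with batch, ~24.2% per sample here.
util_per_sample = 0.242
b_max = int(1 / util_per_sample)  # largest batch with utilization <= 100%
print(b_max)  # 4
```

A batch of 4 is five orders of magnitude short of B* = 52,083, which is why the bottleneck column never leaves "Interconnect-Bandwidth-Bound".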


6. Sensitivity Confirmation: Different Walls, Different Levers

Run the SensitivitySolver on the GPU to see which hardware parameter binds the latency, then contrast with the Cerebras architecture conceptually.

sens_solver = SensitivitySolver()
gpu_sens = sens_solver.solve(
    model=model, hardware=gpu_hw, precision="fp16"
)

banner(f"GPU Sensitivity ({gpu_hw.name})")
info(Baseline_latency=gpu_sens.baseline_latency.to('ms'),
     Binding_constraint=gpu_sens.binding_constraint)

sens_rows = [[param, f"{val:+.4f}"] for param, val in gpu_sens.sensitivities.items()]
table(["Parameter", "Sensitivity"], sens_rows)

banner("Cerebras WSE-3")
info(Binding_constraint="injection bandwidth (MemoryX -> wafer)",
     Optimization_lever="increase batch size to overlap inject/compute")

print()
print("Different architectures -> different walls -> different strategies.")

=== GPU Sensitivity (NVIDIA H100) ===
Baseline latency:    5,478.4 ms
Binding constraint:  peak_flops
Parameter         Sensitivity
─────────────────────────────
peak_flops            +0.0000
memory_bandwidth      +0.0000
memory_capacity       +0.0000

=== Cerebras WSE-3 ===
Binding constraint:  injection bandwidth (MemoryX -> wafer)
Optimization lever:  increase batch size to overlap inject/compute

Different architectures -> different walls -> different strategies.
[Warning] The deeper lesson

When evaluating novel architectures (wafer-scale, photonic, analog, neuromorphic), do not ask “Is it faster?” Ask: “Which wall does it move, and what new wall does it create?” Every architecture eliminates one bottleneck by introducing another.


Your Turn

[Caution] Exercises

Exercise 1: Predict before you compute. Does the Cerebras advantage grow or shrink for smaller models? Before running code, predict whether the WSE-3 speedup over H100 will be larger or smaller for mlsysim.Models.Llama3_8B (8B parameters) compared to GPT-3 (175B). Then verify with both solvers. Explain your finding in terms of injection bandwidth utilization.

Exercise 2: The SRAM ceiling. At what model size does the 44 GB SRAM ceiling become the binding constraint on Cerebras? Try mlsysim.Models.Llama3_70B at increasing sequence lengths (512, 1024, 2048, 4096, 8192). At what point does SRAM utilization exceed 100% (OOM)? What does this mean for serving long-context models on wafer-scale silicon?

Exercise 3: TCO comparison. If an H100 costs ~$30,000 and a Cerebras CS-3 costs ~$2,000,000, how many H100s would you need to match the Cerebras throughput for GPT-3 inference? Use the throughput numbers from this tutorial to compute the fleet size, then compare the total hardware cost. Which is more cost-effective at 100 queries per second?

Self-check: If the WSE-3 injection bandwidth is 1.2 TB/s and GPT-3 weights are 350 GB (fp16), what is the minimum per-layer injection time for a 96-layer model?


Key Takeaways

[Tip] Summary
  • Weight streaming inverts the GPU memory hierarchy: activations stay on-wafer (SRAM), weights stream in from external memory nodes
  • The binding constraint shifts from HBM bandwidth (GPU) to injection bandwidth (WSE-3) — a qualitative change in system physics
  • Optimal batch size B* exists for weight-streaming architectures, perfectly overlapping injection with compute
  • Architecture evaluation requires asking “which wall moves?” not “which is faster?”
