solvers.WeightStreamingModel

solvers.WeightStreamingModel()

Analyzes Wafer-Scale inference (e.g., Cerebras CS-3) using Weight Streaming.

Instead of holding weights in HBM and streaming activations past them (the regime that runs into the GPU memory wall), this architecture holds large activation batches in on-wafer SRAM and streams the model weights in from external MemoryX nodes.

The bottleneck therefore shifts from memory (HBM) bandwidth to injection interconnect bandwidth, i.e., the rate at which weights can be streamed onto the wafer.
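To make the shift concrete, here is a minimal sketch of the interconnect-bound throughput ceiling. It assumes every step streams the full set of weights onto the wafer exactly once and that the whole on-wafer batch reuses them. The function name, its parameters, and the numbers in the example are illustrative assumptions, not part of this library and not vendor specifications.

```python
def streaming_bound_tokens_per_s(
    param_count: float,
    bytes_per_param: float,
    link_bw_bytes_per_s: float,
    batch_size: int,
) -> float:
    """Upper bound on tokens/s when weight injection is the only limit.

    Assumption: one full pass over the weights per step, shared by the
    entire batch held on the wafer.
    """
    weight_bytes = param_count * bytes_per_param
    step_time_s = weight_bytes / link_bw_bytes_per_s  # time to stream all weights once
    return batch_size / step_time_s                   # one token per request per step


# Illustrative placeholder numbers only (not measured or published figures).
single = streaming_bound_tokens_per_s(70e9, 2, 1e12, batch_size=1)
batched = streaming_bound_tokens_per_s(70e9, 2, 1e12, batch_size=256)
print(f"batch=1:   {single:.1f} tokens/s")
print(f"batch=256: {batched:.1f} tokens/s")
```

Under this assumption the ceiling grows linearly with batch size, which is why the design relies on holding large activation batches in SRAM.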

Literature Source: 1. Lie et al. (2022), “Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning.”

Methods

Name | Description
solve | Simulates Weight Streaming throughput and SRAM feasibility.

solve

solvers.WeightStreamingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
    phase='decode',
)

Simulates Weight Streaming throughput and SRAM feasibility.
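The kind of calculation this describes can be sketched as follows. This is not the solver's actual implementation: `sketch_solve`, `SketchResult`, the KV-cache formula (standard multi-head attention, no grouped-query sharing), the 2 FLOPs per parameter per token estimate, and the choice to apply `efficiency` to compute only are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical precision table; the real solver may support other formats.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


@dataclass
class SketchResult:
    fits_in_sram: bool
    kv_cache_bytes: float
    step_time_s: float
    tokens_per_s: float
    bound: str  # "compute" or "interconnect"


def sketch_solve(n_params, n_layers, hidden_dim, seq_len, batch_size,
                 sram_bytes, peak_flops, link_bw_bytes_per_s,
                 precision="fp16", efficiency=0.5):
    """Rough decode-phase estimate under the assumptions stated above."""
    b = BYTES_PER_ELEMENT[precision]

    # SRAM feasibility: K and V tensors per layer, per token, per request.
    kv_cache_bytes = 2 * n_layers * batch_size * seq_len * hidden_dim * b
    fits = kv_cache_bytes <= sram_bytes

    # One decode step: about 2 FLOPs per parameter per request.
    compute_time = 2 * n_params * batch_size / (peak_flops * efficiency)
    # Weights are streamed onto the wafer once per step.
    stream_time = n_params * b / link_bw_bytes_per_s

    step_time = max(compute_time, stream_time)
    bound = "compute" if compute_time >= stream_time else "interconnect"
    return SketchResult(fits, kv_cache_bytes, step_time,
                        batch_size / step_time, bound)


# Illustrative placeholder inputs, not real hardware or model specifications.
r = sketch_solve(n_params=1e9, n_layers=10, hidden_dim=1000, seq_len=100,
                 batch_size=4, sram_bytes=1e9, peak_flops=1e15,
                 link_bw_bytes_per_s=1e12)
print(r)
```

Taking the maximum of the two times is a simple roofline-style choice; a model that overlaps streaming with compute imperfectly would need an extra term.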

Parameters

Name | Type | Description | Default
phase | str | Inference phase, either 'prefill' or 'decode'. 'prefill' processes all S tokens in parallel (compute-heavy, O(S^2) attention); 'decode' processes one token at a time per request (memory-bound). | 'decode'
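The difference between the two phases can be illustrated with a rough FLOP count. The sketch below uses common approximations (2 FLOPs per parameter per token for the linear layers, and 4 * S * d FLOPs per layer for each query attending over S positions); the function name and arguments are hypothetical and do not reflect how `solve` computes its result.

```python
def approx_flops_per_request(phase, n_params, n_layers, hidden_dim, seq_len):
    """Approximate FLOPs for one request in the given phase (illustrative only)."""
    if phase == "prefill":
        # All seq_len tokens pass through the linear layers in parallel.
        linear = 2 * n_params * seq_len
        # Every token attends over the sequence: quadratic in seq_len.
        attention = 4 * n_layers * seq_len ** 2 * hidden_dim
    elif phase == "decode":
        # A single new token passes through the linear layers.
        linear = 2 * n_params
        # The new token attends over seq_len cached positions: linear in seq_len.
        attention = 4 * n_layers * seq_len * hidden_dim
    else:
        raise ValueError(f"phase must be 'prefill' or 'decode', got {phase!r}")
    return linear + attention


prefill = approx_flops_per_request("prefill", n_params=1000, n_layers=2,
                                   hidden_dim=8, seq_len=4)
decode = approx_flops_per_request("decode", n_params=1000, n_layers=2,
                                  hidden_dim=8, seq_len=4)
print(prefill, decode)
```

Because a decode step does far less arithmetic per byte of weights moved, it tends to be limited by data movement rather than compute, which matches the description of the two phases above.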