core.solver.WeightStreamingModel

core.solver.WeightStreamingModel()

Analyzes Wafer-Scale inference (e.g., Cerebras CS-3) using Weight Streaming.

Instead of holding weights in HBM and streaming activations (the GPU Memory Wall), this architecture holds massive activation batches on-wafer (SRAM) and streams the model weights from external MemoryX nodes.

The bottleneck therefore shifts from memory bandwidth to the injection bandwidth of the interconnect delivering weights onto the wafer.
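This shift can be sketched with back-of-envelope arithmetic: per layer, the weights must stream in once over the interconnect, while compute scales with the number of tokens in flight. The helper and numbers below are illustrative assumptions only, not Cerebras specifications and not this package's API:

```python
def layer_times(params, batch_tokens, injection_bw, peak_flops,
                bytes_per_param=2, efficiency=0.5):
    """Hypothetical per-layer timing under weight streaming (seconds).

    Weights stream in once per layer regardless of batch size;
    compute grows with the number of tokens processed (one MAC per
    weight per token, i.e. 2 FLOPs per parameter per token).
    """
    stream_time = params * bytes_per_param / injection_bw
    compute_time = 2 * params * batch_tokens / (peak_flops * efficiency)
    return stream_time, compute_time

# Hypothetical layer: 250M fp16 params, 1 TB/s injection, 100 TFLOP/s peak.
s1, c1 = layer_times(250e6, batch_tokens=1, injection_bw=1e12, peak_flops=100e12)
s2, c2 = layer_times(250e6, batch_tokens=1024, injection_bw=1e12, peak_flops=100e12)
print("batch=1:", "interconnect-bound" if s1 > c1 else "compute-bound")
print("batch=1024:", "interconnect-bound" if s2 > c2 else "compute-bound")
```

At small batches streaming dominates; a large enough batch amortizes the weight-streaming cost and the layer becomes compute-bound.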

Literature Source: 1. Lie et al. (2022), “Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning.”

Methods

| Name | Description |
|------|-------------|
| solve | Solves for throughput under Weight Streaming physics. |

solve

core.solver.WeightStreamingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)

Solves for throughput under Weight Streaming physics.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The LLM model architecture. | required |
| hardware | HardwareNode | The wafer-scale hardware (e.g., Cerebras CS-3). | required |
| seq_len | int | Sequence length for KV cache sizing. | required |
| batch_size | int | Number of sequences processed concurrently. | 1 |
| precision | str | Numerical format (fp16, int8, int4). | 'fp16' |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | 0.5 |

Returns

| Name | Type | Description |
|------|------|-------------|
| | WeightStreamingResult | Feasibility, throughput (tokens/s), bottleneck (compute vs. interconnect), layer timing, optimal batch size, and SRAM utilization. |
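The optimal batch size reported in the result can be understood as the crossover point where per-layer compute time catches up with per-layer weight-streaming time. The derivation below is an illustrative sketch of that idea, not the package's actual formula; all hardware numbers are hypothetical:

```python
def crossover_batch(injection_bw, peak_flops, bytes_per_param=2, efficiency=0.5):
    """Batch size (in tokens) at which compute time equals streaming time.

    Setting params * bytes_per_param / injection_bw equal to
    2 * params * B / (peak_flops * efficiency) and solving for B:
    the layer size (params) cancels, so the crossover depends only
    on the hardware balance.
    """
    return bytes_per_param * peak_flops * efficiency / (2 * injection_bw)

# Hypothetical hardware: 1 TB/s injection, 100 TFLOP/s peak, 50% efficiency.
print(crossover_batch(1e12, 100e12))  # 50.0 tokens
```

Below this batch the run is interconnect-bound; above it, compute-bound.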