solvers.WeightStreamingModel
solvers.WeightStreamingModel()

Analyzes wafer-scale inference (e.g., Cerebras CS-3) using weight streaming.
Instead of holding weights in HBM and streaming activations (the regime in which GPUs hit the memory wall), this architecture keeps large activation batches in on-wafer SRAM and streams the model weights in from external MemoryX nodes.
The bottleneck therefore shifts from memory bandwidth to injection interconnect bandwidth, the rate at which weights can be delivered onto the wafer.
Literature Source: 1. Lie et al. (2022), “Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning.”
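The bottleneck shift described above can be illustrated with a back-of-the-envelope estimate. This is a minimal sketch, not the solver's actual implementation: the function `weight_streaming_step_time`, its parameters, and all numeric values are hypothetical placeholders chosen for illustration (they are not CS-3 specifications). It assumes weight streaming and compute overlap perfectly, so the slower of the two sets the time per pass.

```python
def weight_streaming_step_time(
    param_count,           # number of model parameters
    bytes_per_param,       # e.g. 2 for fp16
    injection_bw_bytes_s,  # hypothetical weight-injection bandwidth (bytes/s)
    tokens_in_flight,      # tokens resident on-wafer during one pass
    peak_flops,            # hypothetical peak compute (FLOP/s)
    efficiency=0.5,        # assumed fraction of peak actually achieved
):
    """Return (seconds per pass, limiting resource) under a simple max() model."""
    # Every weight crosses the injection interconnect once per pass.
    stream_s = param_count * bytes_per_param / injection_bw_bytes_s
    # Assumption: roughly 2 FLOPs per parameter per token for a forward pass.
    compute_s = tokens_in_flight * 2 * param_count / (peak_flops * efficiency)
    # Assumption: streaming and compute overlap, so the slower one dominates.
    bound = "interconnect" if stream_s >= compute_s else "compute"
    return max(stream_s, compute_s), bound


# Placeholder numbers only: 70e9 parameters, fp16, 1.2e12 B/s, 1e15 FLOP/s.
small_batch = weight_streaming_step_time(70e9, 2, 1.2e12, 1, 1e15)
large_batch = weight_streaming_step_time(70e9, 2, 1.2e12, 1024, 1e15)
print(small_batch[1], large_batch[1])
```

With few tokens in flight the pass time is set by how fast weights arrive; once enough activations are held on-wafer, the same streamed weights are reused across more tokens and compute becomes the limit.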
Methods
| Name | Description |
|---|---|
| solve | Simulates Weight Streaming throughput and SRAM feasibility. |
solve
solvers.WeightStreamingModel.solve(
model,
hardware,
seq_len,
batch_size=1,
precision='fp16',
efficiency=0.5,
phase='decode',
)

Simulates Weight Streaming throughput and SRAM feasibility.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| phase | str | Inference phase, either `'prefill'` or `'decode'`. `'prefill'` processes all S tokens in parallel (compute-heavy, O(S^2) attention); `'decode'` processes one token at a time per request (memory-bound). | 'decode' |
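To make the difference between the two phases concrete, the sketch below counts tokens and attention score pairs per forward pass. It is an illustration only: `tokens_per_pass` and `attention_score_pairs` are hypothetical helpers, not part of `solvers`, and the sketch assumes that S corresponds to `seq_len` and that causal masking is ignored when counting pairs.

```python
def tokens_per_pass(phase, seq_len, batch_size=1):
    """Tokens processed in one forward pass for the given phase."""
    if phase == "prefill":
        return batch_size * seq_len  # all S tokens of every request at once
    if phase == "decode":
        return batch_size            # one new token per request
    raise ValueError(f"unknown phase: {phase!r}")


def attention_score_pairs(phase, seq_len, batch_size=1):
    """Query-key pairs scored per layer and head (causal masking ignored)."""
    # Each processed token attends over S positions; in decode those
    # positions come from previously cached keys.
    return tokens_per_pass(phase, seq_len, batch_size) * seq_len


for phase in ("prefill", "decode"):
    print(phase, tokens_per_pass(phase, 2048, 4), attention_score_pairs(phase, 2048, 4))
```

Prefill scores S x S pairs per request, which is where the O(S^2) term comes from, while decode scores only S pairs per request for its single new token.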