core.solver.WeightStreamingModel
core.solver.WeightStreamingModel()

Analyzes wafer-scale inference (e.g., Cerebras CS-3) using Weight Streaming.
Instead of holding weights in HBM and streaming activations through the chip (the GPU memory wall), this architecture holds large activation batches in on-wafer SRAM and streams the model weights in from external MemoryX nodes.
The bottleneck therefore shifts from memory bandwidth to injection interconnect bandwidth.
Literature Source: Lie et al. (2022), “Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning.”
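The bottleneck shift above can be illustrated with a minimal sketch. This is not the library's implementation; the function name and arguments are illustrative. Each layer's time is gated by the slower of (a) running the layer's FLOPs on the wafer and (b) streaming that layer's weights in over the injection interconnect:

```python
# Illustrative sketch only (assumed names, not the package's API):
# per-layer time under Weight Streaming is the max of compute time
# and weight-injection time.

def layer_time_s(layer_flops, layer_weight_bytes,
                 peak_flops, interconnect_bw, efficiency=0.5):
    """Return (seconds, bottleneck) for one streamed layer.

    layer_flops        -- FLOPs to execute the layer at the chosen batch size
    layer_weight_bytes -- bytes of weights streamed in from MemoryX
    peak_flops         -- wafer peak FLOP/s
    interconnect_bw    -- injection bandwidth, bytes/s
    efficiency         -- compute utilization (0.0 to 1.0)
    """
    t_compute = layer_flops / (peak_flops * efficiency)
    t_stream = layer_weight_bytes / interconnect_bw
    if t_stream >= t_compute:
        return t_stream, "interconnect"
    return t_compute, "compute"
```

Note that `t_stream` does not depend on batch size, which is why weight streaming favors large on-wafer batches: the fixed injection cost is amortized over more tokens.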
Methods
| Name | Description |
|---|---|
| solve | Solves for throughput under Weight Streaming physics. |
solve
core.solver.WeightStreamingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)

Solves for throughput under Weight Streaming physics.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | The LLM model architecture. | required |
| hardware | HardwareNode | The wafer-scale hardware (e.g., Cerebras CS-3). | required |
| seq_len | int | Sequence length for KV cache sizing. | required |
| batch_size | int | Number of sequences processed concurrently. | 1 |
| precision | str | Numerical format (fp16, int8, int4). | 'fp16' |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | 0.5 |
Returns
| Name | Type | Description |
|---|---|---|
| WeightStreamingResult | Feasibility, throughput (tokens/s), bottleneck (compute vs. interconnect), layer timing, optimal batch size, and SRAM utilization. |
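The "optimal batch size" field in the result can be understood with a minimal sketch (assumed names and toy inputs, not the solver's actual logic): since weight-streaming time per layer pass is fixed, throughput rises with batch size until compute time or on-wafer SRAM capacity becomes the limit.

```python
# Illustrative sketch of the batch-size trade-off behind "optimal batch
# size" (hypothetical helper, not part of core.solver):

def best_batch(seq_len, flops_per_token, weight_bytes,
               peak_flops, interconnect_bw, sram_bytes,
               act_bytes_per_token, efficiency=0.5, max_batch=4096):
    """Sweep batch sizes; return (tokens_per_s, batch) at the peak."""
    best = (0.0, 1)
    for b in range(1, max_batch + 1):
        # Activations for the whole batch must fit in on-wafer SRAM.
        if b * seq_len * act_bytes_per_token > sram_bytes:
            break
        t_compute = b * seq_len * flops_per_token / (peak_flops * efficiency)
        t_stream = weight_bytes / interconnect_bw  # independent of batch
        tokens_per_s = b * seq_len / max(t_compute, t_stream)
        if tokens_per_s > best[0]:
            best = (tokens_per_s, b)
    return best
```

In the interconnect-bound regime, throughput grows linearly with batch size until either the compute roofline or the SRAM capacity check above is hit.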