core.solver.WeightStreamingModel

core.solver.WeightStreamingModel()

Analyzes Wafer-Scale inference (e.g., Cerebras CS-3) using Weight Streaming.

Instead of holding weights in HBM and streaming activations (the GPU Memory Wall), this architecture holds massive activation batches on-wafer (SRAM) and streams the model weights from external MemoryX nodes.

The bottleneck therefore shifts from memory bandwidth to the injection bandwidth of the interconnect delivering weights onto the wafer.
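This shift can be sketched with back-of-envelope arithmetic: per layer, the weights must stream in once over the interconnect, while compute scales with the number of tokens in flight. The helper and numbers below are illustrative assumptions only, not Cerebras specifications and not this package's API:

```python
def layer_times(params, batch_tokens, injection_bw, peak_flops,
                bytes_per_param=2, efficiency=0.5):
    """Hypothetical per-layer timing under weight streaming (seconds).

    Weights stream in once per layer regardless of batch size;
    compute grows with the number of tokens processed (one MAC per
    weight per token, i.e. 2 FLOPs per parameter per token).
    """
    stream_time = params * bytes_per_param / injection_bw
    compute_time = 2 * params * batch_tokens / (peak_flops * efficiency)
    return stream_time, compute_time

# Hypothetical layer: 250M fp16 params, 1 TB/s injection, 100 TFLOP/s peak.
s1, c1 = layer_times(250e6, batch_tokens=1, injection_bw=1e12, peak_flops=100e12)
s2, c2 = layer_times(250e6, batch_tokens=1024, injection_bw=1e12, peak_flops=100e12)
print("batch=1:", "interconnect-bound" if s1 > c1 else "compute-bound")
print("batch=1024:", "interconnect-bound" if s2 > c2 else "compute-bound")
```

At small batches streaming dominates; a large enough batch amortizes the weight-streaming cost and the layer becomes compute-bound.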

Literature Source: 1. Lie et al. (2022), “Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning.”

Methods

| Name | Description |
|------|-------------|
| solve | Solves for throughput under Weight Streaming physics. |

solve

core.solver.WeightStreamingModel.solve(
    model,
    hardware,
    seq_len,
    batch_size=1,
    precision='fp16',
    efficiency=0.5,
)

Solves for throughput under Weight Streaming physics.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The LLM model architecture. | required |
| hardware | HardwareNode | The wafer-scale hardware (e.g., Cerebras CS-3). | required |
| seq_len | int | Sequence length for KV cache sizing. | required |
| batch_size | int | Number of sequences processed concurrently. | 1 |
| precision | str | Numerical format (fp16, int8, int4). | 'fp16' |
| efficiency | float | Compute utilization efficiency (0.0 to 1.0). | 0.5 |

Returns

| Name | Type | Description |
|------|------|-------------|
| | WeightStreamingResult | Feasibility, throughput (tokens/s), bottleneck (compute vs. interconnect), layer timing, optimal batch size, and SRAM utilization. |
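The optimal batch size reported in the result can be understood as the crossover point where per-layer compute time catches up with per-layer weight-streaming time. The derivation below is an illustrative sketch of that idea, not the package's actual formula; all hardware numbers are hypothetical:

```python
def crossover_batch(injection_bw, peak_flops, bytes_per_param=2, efficiency=0.5):
    """Batch size (in tokens) at which compute time equals streaming time.

    Setting params * bytes_per_param / injection_bw equal to
    2 * params * B / (peak_flops * efficiency) and solving for B:
    the layer size (params) cancels, so the crossover depends only
    on the hardware balance.
    """
    return bytes_per_param * peak_flops * efficiency / (2 * injection_bw)

# Hypothetical hardware: 1 TB/s injection, 100 TFLOP/s peak, 50% efficiency.
print(crossover_batch(1e12, 100e12))  # 50.0 tokens
```

Below this batch the run is interconnect-bound; above it, compute-bound.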