models.types.SparseTransformerWorkload
models.types.SparseTransformerWorkload()

Sparse Transformer / Mixture-of-Experts workload type.
parameters represents total resident parameters and therefore memory pressure. active_parameters represents the parameters used per token and therefore the first-order compute path. experts and active_experts_per_token describe the routing structure used by MoERoutingModel and expert-parallel communication estimates.
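As a rough illustration of the split between memory pressure and compute, the sketch below derives two first-order numbers from the same pair of fields. It does not use the library: `ToyMoE`, `weight_memory_bytes`, and `forward_flops_per_token` are hypothetical names, the 2 bytes per parameter assumes 16-bit weights, and the factor of 2 FLOPs per active parameter is the common forward-pass rule of thumb rather than anything this class is documented to compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMoE:
    """Hypothetical stand-in for the two parameter fields; not the mlsysim class."""
    parameters: float          # total resident parameters
    active_parameters: float   # parameters used per token


def weight_memory_bytes(model: ToyMoE, bytes_per_param: float = 2.0) -> float:
    # Memory pressure scales with TOTAL parameters: every expert must be
    # resident, even though only a few are used for any given token.
    # Assumption: 2 bytes per parameter (fp16/bf16 weights).
    return model.parameters * bytes_per_param


def forward_flops_per_token(model: ToyMoE) -> float:
    # First-order compute scales with ACTIVE parameters.
    # Assumption: ~2 FLOPs (one multiply, one add) per active parameter.
    return 2.0 * model.active_parameters


toy = ToyMoE(parameters=64e9, active_parameters=8e9)
print(f"weights: {weight_memory_bytes(toy) / 1e9:.0f} GB")
print(f"forward: {forward_flops_per_token(toy) / 1e9:.0f} GFLOPs per token")
```

Under these assumptions the toy model needs the weight memory of a 64B dense model while paying roughly the per-token compute of an 8B one, which is why the two counts are tracked separately.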
Key Fields
| Name | Type | Description |
|---|---|---|
| parameters | Quantity | Total model parameters. |
| active_parameters | Quantity | Parameters active per token. |
| experts | int | Total number of experts. |
| active_experts_per_token | int | Top-k experts selected per token. |
| layers | int | Transformer layer count. |
| hidden_dim | int | Hidden dimension used for activation and routing-volume estimates. |
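To make the routing-related fields concrete, here is a minimal sketch of a per-token routing-volume estimate. It is an illustrative upper bound, not the formula used by MoERoutingModel: the function name, the 2-byte activation size, and the assumption that every selected expert sits on a remote device (one dispatch plus one combine transfer per expert) are all hypothetical.

```python
def routing_bytes_per_token_per_layer(
    hidden_dim: int,
    active_experts_per_token: int,
    bytes_per_element: int = 2,   # assumption: 16-bit activations
    transfers_per_expert: int = 2,  # assumption: dispatch + combine
) -> int:
    # Each selected expert receives one hidden_dim-sized activation vector
    # and sends one back, so volume grows with top-k, not with the total
    # number of experts.
    return (
        hidden_dim
        * active_experts_per_token
        * bytes_per_element
        * transfers_per_expert
    )


# Toy values: hidden_dim=4096, top-2 routing, 32 layers.
per_layer = routing_bytes_per_token_per_layer(4096, 2)
per_token = per_layer * 32  # assumes every layer is an MoE layer

print(f"{per_layer} bytes per token per layer")
print(f"{per_token} bytes per token across all layers")
```

In this sketch the total expert count does not appear in the volume at all. In a real expert-parallel setup it would mainly affect how many devices the traffic is spread over, which is outside what this estimate models.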
Example
from mlsysim import SparseTransformerWorkload, ureg

moe = SparseTransformerWorkload(
    name="Toy-MoE-64B",
    architecture="Sparse Transformer",
    parameters=64e9 * ureg.count,
    active_parameters=8e9 * ureg.count,
    experts=8,
    active_experts_per_token=2,
    layers=32,
    hidden_dim=4096,
    heads=32,
)