models.types.SparseTransformerWorkload

models.types.SparseTransformerWorkload()

Sparse Transformer / Mixture-of-Experts workload type.

parameters is the total number of resident parameters, so it determines memory pressure. active_parameters is the number of parameters used per token, so it determines the first-order compute cost. experts and active_experts_per_token describe the routing structure used by MoERoutingModel and by expert-parallel communication estimates.
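The split between total and active parameters can be illustrated with a short back-of-the-envelope sketch. It does not use mlsysim and does not claim to reproduce its internal formulas: the MoEFirstOrder class is a hypothetical helper, the 2 bytes per parameter assumes 16-bit weights, and the 2 FLOPs per active parameter per token is a common forward-pass rule of thumb rather than anything stated by this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEFirstOrder:
    """Hypothetical helper for illustration; not part of mlsysim."""

    parameters: float         # total resident parameters
    active_parameters: float  # parameters used per token

    def weight_bytes(self, bytes_per_param: float = 2.0) -> float:
        # Memory pressure follows the TOTAL parameter count, because every
        # expert must stay resident even if a token only visits a few.
        # Assumption: 2 bytes per parameter (16-bit weights).
        return self.parameters * bytes_per_param

    def flops_per_token(self) -> float:
        # Compute follows the ACTIVE parameter count.
        # Assumption: ~2 FLOPs (one multiply, one add) per active
        # parameter per token in the forward pass.
        return 2.0 * self.active_parameters


est = MoEFirstOrder(parameters=64e9, active_parameters=8e9)

weight_gb = est.weight_bytes() / 1e9
gflops_per_token = est.flops_per_token() / 1e9
total_to_active = est.parameters / est.active_parameters

print(f"weights: {weight_gb:.0f} GB")
print(f"forward compute: {gflops_per_token:.0f} GFLOPs per token")
print(f"total/active parameter ratio: {total_to_active:.0f}x")
```

Under these assumptions the model occupies the memory of a 64B-parameter network while costing roughly as much compute per token as an 8B-parameter dense network, which is why the two fields are tracked separately.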

Key Fields

Name                      Type      Description
parameters                Quantity  Total model parameters.
active_parameters         Quantity  Parameters active per token.
experts                   int       Total number of experts.
active_experts_per_token  int       Top-k experts selected per token.
layers                    int       Transformer layer count.
hidden_dim                int       Hidden dimension used for activation and routing-volume estimates.
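To show how hidden_dim, layers, and active_experts_per_token could combine into a routing-volume estimate, here is a minimal sketch. The function routing_bytes_per_token is hypothetical and is not the formula used by MoERoutingModel. It assumes every layer is an MoE layer, every selected expert lives on a remote device, and activations are 2 bytes per element, so the result is best read as an upper bound.

```python
def routing_bytes_per_token(
    hidden_dim: int,
    experts: int,
    active_experts_per_token: int,
    layers: int,
    bytes_per_element: int = 2,
) -> int:
    """Hypothetical upper bound on expert-parallel traffic for one token."""
    # Top-k routing only makes sense if 1 <= k <= number of experts.
    if not 1 <= active_experts_per_token <= experts:
        raise ValueError("active_experts_per_token must be in [1, experts]")

    # One hidden vector of hidden_dim elements per selected expert.
    one_way = active_experts_per_token * hidden_dim * bytes_per_element

    # Dispatch (token -> experts) plus combine (experts -> token).
    per_layer = 2 * one_way

    # Assumption: every transformer layer contains an MoE block.
    return per_layer * layers


volume = routing_bytes_per_token(
    hidden_dim=4096,
    experts=8,
    active_experts_per_token=2,
    layers=32,
)
print(f"routing volume: {volume} bytes per token ({volume / 2**20:.0f} MiB)")
```

With the field values used in the example on this page, the sketch gives 1 MiB of dispatch-plus-combine traffic per token. A real estimate would be lower whenever some selected experts are local or only a subset of layers is sparse.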

Example

from mlsysim import SparseTransformerWorkload, ureg

moe = SparseTransformerWorkload(
    name="Toy-MoE-64B",
    architecture="Sparse Transformer",
    parameters=64e9 * ureg.count,
    active_parameters=8e9 * ureg.count,
    experts=8,
    active_experts_per_token=2,
    layers=32,
    hidden_dim=4096,
    heads=32,
)