solvers.EfficiencyModel
solvers.EfficiencyModel()
Models the gap between peak and achieved FLOPS (Wall 3: Software Efficiency).
This model quantifies the software efficiency of a workload — the fraction of peak hardware FLOPS that the software stack actually converts into useful computation. It decomposes Model FLOPs Utilization (MFU) by workload type, accounting for kernel fusion efficiency, SM occupancy, and memory access patterns.
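The ratio described above can be illustrated with a minimal, standalone sketch. It does not use or reproduce `EfficiencyModel`. The helper names (`model_flops_utilization`, `training_flops_per_s`), the rule of thumb of roughly 6 FLOPs per parameter per token for dense Transformer training, and every number are assumptions chosen for illustration.

```python
def model_flops_utilization(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Return the fraction of peak hardware FLOPS spent on useful model computation."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    if achieved_flops_per_s < 0:
        raise ValueError("achieved_flops_per_s must be non-negative")
    return achieved_flops_per_s / peak_flops_per_s


def training_flops_per_s(n_params: float, tokens_per_s: float) -> float:
    """Approximate useful training FLOPS from observed token throughput.

    Assumption: about 6 FLOPs per parameter per token (forward plus backward)
    for a dense Transformer, ignoring attention FLOPs.
    """
    return 6.0 * n_params * tokens_per_s


# Illustrative numbers only, not measurements of any real system.
achieved = training_flops_per_s(n_params=1e9, tokens_per_s=20_000)
mfu = model_flops_utilization(achieved, peak_flops_per_s=3e14)
print(f"MFU = {mfu:.2f}")
```

With these made-up inputs the accelerator would be doing useful model work at 40% of its peak rate; the remaining 60% is the gap this class is meant to model.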
Literature Sources:
1. Chowdhery et al. (2022), “PaLM: Scaling Language Modeling with Pathways.” (First systematic MFU reporting for large Transformers.)
2. Dao et al. (2022), “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” (FlashAttention MFU improvement.)
3. NVIDIA (2023), “Hopper Architecture Tuning Guide.” (SM Occupancy model.)
Methods
| Name | Description |
|---|---|
| solve | Estimates achievable MFU and FLOPS for a given workload type. |
solve
solvers.EfficiencyModel.solve(
model,
hardware,
workload_type='ffn',
use_flash_attention=False,
precision='fp16',
efficiency=0.5,
)
Estimates achievable MFU and FLOPS for a given workload type.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | Workload | The model architecture to simulate. | required |
| hardware | HardwareNode | The target hardware node. | required |
| workload_type | str | The dominant kernel type (‘attention’, ‘ffn’, ‘conv’). | 'ffn' |
| use_flash_attention | bool | Whether FlashAttention is enabled (only applies to ‘attention’). | False |
| precision | str | Numerical precision (‘fp16’, ‘fp32’, ‘int8’, ‘int4’). | 'fp16' |
| efficiency | float | Base compute efficiency factor (0.0 to 1.0). | 0.5 |
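The constraints listed in the table can be checked before calling `solve`. The sketch below is a hypothetical pre-check, not part of the library: the helper name `check_solve_args` and the `flash_attention_applies` key are invented for illustration, and this page does not say whether `solve` performs any validation of its own.

```python
from typing import Any, Dict

# Allowed values as listed in the parameter table.
WORKLOAD_TYPES = {"attention", "ffn", "conv"}
PRECISIONS = {"fp16", "fp32", "int8", "int4"}


def check_solve_args(
    workload_type: str = "ffn",
    use_flash_attention: bool = False,
    precision: str = "fp16",
    efficiency: float = 0.5,
) -> Dict[str, Any]:
    """Hypothetical helper: validate keyword arguments against the documented options."""
    if workload_type not in WORKLOAD_TYPES:
        raise ValueError(f"workload_type must be one of {sorted(WORKLOAD_TYPES)}")
    if precision not in PRECISIONS:
        raise ValueError(f"precision must be one of {sorted(PRECISIONS)}")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0.0, 1.0]")
    # Documented behaviour: FlashAttention only applies to 'attention' workloads.
    flash_applies = bool(use_flash_attention) and workload_type == "attention"
    return {
        "workload_type": workload_type,
        "use_flash_attention": use_flash_attention,
        "precision": precision,
        "efficiency": efficiency,
        "flash_attention_applies": flash_applies,  # illustrative key, not a library field
    }


# FlashAttention is requested, but the workload is 'ffn', so the flag has no effect.
checked = check_solve_args(workload_type="ffn", use_flash_attention=True)
print(checked["flash_attention_applies"])
```

Surfacing the no-effect case explicitly can help catch configurations where `use_flash_attention=True` is set alongside a non-attention `workload_type`.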
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[str, Any] | MFU estimate, achievable FLOPS, and overhead breakdown. |