solvers.InferenceScalingModel
solvers.InferenceScalingModel()
Models inference-time compute scaling (Wall 12: Reasoning/CoT Cost).
This model quantifies the cost of ‘System-2 thinking’ — inference-time compute scaling via chain-of-thought (CoT) reasoning, where the model generates K intermediate reasoning steps before producing the final answer. Each step incurs the full cost of autoregressive decoding.
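The scaling described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the function name, `tokens_per_step`, and `answer_tokens` are hypothetical, and the cost per generated token is assumed to be roughly 2 FLOPs per model parameter, ignoring attention over the context.

```python
def estimate_reasoning_cost(
    n_params: float,
    peak_flops: float,
    reasoning_steps: int = 8,
    tokens_per_step: int = 64,  # hypothetical: tokens generated per reasoning step
    answer_tokens: int = 64,    # hypothetical: tokens in the final answer
    efficiency: float = 0.5,
) -> dict:
    """Rough decode-time estimate for K chain-of-thought steps (illustrative only)."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0.0, 1.0]")

    # Assumption: one forward pass costs about 2 FLOPs per parameter per token.
    flops_per_token = 2.0 * n_params
    sustained_flops = peak_flops * efficiency

    reasoning_tokens = reasoning_steps * tokens_per_step
    total_tokens = reasoning_tokens + answer_tokens

    return {
        "reasoning_tokens": reasoning_tokens,
        "total_tokens": total_tokens,
        "total_time_s": total_tokens * flops_per_token / sustained_flops,
        # Cost relative to answering directly with no reasoning steps.
        "overhead_vs_direct": total_tokens / answer_tokens,
    }


if __name__ == "__main__":
    # Example: a 7B-parameter model on hardware with 1 PFLOP/s peak throughput.
    print(estimate_reasoning_cost(n_params=7e9, peak_flops=1e15))
```

Under these assumptions the cost grows linearly in K: doubling `reasoning_steps` doubles the number of reasoning tokens that must be decoded before the answer.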
Literature Source:
1. Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.”
2. Snell et al. (2024), “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.”
3. OpenAI (2024), “Learning to Reason with LLMs” (o1 reasoning model).
Methods
| Name | Description |
|---|---|
| solve | Solves for inference-time reasoning cost. |
solve
solvers.InferenceScalingModel.solve(
model,
hardware,
reasoning_steps=8,
context_length=2048,
precision='fp16',
efficiency=0.5,
)
Solves for inference-time reasoning cost.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | TransformerWorkload | The language model used for reasoning. | required |
| hardware | HardwareNode | The target hardware node. | required |
| reasoning_steps | int | Number of reasoning steps K (each generates tokens). | 8 |
| context_length | int | Input context length in tokens. | 2048 |
| precision | str | Numerical precision. | 'fp16' |
| efficiency | float | Compute efficiency factor (0.0 to 1.0). | 0.5 |
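As a rough illustration of how `precision` and `efficiency` might enter a per-token estimate, the sketch below uses a simple roofline-style bound. It is an assumption, not the solver's documented behavior: the function name, the `mem_bandwidth` argument, and the precision names other than `'fp16'` are hypothetical.

```python
# Hypothetical mapping from precision name to bytes per parameter.
# Only 'fp16' appears in the documented signature; the rest are assumptions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def decode_time_per_token(
    n_params: float,
    peak_flops: float,
    mem_bandwidth: float,  # hypothetical: memory bandwidth in bytes/s
    precision: str = "fp16",
    efficiency: float = 0.5,
) -> float:
    """Roofline-style lower bound on the time to decode one token, in seconds."""
    bytes_per_param = BYTES_PER_PARAM[precision]

    # Compute bound: about 2 FLOPs per parameter, derated by the efficiency factor.
    compute_s = 2.0 * n_params / (peak_flops * efficiency)

    # Memory bound: every weight is read once per generated token.
    memory_s = n_params * bytes_per_param / mem_bandwidth

    # The slower of the two limits dominates.
    return max(compute_s, memory_s)


if __name__ == "__main__":
    # Example: 7B parameters, 1 PFLOP/s peak, 2 TB/s memory bandwidth.
    for p in ("fp16", "int8"):
        print(p, decode_time_per_token(7e9, 1e15, 2e12, precision=p))
```

In this example the memory term dominates, so lowering the precision shortens the estimate while changing `efficiency` has no effect until the compute term becomes the larger of the two.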
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[str, Any] | Total reasoning time, cost per query, and token counts. |