solvers.InferenceScalingModel

solvers.InferenceScalingModel()

Models inference-time compute scaling (Wall 12: Reasoning/CoT Cost).

This model quantifies the cost of ‘System-2 thinking’ — inference-time compute scaling via chain-of-thought (CoT) reasoning, where the model generates K intermediate reasoning steps before producing the final answer. Each step incurs the full cost of autoregressive decoding.
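As a rough illustration of why the cost grows with K, the sketch below counts generated tokens and forward-pass FLOPs for a chain-of-thought query. It is not the library's implementation. The names `CotCost`, `cot_decode_cost`, `tokens_per_step`, and `answer_tokens` are hypothetical. It assumes the common approximation of about 2 FLOPs per parameter per generated token and ignores attention over the context.

```python
from dataclasses import dataclass


@dataclass
class CotCost:
    """Hypothetical container for the illustration below."""
    generated_tokens: int
    total_flops: float


def cot_decode_cost(
    num_params: float,
    reasoning_steps: int,
    tokens_per_step: int,
    answer_tokens: int = 0,
) -> CotCost:
    """Estimate decode work for K reasoning steps (illustrative only).

    Assumption: each generated token costs roughly 2 * num_params FLOPs
    (one multiply-add per weight); attention cost is ignored.
    """
    # Every reasoning step is decoded token by token, so the generated
    # token count scales linearly with the number of steps K.
    generated = reasoning_steps * tokens_per_step + answer_tokens
    flops = 2.0 * num_params * generated
    return CotCost(generated_tokens=generated, total_flops=flops)


# Example: a 7B-parameter model, K = 8 steps, 128 tokens per step (assumed).
cost = cot_decode_cost(num_params=7e9, reasoning_steps=8, tokens_per_step=128)
print(cost.generated_tokens, cost.total_flops)
```

Under these assumptions, doubling K doubles both the generated token count and the decode FLOPs.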

Literature Sources:

1. Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.”
2. Snell et al. (2024), “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.”
3. OpenAI (2024), “Learning to Reason with LLMs.” (o1 reasoning model.)

Methods

| Name | Description |
|------|-------------|
| solve | Solves for inference-time reasoning cost. |

solve

solvers.InferenceScalingModel.solve(
    model,
    hardware,
    reasoning_steps=8,
    context_length=2048,
    precision='fp16',
    efficiency=0.5,
)

Solves for inference-time reasoning cost.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model | TransformerWorkload | The language model used for reasoning. | required |
| hardware | HardwareNode | The target hardware node. | required |
| reasoning_steps | int | Number of reasoning steps K (each generates tokens). | 8 |
| context_length | int | Input context length in tokens. | 2048 |
| precision | str | Numerical precision. | 'fp16' |
| efficiency | float | Compute efficiency factor (0.0 to 1.0). | 0.5 |
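To make the roles of `precision` and `efficiency` concrete, here is a minimal roofline-style sketch of the time to decode one token. It is an assumption about how such parameters are typically used, not a description of what `solve` computes. The function `decode_time_per_token`, the `BYTES_PER_PARAM` table, and the hardware numbers are all hypothetical, and KV-cache traffic (which would grow with `context_length`) is left out.

```python
# Bytes needed to store one weight at each precision (standard sizes).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def decode_time_per_token(
    num_params: float,
    peak_flops: float,
    mem_bandwidth: float,
    precision: str = "fp16",
    efficiency: float = 0.5,
) -> float:
    """Seconds per generated token under a simple roofline assumption.

    Assumptions (illustrative): ~2 FLOPs per parameter per token,
    `efficiency` derates peak compute only, and every weight is read
    from memory once per token. KV-cache reads are ignored.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0.0, 1.0]")
    compute_s = 2.0 * num_params / (peak_flops * efficiency)
    memory_s = num_params * BYTES_PER_PARAM[precision] / mem_bandwidth
    # The slower of the two bounds determines the per-token latency.
    return max(compute_s, memory_s)


# Illustrative numbers: 7B parameters, 312 TFLOP/s peak, 2 TB/s bandwidth.
t_fp16 = decode_time_per_token(7e9, 312e12, 2.0e12, precision="fp16")
t_int8 = decode_time_per_token(7e9, 312e12, 2.0e12, precision="int8")
print(t_fp16, t_int8)
```

With these example numbers the memory bound dominates, so halving the bytes per parameter roughly halves the per-token time, while changing `efficiency` has no effect until the compute bound becomes the larger term.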

Returns

| Name | Type | Description |
|------|------|-------------|
| | Dict[str, Any] | Total reasoning time, cost per query, and token counts. |