K=8 reasoning steps multiply your serving bill by 7.6x — a simple algorithm choice becomes a capital decision.
Use InferenceScalingModel and EconomicsModel to quantify how chain-of-thought reasoning multiplies infrastructure cost from $1.2M to $9.1M annually.
The Question
Your team wants to add chain-of-thought (CoT) reasoning to the production serving pipeline. The accuracy improvement is clear: K=8 reasoning steps measurably improve answer quality on hard queries. But what does K=8 cost? Not in tokens — in dollars, GPUs, and annual infrastructure budget. Is this an algorithm decision or a capital expenditure decision?
Calculate per-query latency and energy cost across reasoning depths K=1 to K=16
ComposeInferenceScalingModel with EconomicsModel for annualized fleet cost
Quantify the cost multiplier of K=8 reasoning at 100 QPS fleet scale
Evaluate a routing strategy that sends easy queries to a cheap model and hard queries to an expensive reasoning model
Background: Inference-Time Compute Scaling
Standard LLM inference generates one answer directly: prefill the prompt (TTFT), then decode tokens one at a time (ITL). Chain-of-thought reasoning changes this: the model generates K intermediate “thinking” steps, each producing dozens of tokens, before the final answer. The cost model becomes:
\[T_{\text{reason}} = \text{TTFT} + K \times T_{\text{step}}\]
where \(T_{\text{step}} = \text{tokens\_per\_step} \times \text{ITL}\). Each step is memory-bound (decoding), so the cost scales linearly with K but at the decode rate — the expensive, memory-bandwidth-limited phase. A seemingly algorithmic choice (add more reasoning) translates directly into GPU-hours and dollars.
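The cost model can be sketched directly in a few lines. The TTFT, ITL, and tokens-per-step values here are illustrative assumptions, not measurements from the runs below:

```python
def reasoning_latency_ms(k, ttft_ms=500.0, itl_ms=20.0, tokens_per_step=50):
    """T_reason = TTFT + K * T_step, where T_step = tokens_per_step * ITL."""
    t_step_ms = tokens_per_step * itl_ms  # 50 tokens * 20 ms = 1000 ms per step
    return ttft_ms + k * t_step_ms

baseline = reasoning_latency_ms(1)  # 500 + 1000 = 1500 ms
deep = reasoning_latency_ms(8)      # 500 + 8000 = 8500 ms
print(f"K=1: {baseline:.0f} ms, K=8: {deep:.0f} ms ({deep / baseline:.2f}x)")
```

Even with made-up constants, the shape is visible: one fixed prefill cost, then a decode term that grows linearly in K.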
1. Setup
```python
import mlsysim
from mlsysim import Engine
```
2. Baseline Serving: Single-Query Cost
First, establish the baseline: a GPT-4 scale model served on an H100, no reasoning. This gives us the per-query TTFT and ITL that everything else builds on.
The ITL — the per-token decode latency — is the critical number. Every reasoning step generates dozens of tokens at this rate. That is where the cost accumulates.
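To see how quickly decode comes to dominate, here is a small sketch with assumed TTFT and ITL values (illustrative, not the baseline's actual numbers):

```python
ttft_ms, itl_ms, tokens_per_step = 500.0, 20.0, 50  # assumed values

for k in (1, 4, 8, 16):
    decode_ms = k * tokens_per_step * itl_ms  # every generated token pays the ITL
    total_ms = ttft_ms + decode_ms
    print(f"K={k:2d}: {total_ms:7.0f} ms total, {100 * decode_ms / total_ms:.0f}% spent decoding")
```

Under these assumptions, decode is already two-thirds of the query at K=1 and over 95% by K=16.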
3. CoT Sweep: K=1 to K=16
Now sweep reasoning depth using the InferenceScalingModel. Each step generates tokens of intermediate reasoning. Watch how the cost multiplier grows.
```
 K    Total Time     Tokens   Energy (J)   Multiplier
 ────────────────────────────────────────────────────
 1      68228.6 ms       50      47760.0        1.0x
 4     229179.3 ms      200     160425.5        3.4x
 8     443780.4 ms      400     310646.3        6.5x
16     872982.4 ms      800     611087.7       12.8x
```
K=8 does not cost exactly 8x the baseline. The actual multiplier reflects the structure of the cost: one fixed TTFT plus K decode phases that scale. Because every query still pays that one TTFT, the multiplier grows linearly in K but stays a constant fraction below it — which is why K=16 lands at 12.8x rather than 16x.
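The multiplier's structure is easy to verify in isolation. A sketch of the ratio, with TTFT and per-step time chosen purely for illustration (roughly matching the ~0.8·K trend in the table above):

```python
def cost_multiplier(k, ttft_ms=250.0, step_ms=1000.0):
    # cost(K) / cost(1): one fixed TTFT plus K scaling decode phases
    return (ttft_ms + k * step_ms) / (ttft_ms + step_ms)

for k in (4, 8, 16):
    print(f"K={k:2d}: {cost_multiplier(k):.1f}x (naive estimate: {k}x)")
```

The gap between the actual multiplier and the naive Kx estimate is exactly the amortized TTFT.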
4. The $9M Question: Annualized Fleet Cost
Per-query cost is interesting. Fleet-level cost is what matters. Let’s compute the annual infrastructure cost of serving 100 queries per second at K=1 vs. K=8.
```python
from mlsysim import EconomicsModel
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.core.constants import Q_

econ = EconomicsModel()
target_qps = 100

fleet_results = {}
fleet_objects = {}
rows = []
for K in K_values:
    r = results[K]
    qt_s = r.total_reasoning_time.to("s").magnitude
    # Each GPU serves one query at a time (batch_size=1)
    qps_per_gpu = 1.0 / qt_s if qt_s > 0 else 0
    gpus_needed = int(target_qps / qps_per_gpu) + 1
    # Build fleet: 8 GPUs per node
    fleet = Fleet(
        name=f"K={K} Serving",
        node=Node(
            name="H100 Node",
            accelerator=hardware,
            accelerators_per_node=8,
            intra_node_bw=Q_("900 GB/s"),
        ),
        count=max((gpus_needed + 7) // 8, 1),
        fabric=NetworkFabric(
            name="IB NDR",
            bandwidth=Q_("400 Gbps").to("GB/s"),
        ),
    )
    tco = econ.solve(fleet=fleet, duration_days=365)
    fleet_results[K] = tco
    fleet_objects[K] = fleet
    rows.append([
        K,
        f"{qt_s * 1000:.1f}ms",
        f"{qps_per_gpu:.2f}",
        fleet.total_accelerators,
        f"${tco.tco_usd:,.0f}",
    ])

table(["K", "Query (ms)", "QPS/GPU", "GPUs", "Annual TCO ($)"], rows)
```
The jump from K=1 to K=8 is not just a latency increase — it propagates through the entire infrastructure stack: more GPUs, more power, more cooling, more network fabric, more capital expenditure.
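The fleet arithmetic behind that propagation is simple: at a fixed QPS target, the number of GPUs scales with per-query latency. A sketch with assumed latencies (illustrative, not the sweep's measured values):

```python
import math

target_qps = 100
# Assumed per-query latencies in seconds at batch_size=1
query_s = {1: 1.5, 8: 8.5}

# One GPU serves 1/latency queries per second, so GPUs needed = ceil(QPS * latency)
gpus = {k: math.ceil(target_qps * t) for k, t in query_s.items()}
print(gpus, f"fleet multiplier: {gpus[8] / gpus[1]:.2f}x")
```

Whatever the absolute numbers, the fleet multiplier tracks the latency multiplier, and annual TCO follows the fleet.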
Key Insight
A seemingly algorithmic decision — “add more reasoning steps” — is actually an infrastructure spending decision. K=8 chain-of-thought reasoning multiplies per-query latency by approximately 7-8x, which means you need 7-8x more GPUs to maintain the same QPS. Annual TCO scales proportionally. The decision to add CoT reasoning is not a model architecture choice — it is a capital expenditure decision that belongs in the CFO’s budget, not just the ML engineer’s notebook.
5. The Routing Argument
The $9M annual TCO reframes the conversation from model architecture to capital planning. But it also raises an obvious question: must every query pay the full reasoning cost?
In production, the answer is no — and the optimization is routing. Smart routing sends easy queries to a fast, cheap model and only routes hard queries to the expensive reasoning pipeline. Let’s model a 70/30 split.
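Before running the full model, the blended cost of a 70/30 split can be sketched with back-of-the-envelope numbers (the 6.5x hard-path cost is taken from the sweep table; the split and baseline cost are illustrative):

```python
easy_share, hard_share = 0.70, 0.30
cost_easy, cost_hard = 1.0, 6.5  # per-query cost relative to the K=1 baseline

blended = easy_share * cost_easy + hard_share * cost_hard
print(f"blended cost: {blended:.2f}x baseline vs. {cost_hard:.1f}x without routing")
```

Routing cuts the average cost by more than half in this sketch, but the hard-path capacity still has to be provisioned.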
Routing reduces the fleet but does not eliminate the infrastructure commitment. The 30% of queries that still need full reasoning require dedicated GPU capacity, and the routing classifier itself introduces latency and complexity. The $9M is the ceiling; routing contains it but does not make the capital planning question go away.
Your Turn
Exercises
Exercise 1: Predict before you compute. If K=8 costs approximately 7.6x the baseline, predict: does K=16 cost exactly 16x? More? Less? Write your reasoning (consider the fixed TTFT cost), then check the actual numbers from the sweep table. Explain the gap.
Exercise 2: Replace H100 with B200. Use Hardware.Cloud.B200 (roughly 2x the memory bandwidth of H100) and re-run the K=8 analysis. Predict first: will the absolute cost multiplier (K=8 vs. K=1) be the same, larger, or smaller on the B200? Will the fleet size for 100 QPS change? Run the numbers and explain.
Exercise 3: At what K does 70B + reasoning exceed GPT-4 + no reasoning? Use Models.Llama3_70B with increasing K values and Models.GPT4 with K=1. Find the K value at which the 70B model’s per-query latency exceeds GPT-4’s K=1 latency. What does this tell you about the trade-off between model size and reasoning depth?
Self-check: If ITL is 5ms and each reasoning step generates 50 tokens, what is the total decode time for K=8? (Answer: 8 x 50 x 5ms = 2000ms = 2 seconds of pure decode time, plus TTFT.)
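The self-check arithmetic, as a runnable one-liner:

```python
itl_ms, tokens_per_step, k = 5, 50, 8
decode_ms = k * tokens_per_step * itl_ms  # 8 steps * 50 tokens * 5 ms
print(decode_ms)  # 2000 ms of pure decode, before adding TTFT
```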
Key Takeaways
Summary
K reasoning steps multiply per-query latency by approximately K (slightly less due to fixed TTFT)
The cost multiplier propagates through the stack: more latency means more GPUs, more power, more cost
Annual TCO scales linearly with fleet size: K=8 reasoning can turn a $1M serving bill into $9M
Routing is the production answer: send easy queries to cheap models, hard queries to expensive reasoning
Algorithm choices are infrastructure decisions: adding CoT reasoning belongs in the budget planning process