The $9M Question

K=8 reasoning steps multiply your serving bill by 7.6x — a simple algorithm choice becomes a capital decision.

ops
intermediate
Use InferenceScalingModel and EconomicsModel to quantify how chain-of-thought reasoning multiplies infrastructure cost from $1.2M to $9.1M annually.

The Question

Your team wants to add chain-of-thought (CoT) reasoning to the production serving pipeline. The accuracy improvement is clear: K=8 reasoning steps measurably improve answer quality on hard queries. But what does K=8 cost? Not in tokens — in dollars, GPUs, and annual infrastructure budget. Is this an algorithm decision or a capital expenditure decision?

Note: Prerequisites

Complete Tutorial 2: Two Phases, One Request and Tutorial 3: The KV Cache Wall. You should understand TTFT, ITL, and the two-phase serving model.

Note: What You Will Learn
  • Calculate per-query latency and energy cost across reasoning depths K=1 to K=16
  • Compose InferenceScalingModel with EconomicsModel for annualized fleet cost
  • Quantify the cost multiplier of K=8 reasoning at 100 QPS fleet scale
  • Evaluate a routing strategy that sends easy queries to a cheap model and hard queries to an expensive reasoning model

Tip: Background: Inference-Time Compute Scaling

Standard LLM inference generates one answer directly: prefill the prompt (TTFT), then decode tokens one at a time (ITL). Chain-of-thought reasoning changes this: the model generates K intermediate “thinking” steps, each producing dozens of tokens, before the final answer. The cost model becomes:

\[T_{\text{reason}} = \text{TTFT} + K \times T_{\text{step}}\]

where \(T_{\text{step}} = \text{tokens\_per\_step} \times \text{ITL}\). Each step is memory-bound (decoding), so the cost scales linearly with K but at the decode rate — the expensive, memory-bandwidth-limited phase. A seemingly algorithmic choice (add more reasoning) translates directly into GPU-hours and dollars.
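The cost model above can be sketched in a few lines of plain Python. The TTFT, ITL, and tokens-per-step values here are illustrative placeholders, not outputs of mlsysim:

```python
def reasoning_latency_ms(ttft_ms: float, itl_ms: float,
                         tokens_per_step: int, k: int) -> float:
    """T_reason = TTFT + K * (tokens_per_step * ITL)."""
    t_step_ms = tokens_per_step * itl_ms  # one memory-bound decode phase
    return ttft_ms + k * t_step_ms

# Illustrative numbers: 500 ms prefill, 20 ms/token decode, 50-token steps
for k in (1, 4, 8, 16):
    print(k, reasoning_latency_ms(500.0, 20.0, 50, k))
```

Note the fixed TTFT term: doubling K doubles the decode work but not the prefill, which is why the cost multipliers measured later in this tutorial fall short of K.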


1. Setup

import mlsysim
from mlsysim import Engine

2. Baseline Serving: Single-Query Cost

First, establish the baseline: a GPT-4 scale model served on an H100, no reasoning. This gives us the per-query TTFT and ITL that everything else builds on.

from mlsysim import Models, Hardware, ServingModel
from mlsysim.show import table, info

model = Models.GPT4
hardware = Hardware.Cloud.H100

serving = ServingModel()
baseline = serving.solve(
    model=model, hardware=hardware,
    seq_len=2048, batch_size=1, precision="fp16"
)

info("Baseline Serving",
     Model=f"{model.name} ({model.parameters.to('Gcount'):.0f} params)",
     Hardware=hardware.name,
     TTFT=baseline.ttft.to('ms'),
     ITL=baseline.itl.to('ms'),
     Memory_Used=f"{baseline.memory_utilization:.0%}",
     Feasible=f"{baseline.feasible}")
── Baseline Serving ────────────────────────
Model:        GPT-4 (1760 gigacount params)
Hardware:     NVIDIA H100
TTFT:         14,578.3 ms
ITL:          1,073.0 ms
Memory Used:  4138%
Feasible:     False

The ITL, the per-token decode latency, is the critical number: every reasoning step generates dozens of tokens at this rate, and that is where the cost accumulates. Note also Feasible: False. A 1.76T-parameter model does not fit on a single H100 (memory utilization reads 4138%), so the absolute latencies are extrapolated single-device numbers; the ratios between configurations, not the raw milliseconds, carry the rest of the analysis.


3. CoT Sweep: K=1 to K=16

Now sweep reasoning depth using the InferenceScalingModel. Each step generates 50 tokens of intermediate reasoning (see the Tokens column in the output). Watch how the cost multiplier grows.

from mlsysim import InferenceScalingModel

cot_solver = InferenceScalingModel()
K_values = [1, 4, 8, 16]

baseline_time = None
results = {}
rows = []

for K in K_values:
    result = cot_solver.solve(
        model=model, hardware=hardware,
        reasoning_steps=K, context_length=2048, precision="fp16"
    )
    total_ms = result.total_reasoning_time.to("ms").magnitude
    energy_j = result.energy_per_query.to("J").magnitude

    if baseline_time is None:
        baseline_time = total_ms

    multiplier = total_ms / baseline_time
    results[K] = result

    rows.append([K, f"{total_ms:.1f}ms", result.tokens_generated, f"{energy_j:.1f} J", f"{multiplier:.1f}x"])

table(["K", "Total Time", "Tokens", "Energy (J)", "Multiplier"], rows)
K   Total Time  Tokens  Energy (J)  Multiplier
──────────────────────────────────────────────
1    68228.6ms      50   47760.0 J        1.0x
4   229179.3ms     200  160425.5 J        3.4x
8   443780.4ms     400  310646.3 J        6.5x
16  872982.4ms     800  611087.7 J       12.8x

K=8 does not cost exactly 8x the baseline. The multiplier reflects the structure of the cost: one fixed TTFT plus K decode phases that scale. Relative to the K=1 baseline, the multiplier is (TTFT + K × T_step) / (TTFT + T_step), which grows linearly in K but stays below it: each additional step adds about 0.79x the baseline time here, which is why K=16 lands at 12.8x rather than 16x.
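You can back this structure out of the sweep table itself. Fitting T(K) = TTFT + K × T_step to the K=1 and K=16 rows recovers the baseline's printed TTFT almost exactly (a quick check using numbers copied from the output above):

```python
# Fit T(K) = TTFT + K * T_step to two rows of the sweep table.
t1, t16 = 68228.6, 872982.4   # total ms at K=1 and K=16 (from the table)
t_step = (t16 - t1) / 15      # ms per reasoning step
ttft = t1 - t_step            # recovered fixed prefill cost

multiplier = lambda k: (ttft + k * t_step) / t1
print(f"T_step ≈ {t_step:.0f} ms, TTFT ≈ {ttft:.0f} ms")
print(f"K=8 multiplier ≈ {multiplier(8):.1f}x")   # 6.5x, matching the table
print(f"slope ≈ {t_step / t1:.2f} per unit K")    # each step adds ~0.79x baseline
```

The recovered TTFT (≈14,578 ms) matches the baseline section's output, confirming that the sweep is exactly one prefill plus K memory-bound decode phases.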


4. The $9M Question: Annualized Fleet Cost

Per-query cost is interesting. Fleet-level cost is what matters. Let’s compute the annual infrastructure cost of serving 100 queries per second across the full K sweep, with K=1 vs. K=8 as the headline comparison.

from mlsysim import EconomicsModel
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.core.constants import Q_

econ = EconomicsModel()
target_qps = 100

fleet_results = {}
fleet_objects = {}
rows = []
for K in K_values:
    r = results[K]
    qt_s = r.total_reasoning_time.to("s").magnitude

    # Each GPU serves one query at a time (batch_size=1)
    qps_per_gpu = 1.0 / qt_s if qt_s > 0 else 0
    gpus_needed = int(target_qps / qps_per_gpu) + 1

    # Build fleet: 8 GPUs per node
    fleet = Fleet(
        name=f"K={K} Serving",
        node=Node(
            name="H100 Node",
            accelerator=hardware,
            accelerators_per_node=8,
            intra_node_bw=Q_("900 GB/s"),
        ),
        count=max((gpus_needed + 7) // 8, 1),
        fabric=NetworkFabric(
            name="IB NDR",
            bandwidth=Q_("400 Gbps").to("GB/s"),
        )
    )

    tco = econ.solve(fleet=fleet, duration_days=365)
    fleet_results[K] = tco
    fleet_objects[K] = fleet

    rows.append([K, f"{qt_s * 1000:.1f}ms", f"{qps_per_gpu:.2f}", fleet.total_accelerators, f"${tco.tco_usd:,.0f}"])

table(["K", "Query (ms)", "QPS/GPU", "GPUs", "Annual TCO ($)"], rows)
K   Query (ms)  QPS/GPU   GPUs  Annual TCO ($)
──────────────────────────────────────────────
1    68228.6ms     0.01   6824    $220,579,937
4   229179.3ms     0.00  22920    $740,869,307
8   443780.4ms     0.00  44384  $1,434,674,665
16  872982.4ms     0.00  87304  $2,822,026,788

The jump from K=1 to K=8 is not just a latency increase — it propagates through the entire infrastructure stack: more GPUs, more power, more cooling, more network fabric, more capital expenditure.
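The capacity arithmetic underneath that table is simple enough to check by hand. With batch_size=1, one GPU serves 1/T_query queries per second, so the fleet is roughly target QPS × query seconds, rounded up to whole 8-GPU nodes. A sketch using the K=8 query time printed above (the economics beyond GPU count, such as power and fabric, are not reproduced here):

```python
import math

target_qps = 100
qt_k8_s = 443.7804            # K=8 per-query time from the sweep table (s)

qps_per_gpu = 1.0 / qt_k8_s               # one query at a time per GPU
gpus = int(target_qps / qps_per_gpu) + 1  # same rounding as the notebook code
nodes = math.ceil(gpus / 8)               # packed into 8-GPU nodes
print(gpus, nodes, nodes * 8)             # 44379 GPUs -> 5548 nodes -> 44384 accelerators
```

The 44,384 accelerators match the K=8 row of the TCO table: fleet size, and therefore TCO, is a direct linear function of per-query latency.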

Important: Key Insight

A seemingly algorithmic decision ("add more reasoning steps") is actually an infrastructure spending decision. In this configuration, K=8 chain-of-thought reasoning multiplies per-query latency by roughly 6.5x (the fixed TTFT keeps it below 8x), which means roughly 6.5x more GPUs to hold the same QPS; annual TCO scales in proportion. Adding CoT reasoning is not just a model architecture choice: it is a capital expenditure decision that belongs in the CFO’s budget, not only in the ML engineer’s notebook.


5. The Routing Argument

The annualized fleet TCO reframes the conversation from model architecture to capital planning. But it also raises an obvious question: must every query pay the full reasoning cost?

In production, the answer is no — and the optimization is routing. Smart routing sends easy queries to a fast, cheap model and only routes hard queries to the expensive reasoning pipeline. Let’s model a 70/30 split.

# Scenario: 70% of queries go to Llama-3 70B (no reasoning, K=1)
# 30% go to GPT-4 with K=8 reasoning
model_cheap = Models.Llama3_70B

# Cheap path: Llama-3 70B, K=1
r_cheap = cot_solver.solve(
    model=model_cheap, hardware=hardware,
    reasoning_steps=1, context_length=2048, precision="fp16"
)

# Expensive path: GPT-4, K=8 (already computed)
r_expensive = results[8]

# Weighted average query time
qt_cheap = r_cheap.total_reasoning_time.to("s").magnitude
qt_expensive = r_expensive.total_reasoning_time.to("s").magnitude
qt_blended = 0.70 * qt_cheap + 0.30 * qt_expensive

# Compare: all queries to GPT-4 K=8 vs. routed
qps_blended = 1.0 / qt_blended if qt_blended > 0 else 0
gpus_blended = int(target_qps / qps_blended) + 1

fleet_routed = Fleet(
    name="Routed Serving",
    node=Node(name="H100", accelerator=hardware,
              accelerators_per_node=8, intra_node_bw=Q_("900 GB/s")),
    count=max((gpus_blended + 7) // 8, 1),
    fabric=NetworkFabric(name="IB", bandwidth=Q_("400 Gbps").to("GB/s")),
)

tco_routed = econ.solve(fleet=fleet_routed, duration_days=365)
tco_all_k8 = fleet_results[8]

savings = tco_all_k8.tco_usd - tco_routed.tco_usd
pct_savings = savings / tco_all_k8.tco_usd * 100 if tco_all_k8.tco_usd > 0 else 0

table(
    ["Strategy", "GPUs", "Annual TCO ($)"],
    [
        ["All queries -> GPT-4 K=8", fleet_objects[8].total_accelerators, f"${tco_all_k8.tco_usd:,.0f}"],
        ["70/30 routed", fleet_routed.total_accelerators, f"${tco_routed.tco_usd:,.0f}"],
    ]
)
info(Savings=f"${savings:,.0f} ({pct_savings:.0f}%)")
Strategy                   GPUs  Annual TCO ($)
───────────────────────────────────────────────
All queries -> GPT-4 K=8  44384  $1,434,674,665
70/30 routed              13536    $437,539,570
Savings:  $997,135,095 (70%)

Routing shrinks the fleet but does not eliminate the infrastructure commitment. The 30% of queries that still need full reasoning require dedicated GPU capacity, and the routing classifier itself adds latency and complexity. The all-K=8 fleet cost is the ceiling; routing contains it but does not make the capital planning question go away.
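The sensitivity to the split is worth sketching. If a fraction p of traffic needs the expensive path, blended GPU-time per query is (1 − p) × t_cheap + p × t_expensive, so cost relative to all-expensive serving is roughly linear in p. This sketch uses the K=8 time from the sweep table and a hypothetical cheap-path time (t_cheap is an assumed value, not an mlsysim output):

```python
t_exp = 443.78    # GPT-4 K=8 query time (s), from the sweep table
t_cheap = 3.1     # hypothetical Llama-3 70B K=1 query time (s), assumed

def cost_ratio(p_hard: float) -> float:
    """Blended fleet cost relative to routing everything to the K=8 path."""
    return ((1 - p_hard) * t_cheap + p_hard * t_exp) / t_exp

for p in (0.1, 0.3, 0.5):
    print(f"p_hard={p:.0%}: {cost_ratio(p):.0%} of the all-K=8 fleet")
```

At p_hard = 30% the ratio comes out near 30%, consistent with the ~70% savings printed above: when t_cheap ≪ t_exp, the hard-query fraction essentially is the cost fraction, so the accuracy of the routing classifier drives the budget.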


Your Turn

Caution: Exercises

Exercise 1: Predict before you compute. If K=8 costs approximately 6.5x the baseline, predict: does K=16 cost exactly 16x? More? Less? Write your reasoning (consider the fixed TTFT cost), then check the actual numbers from the sweep table. Explain the gap.

Exercise 2: Replace H100 with B200. Use Hardware.Cloud.B200 (roughly 2x the memory bandwidth of H100) and re-run the K=8 analysis. Predict first: will the absolute cost multiplier (K=8 vs. K=1) be the same, larger, or smaller on the B200? Will the fleet size for 100 QPS change? Run the numbers and explain.

Exercise 3: At what K does 70B + reasoning exceed GPT-4 + no reasoning? Use Models.Llama3_70B with increasing K values and Models.GPT4 with K=1. Find the K value at which the 70B model’s per-query latency exceeds GPT-4’s K=1 latency. What does this tell you about the trade-off between model size and reasoning depth?

Self-check: If ITL is 5ms and each reasoning step generates 50 tokens, what is the total decode time for K=8? (Answer: 8 x 50 x 5ms = 2000ms = 2 seconds of pure decode time, plus TTFT.)


Key Takeaways

Tip: Summary
  • K reasoning steps scale per-query latency roughly linearly in K, with the fixed TTFT keeping the multiplier below K (6.5x at K=8 in this run)
  • The cost multiplier propagates through the stack: more latency means more GPUs, more power, more cost
  • Annual TCO scales linearly with fleet size: K=8 reasoning can turn a $1M serving bill into $9M
  • Routing is the production answer: send easy queries to cheap models, hard queries to expensive reasoning
  • Algorithm choices are infrastructure decisions: adding CoT reasoning belongs in the budget planning process

Next Steps
