The $9M Question

K=8 reasoning steps multiply your serving bill by 7.6x — a simple algorithm choice becomes a capital decision.

ops
intermediate
Use InferenceScalingModel and EconomicsModel to quantify how chain-of-thought reasoning multiplies infrastructure cost from $1.2M to $9.1M annually.

The Question

Your team wants to add chain-of-thought (CoT) reasoning to the production serving pipeline. The accuracy improvement is clear: K=8 reasoning steps measurably improve answer quality on hard queries. But what does K=8 cost? Not in tokens — in dollars, GPUs, and annual infrastructure budget. Is this an algorithm decision or a capital expenditure decision?

Note: Prerequisites

Complete Tutorial 2: Two Phases, One Request and Tutorial 3: The KV Cache Wall. You should understand TTFT, ITL, and the two-phase serving model.

Note: What You Will Learn
  • Calculate per-query latency and energy cost across reasoning depths K=1 to K=16
  • Compose InferenceScalingModel with EconomicsModel for annualized fleet cost
  • Quantify the cost multiplier of K=8 reasoning at 100 QPS fleet scale
  • Evaluate a routing strategy that sends easy queries to a cheap model and hard queries to an expensive reasoning model

Tip: Background: Inference-Time Compute Scaling

Standard LLM inference generates one answer directly: prefill the prompt (TTFT), then decode tokens one at a time (ITL). Chain-of-thought reasoning changes this: the model generates K intermediate “thinking” steps, each producing dozens of tokens, before the final answer. The cost model becomes:

\[T_{\text{reason}} = \text{TTFT} + K \times T_{\text{step}}\]

where \(T_{\text{step}} = \text{tokens\_per\_step} \times \text{ITL}\). Each step is memory-bound (decoding), so the cost scales linearly with K but at the decode rate — the expensive, memory-bandwidth-limited phase. A seemingly algorithmic choice (add more reasoning) translates directly into GPU-hours and dollars.
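The cost model above can be sketched in a few lines of plain Python. The TTFT, ITL, and tokens-per-step values here are illustrative placeholders, not outputs of mlsysim:

```python
def reasoning_latency_ms(ttft_ms: float, itl_ms: float,
                         tokens_per_step: int, k: int) -> float:
    """T_reason = TTFT + K * (tokens_per_step * ITL)."""
    t_step_ms = tokens_per_step * itl_ms  # one memory-bound decode phase
    return ttft_ms + k * t_step_ms

# Illustrative numbers: 500 ms prefill, 20 ms/token decode, 50-token steps
for k in (1, 4, 8, 16):
    print(k, reasoning_latency_ms(500.0, 20.0, 50, k))
```

Note the fixed TTFT term: doubling K doubles the decode work but not the prefill, which is why the cost multipliers measured later in this tutorial fall short of K.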


1. Setup

import mlsysim
from mlsysim import Engine

2. Baseline Serving: Single-Query Cost

First, establish the baseline: a GPT-4 scale model served on an H100, no reasoning. This gives us the per-query TTFT and ITL that everything else builds on.

from mlsysim import Models, Hardware, ServingModel
from mlsysim.show import table, info

model = Models.GPT4
hardware = Hardware.Cloud.H100

serving = ServingModel()
baseline = serving.solve(
    model=model, hardware=hardware,
    seq_len=2048, batch_size=1, precision="fp16"
)

info("Baseline Serving",
     Model=f"{model.name} ({model.parameters.to('Gcount'):.0f} params)",
     Hardware=hardware.name,
     TTFT=baseline.ttft.to('ms'),
     ITL=baseline.itl.to('ms'),
     Memory_Used=f"{baseline.memory_utilization:.0%}",
     Feasible=f"{baseline.feasible}")
── Baseline Serving ────────────────────────
Model:        GPT-4 (1760 gigacount params)
Hardware:     NVIDIA H100
TTFT:         14,578.3 ms
ITL:          1,073.0 ms
Memory Used:  4138%
Feasible:     False

The ITL, the per-token decode latency, is the critical number: every reasoning step generates dozens of tokens at this rate, and that is where the cost accumulates. Note also Feasible: False. A 1.76T-parameter model does not fit on a single H100 (memory utilization reads 4138%), so the absolute latencies are extrapolated single-device numbers; the ratios between configurations, not the raw milliseconds, carry the rest of the analysis.


3. CoT Sweep: K=1 to K=16

Now sweep reasoning depth using the InferenceScalingModel. Each step generates 50 tokens of intermediate reasoning (see the Tokens column in the output). Watch how the cost multiplier grows.

from mlsysim import InferenceScalingModel

cot_solver = InferenceScalingModel()
K_values = [1, 4, 8, 16]

baseline_time = None
results = {}
rows = []

for K in K_values:
    result = cot_solver.solve(
        model=model, hardware=hardware,
        reasoning_steps=K, context_length=2048, precision="fp16"
    )
    total_ms = result.total_reasoning_time.to("ms").magnitude
    energy_j = result.energy_per_query.to("J").magnitude

    if baseline_time is None:
        baseline_time = total_ms

    multiplier = total_ms / baseline_time
    results[K] = result

    rows.append([K, f"{total_ms:.1f}ms", result.tokens_generated, f"{energy_j:.1f} J", f"{multiplier:.1f}x"])

table(["K", "Total Time", "Tokens", "Energy (J)", "Multiplier"], rows)
K   Total Time  Tokens  Energy (J)  Multiplier
──────────────────────────────────────────────
1    68228.6ms      50   47760.0 J        1.0x
4   229179.3ms     200  160425.5 J        3.4x
8   443780.4ms     400  310646.3 J        6.5x
16  872982.4ms     800  611087.7 J       12.8x

K=8 does not cost exactly 8x the baseline. The multiplier reflects the structure of the cost: one fixed TTFT plus K decode phases that scale. Relative to the K=1 baseline, the multiplier is (TTFT + K × T_step) / (TTFT + T_step), which grows linearly in K but stays below it: each additional step adds about 0.79x the baseline time here, which is why K=16 lands at 12.8x rather than 16x.
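You can back this structure out of the sweep table itself. Fitting T(K) = TTFT + K × T_step to the K=1 and K=16 rows recovers the baseline's printed TTFT almost exactly (a quick check using numbers copied from the output above):

```python
# Fit T(K) = TTFT + K * T_step to two rows of the sweep table.
t1, t16 = 68228.6, 872982.4   # total ms at K=1 and K=16 (from the table)
t_step = (t16 - t1) / 15      # ms per reasoning step
ttft = t1 - t_step            # recovered fixed prefill cost

multiplier = lambda k: (ttft + k * t_step) / t1
print(f"T_step ≈ {t_step:.0f} ms, TTFT ≈ {ttft:.0f} ms")
print(f"K=8 multiplier ≈ {multiplier(8):.1f}x")   # 6.5x, matching the table
print(f"slope ≈ {t_step / t1:.2f} per unit K")    # each step adds ~0.79x baseline
```

The recovered TTFT (≈14,578 ms) matches the baseline section's output, confirming that the sweep is exactly one prefill plus K memory-bound decode phases.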


4. The $9M Question: Annualized Fleet Cost

Per-query cost is interesting. Fleet-level cost is what matters. Let’s compute the annual infrastructure cost of serving 100 queries per second across the full K sweep, with K=1 vs. K=8 as the headline comparison.

from mlsysim import EconomicsModel
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.core.constants import Q_

econ = EconomicsModel()
target_qps = 100

fleet_results = {}
fleet_objects = {}
rows = []
for K in K_values:
    r = results[K]
    qt_s = r.total_reasoning_time.to("s").magnitude

    # Each GPU serves one query at a time (batch_size=1)
    qps_per_gpu = 1.0 / qt_s if qt_s > 0 else 0
    gpus_needed = int(target_qps / qps_per_gpu) + 1

    # Build fleet: 8 GPUs per node
    fleet = Fleet(
        name=f"K={K} Serving",
        node=Node(
            name="H100 Node",
            accelerator=hardware,
            accelerators_per_node=8,
            intra_node_bw=Q_("900 GB/s"),
        ),
        count=max((gpus_needed + 7) // 8, 1),
        fabric=NetworkFabric(
            name="IB NDR",
            bandwidth=Q_("400 Gbps").to("GB/s"),
        )
    )

    tco = econ.solve(fleet=fleet, duration_days=365)
    fleet_results[K] = tco
    fleet_objects[K] = fleet

    rows.append([K, f"{qt_s * 1000:.1f}ms", f"{qps_per_gpu:.2f}", fleet.total_accelerators, f"${tco.tco_usd:,.0f}"])

table(["K", "Query (ms)", "QPS/GPU", "GPUs", "Annual TCO ($)"], rows)
K   Query (ms)  QPS/GPU   GPUs  Annual TCO ($)
──────────────────────────────────────────────
1    68228.6ms     0.01   6824    $220,579,937
4   229179.3ms     0.00  22920    $740,869,307
8   443780.4ms     0.00  44384  $1,434,674,665
16  872982.4ms     0.00  87304  $2,822,026,788

The jump from K=1 to K=8 is not just a latency increase — it propagates through the entire infrastructure stack: more GPUs, more power, more cooling, more network fabric, more capital expenditure.
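The capacity arithmetic underneath that table is simple enough to check by hand. With batch_size=1, one GPU serves 1/T_query queries per second, so the fleet is roughly target QPS × query seconds, rounded up to whole 8-GPU nodes. A sketch using the K=8 query time printed above (the economics beyond GPU count, such as power and fabric, are not reproduced here):

```python
import math

target_qps = 100
qt_k8_s = 443.7804            # K=8 per-query time from the sweep table (s)

qps_per_gpu = 1.0 / qt_k8_s               # one query at a time per GPU
gpus = int(target_qps / qps_per_gpu) + 1  # same rounding as the notebook code
nodes = math.ceil(gpus / 8)               # packed into 8-GPU nodes
print(gpus, nodes, nodes * 8)             # 44379 GPUs -> 5548 nodes -> 44384 accelerators
```

The 44,384 accelerators match the K=8 row of the TCO table: fleet size, and therefore TCO, is a direct linear function of per-query latency.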

Important: Key Insight

A seemingly algorithmic decision ("add more reasoning steps") is actually an infrastructure spending decision. In this configuration, K=8 chain-of-thought reasoning multiplies per-query latency by roughly 6.5x (the fixed TTFT keeps it below 8x), which means roughly 6.5x more GPUs to hold the same QPS; annual TCO scales in proportion. Adding CoT reasoning is not just a model architecture choice: it is a capital expenditure decision that belongs in the CFO’s budget, not only in the ML engineer’s notebook.


5. The Routing Argument

The annualized fleet TCO reframes the conversation from model architecture to capital planning. But it also raises an obvious question: must every query pay the full reasoning cost?

In production, the answer is no — and the optimization is routing. Smart routing sends easy queries to a fast, cheap model and only routes hard queries to the expensive reasoning pipeline. Let’s model a 70/30 split.

# Scenario: 70% of queries go to Llama-3 70B (no reasoning, K=1)
# 30% go to GPT-4 with K=8 reasoning
model_cheap = Models.Llama3_70B

# Cheap path: Llama-3 70B, K=1
r_cheap = cot_solver.solve(
    model=model_cheap, hardware=hardware,
    reasoning_steps=1, context_length=2048, precision="fp16"
)

# Expensive path: GPT-4, K=8 (already computed)
r_expensive = results[8]

# Weighted average query time
qt_cheap = r_cheap.total_reasoning_time.to("s").magnitude
qt_expensive = r_expensive.total_reasoning_time.to("s").magnitude
qt_blended = 0.70 * qt_cheap + 0.30 * qt_expensive

# Compare: all queries to GPT-4 K=8 vs. routed
qps_blended = 1.0 / qt_blended if qt_blended > 0 else 0
gpus_blended = int(target_qps / qps_blended) + 1

fleet_routed = Fleet(
    name="Routed Serving",
    node=Node(name="H100", accelerator=hardware,
              accelerators_per_node=8, intra_node_bw=Q_("900 GB/s")),
    count=max((gpus_blended + 7) // 8, 1),
    fabric=NetworkFabric(name="IB", bandwidth=Q_("400 Gbps").to("GB/s")),
)

tco_routed = econ.solve(fleet=fleet_routed, duration_days=365)
tco_all_k8 = fleet_results[8]

savings = tco_all_k8.tco_usd - tco_routed.tco_usd
pct_savings = savings / tco_all_k8.tco_usd * 100 if tco_all_k8.tco_usd > 0 else 0

table(
    ["Strategy", "GPUs", "Annual TCO ($)"],
    [
        ["All queries -> GPT-4 K=8", fleet_objects[8].total_accelerators, f"${tco_all_k8.tco_usd:,.0f}"],
        ["70/30 routed", fleet_routed.total_accelerators, f"${tco_routed.tco_usd:,.0f}"],
    ]
)
info(Savings=f"${savings:,.0f} ({pct_savings:.0f}%)")
Strategy                   GPUs  Annual TCO ($)
───────────────────────────────────────────────
All queries -> GPT-4 K=8  44384  $1,434,674,665
70/30 routed              13536    $437,539,570
Savings:  $997,135,095 (70%)

Routing shrinks the fleet but does not eliminate the infrastructure commitment. The 30% of queries that still need full reasoning require dedicated GPU capacity, and the routing classifier itself adds latency and complexity. The all-K=8 fleet cost is the ceiling; routing contains it but does not make the capital planning question go away.
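The sensitivity to the split is worth sketching. If a fraction p of traffic needs the expensive path, blended GPU-time per query is (1 − p) × t_cheap + p × t_expensive, so cost relative to all-expensive serving is roughly linear in p. This sketch uses the K=8 time from the sweep table and a hypothetical cheap-path time (t_cheap is an assumed value, not an mlsysim output):

```python
t_exp = 443.78    # GPT-4 K=8 query time (s), from the sweep table
t_cheap = 3.1     # hypothetical Llama-3 70B K=1 query time (s), assumed

def cost_ratio(p_hard: float) -> float:
    """Blended fleet cost relative to routing everything to the K=8 path."""
    return ((1 - p_hard) * t_cheap + p_hard * t_exp) / t_exp

for p in (0.1, 0.3, 0.5):
    print(f"p_hard={p:.0%}: {cost_ratio(p):.0%} of the all-K=8 fleet")
```

At p_hard = 30% the ratio comes out near 30%, consistent with the ~70% savings printed above: when t_cheap ≪ t_exp, the hard-query fraction essentially is the cost fraction, so the accuracy of the routing classifier drives the budget.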


Your Turn

Caution: Exercises

Exercise 1: Predict before you compute. If K=8 costs approximately 6.5x the baseline, predict: does K=16 cost exactly 16x? More? Less? Write your reasoning (consider the fixed TTFT cost), then check the actual numbers from the sweep table. Explain the gap.

Exercise 2: Replace H100 with B200. Use Hardware.Cloud.B200 (roughly 2x the memory bandwidth of H100) and re-run the K=8 analysis. Predict first: will the absolute cost multiplier (K=8 vs. K=1) be the same, larger, or smaller on the B200? Will the fleet size for 100 QPS change? Run the numbers and explain.

Exercise 3: At what K does 70B + reasoning exceed GPT-4 + no reasoning? Use Models.Llama3_70B with increasing K values and Models.GPT4 with K=1. Find the K value at which the 70B model’s per-query latency exceeds GPT-4’s K=1 latency. What does this tell you about the trade-off between model size and reasoning depth?

Self-check: If ITL is 5ms and each reasoning step generates 50 tokens, what is the total decode time for K=8? (Answer: 8 x 50 x 5ms = 2000ms = 2 seconds of pure decode time, plus TTFT.)


Key Takeaways

Tip: Summary
  • K reasoning steps scale per-query latency roughly linearly in K, with the fixed TTFT keeping the multiplier below K (6.5x at K=8 in this run)
  • The cost multiplier propagates through the stack: more latency means more GPUs, more power, more cost
  • Annual TCO scales linearly with fleet size: K=8 reasoning can turn a $1M serving bill into $9M
  • Routing is the production answer: send easy queries to cheap models, hard queries to expensive reasoning
  • Algorithm choices are infrastructure decisions: adding CoT reasoning belongs in the budget planning process

Next Steps
