The Fleet Zoo

Vetted Nodes, Racks, Fabrics, and Multi-Node Clusters

The Fleet Zoo defines the Structural Context of ML systems, from single microcontrollers to warehouse-scale supercomputers. Systems entries compose hardware nodes, rack profiles, network fabrics, and cluster counts into complete structures that the DistributedModel, sustainability models, and book LEGO cells can analyze.

Note: Understanding System Hierarchy

A Fleet = Node × Count + Fabric. The Node specifies which accelerator is used and how many sit in each server. A RackProfile specifies a physical rack composition and its power envelope. The Fabric specifies how accelerators communicate: NVLink within a node, InfiniBand or Ethernet between nodes. Use the 3D Parallelism tutorial to see how these parameters affect training time.
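The composition rule above can be sketched with a few plain dataclasses. This is a minimal illustration, not the library's API: the class names `Node`, `Fabric`, and `Fleet` and all field names are hypothetical, and the values are copied from the Reference Nodes and Production Clusters tables on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a node entry (not the library's class)."""
    name: str
    accelerator: str
    accelerators_per_node: int
    intra_node_bw_gb_s: float  # intra-node bandwidth in GB/s


@dataclass(frozen=True)
class Fabric:
    """Hypothetical stand-in for an inter-node fabric entry."""
    name: str


@dataclass(frozen=True)
class Fleet:
    """Fleet = Node x Count + Fabric."""
    name: str
    node: Node
    node_count: int
    fabric: Fabric

    @property
    def total_accelerators(self) -> int:
        # Total GPUs follow from the node type and the node count.
        return self.node.accelerators_per_node * self.node_count


# Values taken from the tables on this page.
dgx_h100 = Node("DGX H100", "NVIDIA H100", 8, 450.0)
lab_cluster = Fleet("Lab Cluster (64 H100 GPUs)", dgx_h100, 8, Fabric("IB HDR"))

print(lab_cluster.name, "->", lab_cluster.total_accelerators, "GPUs")
```

Deriving the GPU total from the node type and node count, rather than storing it separately, keeps the two from drifting apart.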

Reference Nodes

| Node Name | Accelerator | Accelerators per Node | Host Memory | Intra-Node BW |
|---|---|---|---|---|
| DGX A100 | NVIDIA A100 | 8 | | 300.0 GB/s |
| DGX B200 | NVIDIA B200 | 8 | | 900.0 GB/s |
| DGX H100 | NVIDIA H100 | 8 | 2 TB | 450.0 GB/s |
| Kempner H100 4-GPU Server | NVIDIA H100 | 4 | | 450.0 GB/s |

Reference Racks

| Rack Name | Node Type | Nodes | Accelerators | Accelerator Power (kW) | Support Power (kW) | Total Power (kW) |
|---|---|---|---|---|---|---|
| DGX H100 4-node rack | DGX H100 | 4 | 32 | 22.4 | 11.1 | 33.5 |

Production Clusters

| Cluster Name | Node Type | Node Count | Total GPUs | Fabric |
|---|---|---|---|---|
| Frontier Cluster (8192 GPUs) | DGX H100 | 1024 | 8192 | IB NDR |
| Kempner AI Cluster H100 Partition (384 H100 GPUs) | Kempner H100 4-GPU Server | 96 | 384 | IB NDR |
| Lab Cluster (64 H100 GPUs) | DGX H100 | 8 | 64 | IB HDR |
| Mega Cluster (100000 GPUs) | DGX H100 | 12500 | 100000 | IB NDR |
| Production Cluster (2048 GPUs) | DGX H100 | 256 | 2048 | IB HDR |
| Reference H100 Cluster (25000 GPUs) | DGX H100 | 3125 | 25000 | IB NDR |
| Research Cluster (256 GPUs) | DGX H100 | 32 | 256 | 100GbE |
| Training Cluster (10000 GPUs) | DGX H100 | 1250 | 10000 | IB NDR |
| Training Cluster (1024 GPUs) | DGX H100 | 128 | 1024 | IB HDR |
| Training Cluster (1024 A100 GPUs) | DGX A100 | 128 | 1024 | IB HDR |
| Training Cluster (512 H100 GPUs) | DGX H100 | 64 | 512 | IB HDR |

How to Read the Fleet Zoo

Why Fleet Size Matters

Distributed training performance is dominated by communication overhead in large-scale systems such as DistBelief [1]. As you add more nodes, each all-reduce synchronization step must transfer gradient data across the fabric. The DistributedModel models this trade-off using the ring all-reduce cost model from Patarasuk and Yuan’s bandwidth-optimal all-reduce analysis [2]:

\[T_{\text{dp}} = 2(N-1) \cdot \left(\frac{M/N}{BW} + L\right)\]

where \(N\) is the GPU count, \(M\) is the model size in bytes, \(BW\) is the fabric bandwidth in bytes per second, and \(L\) is the per-step latency in seconds.
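The formula can be evaluated directly. The sketch below is a standalone transcription of the expression above, not the DistributedModel implementation. The function name `ring_allreduce_time` is hypothetical, `latency_s` stands for the \(L\) term (taken here to be a per-step latency in seconds), and the model size, bandwidth, and latency in the example are illustrative assumptions.

```python
def ring_allreduce_time(n_gpus: int, model_bytes: float,
                        bw_bytes_per_s: float, latency_s: float = 0.0) -> float:
    """Evaluate T_dp = 2(N-1) * ((M/N)/BW + L), returning seconds."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if bw_bytes_per_s <= 0:
        raise ValueError("bw_bytes_per_s must be positive")
    steps = 2 * (n_gpus - 1)            # reduce-scatter plus all-gather steps
    chunk_bytes = model_bytes / n_gpus  # each step moves one chunk of size M/N
    return steps * (chunk_bytes / bw_bytes_per_s + latency_s)


# Illustrative assumptions, not values from the Fleet Zoo:
MODEL_BYTES = 1e9   # 1 GB of gradients
BW = 50e9           # 50 GB/s fabric bandwidth
LATENCY = 5e-6      # 5 microseconds per step

for n in (8, 64, 512):
    t = ring_allreduce_time(n, MODEL_BYTES, BW, LATENCY)
    print(f"N={n:4d}  T_dp={t * 1e3:.2f} ms")
```

As \(N\) grows, the bandwidth term approaches \(2M/BW\) while the latency term \(2(N-1)L\) keeps growing linearly, which is why latency becomes more visible at large fleet sizes.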

Rack Profiles Are System Facts

Rack profiles are composed system facts, not one-off arithmetic. A rack table may explain the component breakdown, but the aggregate rack profile stays in Systems.Racks.* so every chapter uses the same node count, accelerator count, support power, and total power.
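One way to picture a rack profile as a composed fact is a frozen record whose aggregates are derived properties. The sketch below is an illustration only: the `RackProfile` fields and the `DGX_H100_RACK` constant are hypothetical names, not the contents of Systems.Racks.*, and the numbers come from the Reference Racks table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackProfile:
    """Hypothetical rack record; aggregates are derived, never retyped."""
    name: str
    node_type: str
    nodes: int
    accelerators_per_node: int
    accelerator_power_w: float  # all accelerators in the rack, in watts
    support_power_w: float      # everything else in the rack, in watts

    @property
    def accelerators(self) -> int:
        return self.nodes * self.accelerators_per_node

    @property
    def total_power_w(self) -> float:
        return self.accelerator_power_w + self.support_power_w


# Values from the Reference Racks table (22400 W and 11.1 kW).
DGX_H100_RACK = RackProfile(
    name="DGX H100 4-node rack",
    node_type="DGX H100",
    nodes=4,
    accelerators_per_node=8,
    accelerator_power_w=22_400.0,
    support_power_w=11_100.0,
)

print(DGX_H100_RACK.accelerators, "accelerators,",
      DGX_H100_RACK.total_power_w / 1000, "kW total")
```

Because the record is frozen and the totals are computed, any chapter that imports the same constant sees the same accelerator count and total power.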

The Fabric Matters More Than You Think

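Compare the Research Cluster (256 GPUs) with the Frontier Cluster (8192 GPUs) in the table above: both use DGX H100 nodes, but the first uses 100 GbE while the second uses InfiniBand NDR. At nominal per-port line rates that is 100 Gb/s versus 400 Gb/s, and the effective gap can be larger once multi-rail NICs and protocol overhead are counted. The difference dramatically changes scaling efficiency. Try both in the Scaling to 1000 GPUs tutorial to see the effect.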
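A rough feel for the fabric effect comes from plugging two bandwidths into the ring all-reduce formula from the previous section. The sketch below rests on several assumptions: the bandwidths are nominal per-port line rates converted to bytes per second (the effective values behind the library's fabric entries may differ), latency is set to zero, and the gradient size is an illustrative placeholder. The function and constant names are hypothetical.

```python
def ring_allreduce_time(n, model_bytes, bw_bytes_per_s, latency_s=0.0):
    """T_dp = 2(N-1) * ((M/N)/BW + L), in seconds."""
    return 2 * (n - 1) * ((model_bytes / n) / bw_bytes_per_s + latency_s)


GBIT = 1e9 / 8  # bytes per second carried by 1 Gb/s

# Assumed nominal per-port line rates, not measured or library values.
FABRIC_BW = {
    "100GbE": 100 * GBIT,
    "IB NDR": 400 * GBIT,
}

N_GPUS = 256       # size of the Research Cluster in the table above
MODEL_BYTES = 2e9  # illustrative gradient size (assumption)

times = {name: ring_allreduce_time(N_GPUS, MODEL_BYTES, bw)
         for name, bw in FABRIC_BW.items()}

for name, t in times.items():
    print(f"{name:8s} T_dp = {t * 1e3:.1f} ms")

slowdown = times["100GbE"] / times["IB NDR"]
print(f"100GbE takes {slowdown:.1f}x as long as IB NDR per all-reduce")
```

With latency set to zero, the ratio of the two times equals the ratio of the assumed bandwidths. Passing the same nonzero `latency_s` to both calls narrows that ratio.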

Deployment Tiers

The deployment tiers (Cloud, Edge, Mobile, Tiny) define the resource envelope for different deployment scenarios. Each tier specifies a RAM budget, a storage capacity, and a latency target; together these constraints determine which models are feasible.
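A feasibility check against a tier can be sketched as three comparisons. Everything in this example is hypothetical: the `DeploymentTier` class, the `fits` and `feasible_tiers` helpers, and every number in `PLACEHOLDER_TIERS` are placeholders chosen only to make the code run, not the budgets the Fleet Zoo defines. The check also assumes, as a simplification, that a single estimated latency applies on every tier.

```python
from dataclasses import dataclass

KIB, MIB, GIB = 2**10, 2**20, 2**30


@dataclass(frozen=True)
class DeploymentTier:
    """Hypothetical tier record: RAM budget, storage, latency target."""
    name: str
    ram_bytes: int
    storage_bytes: int
    latency_target_ms: float


def fits(tier: DeploymentTier, ram_bytes: int, storage_bytes: int,
         latency_ms: float) -> bool:
    """A model is feasible only if it meets all three constraints."""
    return (ram_bytes <= tier.ram_bytes
            and storage_bytes <= tier.storage_bytes
            and latency_ms <= tier.latency_target_ms)


# PLACEHOLDER budgets for illustration only; not the library's values.
PLACEHOLDER_TIERS = [
    DeploymentTier("Cloud", 512 * GIB, 10_240 * GIB, 200.0),
    DeploymentTier("Edge", 32 * GIB, 512 * GIB, 100.0),
    DeploymentTier("Mobile", 8 * GIB, 128 * GIB, 50.0),
    DeploymentTier("Tiny", 512 * KIB, 2 * MIB, 100.0),
]


def feasible_tiers(ram_bytes: int, storage_bytes: int,
                   latency_ms: float) -> list[str]:
    """Names of the placeholder tiers whose envelope the model fits."""
    return [t.name for t in PLACEHOLDER_TIERS
            if fits(t, ram_bytes, storage_bytes, latency_ms)]


small_model = feasible_tiers(200 * KIB, 1 * MIB, 20.0)
large_model = feasible_tiers(16 * GIB, 14 * GIB, 150.0)
print("small model fits:", small_model)
print("large model fits:", large_model)
```

In practice the latency of a model depends on the hardware of each tier, so a fuller version would estimate latency per tier rather than pass one number.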

Textbook Connection

The Distributed Training and Collective Communication chapters use fleet configurations to analyze scaling efficiency and communication bottlenecks. The Compute Infrastructure chapter covers the hardware composition of these clusters.


Note: For cluster MTBF and bisection bandwidth details, see the Fleet API Reference.

References

[1] J. Dean, G. S. Corrado, R. Monga, et al., “Large scale distributed deep networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
[2] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” Journal of Parallel and Distributed Computing, vol. 69, no. 2, pp. 117–124, 2009, doi: 10.1016/j.jpdc.2008.09.002.