The Fleet Zoo

Vetted Nodes, Racks, Fabrics, and Multi-Node Clusters

The Fleet Zoo defines the Structural Context of ML systems—from single microcontrollers to warehouse-scale supercomputers. Systems entries compose hardware nodes, rack profiles, network fabrics, and cluster counts into complete structures that the DistributedModel, sustainability models, and book LEGO cells can analyze.

Understanding System Hierarchy

A Fleet = Node × Count + Fabric. The Node specifies the accelerator type and how many accelerators each server holds. A RackProfile specifies a rack's physical composition and power envelope. The Fabric specifies how accelerators and nodes communicate (for example, NVLink within a node and InfiniBand between nodes). Use the 3D Parallelism tutorial to see how these parameters affect training time.
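The composition rule above can be sketched in a few lines of Python. This is a minimal illustration only: the class names `Node`, `Fabric`, and `Fleet` and their fields are hypothetical and are not the library's actual API. The numbers come from the DGX H100 node and the Lab Cluster listed in the tables on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One server box: which accelerator it holds, and how many."""
    name: str
    accelerator: str
    accelerators_per_node: int


@dataclass(frozen=True)
class Fabric:
    """Inter-node interconnect, identified by name only in this sketch."""
    name: str


@dataclass(frozen=True)
class Fleet:
    """Fleet = Node x Count + Fabric (hypothetical structure)."""
    name: str
    node: Node
    node_count: int
    fabric: Fabric

    @property
    def total_accelerators(self) -> int:
        # Total accelerators = accelerators per node x number of nodes.
        return self.node.accelerators_per_node * self.node_count


dgx_h100 = Node("DGX H100", "NVIDIA H100", 8)
lab = Fleet("Lab Cluster", dgx_h100, 8, Fabric("IB HDR"))
print(lab.total_accelerators)  # 64
```

Keeping the total as a derived property, rather than a stored field, means the node count and the GPU count cannot drift apart.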

Reference Nodes

Node Name	Accelerator	Accelerators per Node	Host Memory	Intra-Node Bandwidth
DGX A100	NVIDIA A100	8	—	300.0 GB/s
DGX B200	NVIDIA B200	8	—	900.0 GB/s
DGX H100	NVIDIA H100	8	2 TB	450.0 GB/s
Kempner H100 4-GPU Server	NVIDIA H100	4	—	450.0 GB/s

Reference Racks

Rack Name	Node Type	Nodes	Accelerators	Accelerator Power	Support Power	Total Power
DGX H100 4-node rack	DGX H100	4	32	22.4 kW	11.1 kW	33.5 kW
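The rack row can be reproduced with a short calculation. The sketch below assumes 700 W per accelerator, a figure implied by the table (22.4 kW of accelerator power across 32 accelerators) rather than stated in it. The function name `rack_power_w` is hypothetical.

```python
def rack_power_w(nodes, accelerators_per_node, accelerator_w, support_w):
    """Return (accelerator count, accelerator power, total power), power in watts."""
    accelerators = nodes * accelerators_per_node
    accelerator_power = accelerators * accelerator_w
    total_power = accelerator_power + support_w
    return accelerators, accelerator_power, total_power


# DGX H100 4-node rack: 4 nodes x 8 accelerators, 11.1 kW of support power.
# 700 W per accelerator is inferred from the table (22,400 W / 32).
accelerators, accelerator_power, total_power = rack_power_w(4, 8, 700.0, 11_100.0)
print(accelerators, accelerator_power, total_power)  # 32 22400.0 33500.0
```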

Production Clusters

Cluster Name	Node Type	Node Count	Total GPUs	Fabric
Frontier Cluster (8192 GPUs)	DGX H100	1024	8192	IB NDR
Kempner AI Cluster H100 Partition (384 H100 GPUs)	Kempner H100 4-GPU Server	96	384	IB NDR
Lab Cluster (64 H100 GPUs)	DGX H100	8	64	IB HDR
Mega Cluster (100000 GPUs)	DGX H100	12500	100000	IB NDR
Production Cluster (2048 GPUs)	DGX H100	256	2048	IB HDR
Reference H100 Cluster (25000 GPUs)	DGX H100	3125	25000	IB NDR
Research Cluster (256 GPUs)	DGX H100	32	256	100GbE
Training Cluster (10000 GPUs)	DGX H100	1250	10000	IB NDR
Training Cluster (1024 GPUs)	DGX H100	128	1024	IB HDR
Training Cluster (1024 A100 GPUs)	DGX A100	128	1024	IB HDR
Training Cluster (512 H100 GPUs)	DGX H100	64	512	IB HDR

How to Read the Fleet Zoo

Why Fleet Size Matters

Communication overhead dominates distributed training performance in large-scale systems such as DistBelief [1]. As you add nodes, each all-reduce synchronization step must move gradient data across the fabric. The DistributedModel captures this trade-off with the ring all-reduce cost model from Patarasuk and Yuan’s bandwidth-optimal all-reduce analysis [2]:

\[T_{\text{dp}} = 2(N-1) \cdot \left(\frac{M/N}{BW} + L\right)\]

where \(N\) is the GPU count, \(M\) is the model size in bytes, \(BW\) is the fabric bandwidth, and \(L\) is the latency of each communication step.
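The cost model translates directly into code. The sketch below is a plain transcription of the formula and not the DistributedModel's implementation. The function name and the choice of units (bytes, bytes per second, seconds) are assumptions made for illustration.

```python
def ring_allreduce_time(n, model_bytes, bw_bytes_per_s, latency_s):
    """T_dp = 2(N-1) * ((M/N)/BW + L), in seconds.

    A ring all-reduce takes 2(N-1) steps; each step moves one M/N-sized
    chunk over the fabric and pays the per-step latency once.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    per_step = (model_bytes / n) / bw_bytes_per_s + latency_s
    return 2 * (n - 1) * per_step


# Small hand-checkable case: 2 * 3 * (100/100 + 0.5) = 9.0
print(ring_allreduce_time(4, 400, 100, 0.5))  # 9.0
```

Expanding the formula shows where the trade-off comes from: the bandwidth term, 2(N-1)/N · M/BW, approaches 2M/BW as N grows, while the latency term, 2(N-1) · L, keeps growing linearly with N.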

Rack Profiles Are System Facts

Rack profiles are composed system facts, not one-off arithmetic. A rack table may explain the component breakdown, but the aggregate rack profile stays in Systems.Racks.* so every chapter uses the same node count, accelerator count, support power, and total power.

The Fabric Matters More Than You Think

Compare two clusters from the table above that share DGX H100 nodes but differ in fabric: the Research Cluster (256 GPUs) uses 100GbE, while the Frontier Cluster (8192 GPUs), among others, uses InfiniBand NDR. The 20x bandwidth difference dramatically changes scaling efficiency. Try both in the Scaling to 1000 GPUs tutorial to see the effect.
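To get a feel for the effect, the ring all-reduce formula from this page can be evaluated at two bandwidths that differ by the 20x ratio quoted above. Everything else in this sketch is an illustrative assumption: the 14 GB gradient payload, the 5 µs per-step latency, and the 12.5 GB/s baseline (the nominal line rate of a single 100 Gb/s link) are placeholders, not values taken from the Fleet Zoo.

```python
def ring_allreduce_time(n, model_bytes, bw_bytes_per_s, latency_s):
    """T_dp = 2(N-1) * ((M/N)/BW + L), in seconds."""
    return 2 * (n - 1) * ((model_bytes / n) / bw_bytes_per_s + latency_s)


N_GPUS = 256             # same GPU count for both fabrics
MODEL_BYTES = 14e9       # hypothetical payload, not from the source
LATENCY_S = 5e-6         # hypothetical per-step latency
SLOW_BW = 12.5e9         # 100 Gb/s expressed in bytes per second
FAST_BW = 20 * SLOW_BW   # applies the 20x ratio quoted in the text

t_slow = ring_allreduce_time(N_GPUS, MODEL_BYTES, SLOW_BW, LATENCY_S)
t_fast = ring_allreduce_time(N_GPUS, MODEL_BYTES, FAST_BW, LATENCY_S)
speedup = t_slow / t_fast

print(f"slow fabric: {t_slow:.3f} s per all-reduce")
print(f"fast fabric: {t_fast:.3f} s per all-reduce")
print(f"speedup:     {speedup:.1f}x")
```

Under these assumptions the speedup lands slightly below 20x, because the latency term does not shrink when bandwidth increases.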

Deployment Tiers

The deployment tiers (Cloud, Edge, Mobile, Tiny) define the resource envelope for different deployment scenarios. Each tier specifies a RAM budget, storage capacity, and latency target; together, these constraints determine which models are feasible.
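A feasibility check against a tier amounts to three comparisons. The sketch below is hypothetical throughout: the `Tier` and `ModelFootprint` classes are not the library's types, and the numbers for the example tier are placeholders chosen for illustration, not the Fleet Zoo's actual budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Resource envelope of a deployment tier (hypothetical structure)."""
    name: str
    ram_bytes: int
    storage_bytes: int
    latency_target_s: float


@dataclass(frozen=True)
class ModelFootprint:
    """What a model needs at inference time (hypothetical structure)."""
    peak_ram_bytes: int
    weights_bytes: int
    latency_s: float


def is_feasible(model: ModelFootprint, tier: Tier) -> bool:
    """A model fits a tier only if it satisfies all three constraints."""
    return (
        model.peak_ram_bytes <= tier.ram_bytes
        and model.weights_bytes <= tier.storage_bytes
        and model.latency_s <= tier.latency_target_s
    )


# Placeholder envelope for illustration only; not the real Tiny tier values.
example_tier = Tier("Tiny (placeholder)", 256 * 1024, 1024 * 1024, 0.1)
small_model = ModelFootprint(100 * 1024, 500 * 1024, 0.05)
large_model = ModelFootprint(4 * 1024 * 1024, 500 * 1024, 0.05)

print(is_feasible(small_model, example_tier))  # True
print(is_feasible(large_model, example_tier))  # False
```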

Textbook Connection

The Distributed Training and Collective Communication chapters use fleet configurations to analyze scaling efficiency and communication bottlenecks. The Compute Infrastructure chapter covers the hardware composition of these clusters.

Note: For cluster MTBF and bisection bandwidth details, see the Fleet API Reference.

References

[1] J. Dean, G. S. Corrado, R. Monga, et al., “Large scale distributed deep networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.

[2] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” Journal of Parallel and Distributed Computing, vol. 69, no. 2, pp. 117–124, 2009, doi: 10.1016/j.jpdc.2008.09.002.