The Fleet Zoo
Vetted Nodes, Racks, Fabrics, and Multi-Node Clusters
The Fleet Zoo defines the Structural Context of ML systems—from single microcontrollers to warehouse-scale supercomputers. Systems entries compose hardware nodes, rack profiles, network fabrics, and cluster counts into complete structures that the DistributedModel, sustainability models, and book LEGO cells can analyze.
A Fleet = Node × Count + Fabric. The Node specifies which accelerator and how many per server box. A RackProfile specifies a physical rack composition and power envelope. The Fabric specifies how nodes talk to each other (NVLink for intra-node, InfiniBand for inter-node). Use the 3D Parallelism tutorial to see how these parameters affect training time.
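The composition rule above can be sketched as a few small dataclasses. This is a minimal illustration, not the library's actual API: the class names `Node`, `Fabric`, and `Fleet` and their fields are hypothetical, and the values are copied from the reference tables on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One server box (hypothetical stand-in, not the library class)."""
    name: str
    accelerator: str
    accelerators_per_node: int
    intra_node_bw_gb_s: float  # intra-node bandwidth in GB/s


@dataclass(frozen=True)
class Fabric:
    """Inter-node network (hypothetical stand-in)."""
    name: str


@dataclass(frozen=True)
class Fleet:
    """Fleet = Node x Count + Fabric."""
    name: str
    node: Node
    count: int
    fabric: Fabric

    @property
    def total_accelerators(self) -> int:
        # Derived from the composition rather than stored separately.
        return self.node.accelerators_per_node * self.count


# Values taken from the Reference Nodes and Production Clusters tables.
dgx_h100 = Node("DGX H100", "NVIDIA H100", 8, 450.0)
lab = Fleet("Lab Cluster (64 H100 GPUs)", dgx_h100, 8, Fabric("IB HDR"))
print(lab.total_accelerators)  # 64
```

Deriving the accelerator total from the node and count keeps the "Total GPUs" column consistent with the node definition by construction.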
Reference Nodes
| Node Name | Accelerator | Count | Host Memory | Intra-Node BW |
|---|---|---|---|---|
| DGX A100 | NVIDIA A100 | 8 | — | 300.0 GB/s |
| DGX B200 | NVIDIA B200 | 8 | — | 900.0 GB/s |
| DGX H100 | NVIDIA H100 | 8 | 2 TB | 450.0 GB/s |
| Kempner H100 4-GPU Server | NVIDIA H100 | 4 | — | 450.0 GB/s |
Reference Racks
| Rack Name | Node Type | Nodes | Accelerators | Accelerator Power | Support Power | Total Power |
|---|---|---|---|---|---|---|
| DGX H100 4-node rack | DGX H100 | 4 | 32 | 22.4 kW | 11.1 kW | 33.5 kW |
Production Clusters
| Cluster Name | Node Type | Node Count | Total GPUs | Fabric |
|---|---|---|---|---|
| Frontier Cluster (8192 GPUs) | DGX H100 | 1024 | 8192 | IB NDR |
| Kempner AI Cluster H100 Partition (384 H100 GPUs) | Kempner H100 4-GPU Server | 96 | 384 | IB NDR |
| Lab Cluster (64 H100 GPUs) | DGX H100 | 8 | 64 | IB HDR |
| Mega Cluster (100000 GPUs) | DGX H100 | 12500 | 100000 | IB NDR |
| Production Cluster (2048 GPUs) | DGX H100 | 256 | 2048 | IB HDR |
| Reference H100 Cluster (25000 GPUs) | DGX H100 | 3125 | 25000 | IB NDR |
| Research Cluster (256 GPUs) | DGX H100 | 32 | 256 | 100GbE |
| Training Cluster (10000 GPUs) | DGX H100 | 1250 | 10000 | IB NDR |
| Training Cluster (1024 GPUs) | DGX H100 | 128 | 1024 | IB HDR |
| Training Cluster (1024 A100 GPUs) | DGX A100 | 128 | 1024 | IB HDR |
| Training Cluster (512 H100 GPUs) | DGX H100 | 64 | 512 | IB HDR |
How to Read the Fleet Zoo
Why Fleet Size Matters
In large-scale systems such as DistBelief [1], distributed training performance is dominated by communication overhead. As you add more nodes, each all-reduce synchronization step must transfer gradient data across the fabric. The DistributedModel captures this trade-off with the ring all-reduce cost model from Patarasuk and Yuan’s bandwidth-optimal all-reduce analysis [2]:
\[T_{\text{dp}} = 2(N-1) \cdot \left(\frac{M/N}{BW} + L\right)\]
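where \(N\) is the GPU count, \(M\) is the model size in bytes, \(BW\) is the fabric bandwidth in bytes per second, and \(L\) is the per-step latency in seconds. The factor \(2(N-1)\) counts the ring steps: \(N-1\) for the reduce-scatter phase and \(N-1\) for the all-gather phase, each moving a chunk of \(M/N\) bytes.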
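The formula translates directly into a short function. The sketch below is a standalone illustration, not the DistributedModel implementation; the function name `ring_allreduce_time` and the example inputs are assumptions chosen to make the arithmetic easy to check.

```python
def ring_allreduce_time(
    n: int,
    message_bytes: float,
    bw_bytes_per_s: float,
    latency_s: float = 0.0,
) -> float:
    """Ring all-reduce cost: T_dp = 2(N-1) * ((M/N)/BW + L)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if bw_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    steps = 2 * (n - 1)               # reduce-scatter plus all-gather steps
    chunk_bytes = message_bytes / n   # each step moves one chunk of M/N bytes
    return steps * (chunk_bytes / bw_bytes_per_s + latency_s)


# Illustrative inputs (not from the document): 4 GPUs, 4 GB, 1 GB/s, no latency.
print(ring_allreduce_time(4, 4e9, 1e9))  # 6.0
```

With a single participant the function returns zero, since there is nothing to synchronize.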
Rack Profiles Are System Facts
Rack profiles are composed system facts, not one-off arithmetic. A rack table may explain the component breakdown, but the aggregate rack profile stays in Systems.Racks.* so every chapter uses the same node count, accelerator count, support power, and total power.
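One way to picture a composed rack fact is an immutable record whose aggregates are derived from its components. The sketch below is hypothetical: the `RackProfile` fields and the constant name are illustrative stand-ins for whatever lives under Systems.Racks.*, and the numbers come from the Reference Racks table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackProfile:
    """Immutable rack description (hypothetical field names)."""
    name: str
    node_type: str
    nodes: int
    accelerators_per_node: int
    accelerator_power_w: float  # all accelerators in the rack, in watts
    support_power_w: float      # everything else in the rack, in watts

    @property
    def accelerators(self) -> int:
        return self.nodes * self.accelerators_per_node

    @property
    def total_power_w(self) -> float:
        return self.accelerator_power_w + self.support_power_w


# Defined once and reused everywhere, instead of repeating the arithmetic.
DGX_H100_4_NODE_RACK = RackProfile(
    name="DGX H100 4-node rack",
    node_type="DGX H100",
    nodes=4,
    accelerators_per_node=8,
    accelerator_power_w=22_400.0,
    support_power_w=11_100.0,
)

print(DGX_H100_4_NODE_RACK.accelerators)   # 32
print(DGX_H100_4_NODE_RACK.total_power_w)  # 33500.0
```

Because the record is frozen, a chapter that needs a different rack has to define a new profile rather than mutate the shared one.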
The Fabric Matters More Than You Think
Compare the Research Cluster (256 GPUs) with any of the InfiniBand NDR clusters in the table above: both use DGX H100 nodes, but the former runs on 100GbE while the latter runs on InfiniBand NDR. The 20x bandwidth difference dramatically changes scaling efficiency. Try both in the Scaling to 1000 GPUs tutorial to see the effect.
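To get a feel for how the fabric interacts with scale, the ring all-reduce cost model from this page can be evaluated for a slow and a fast fabric. This is a rough sketch under stated assumptions: the slow bandwidth is the 100 Gb/s Ethernet line rate (12.5 GB/s), the fast bandwidth simply applies the 20x ratio quoted above rather than a measured figure, and the payload size and per-step latency are illustrative. It also treats every GPU as a ring participant on a single fabric, ignoring intra-node links.

```python
def ring_allreduce_seconds(n, message_bytes, bw_bytes_per_s, latency_s):
    """Ring all-reduce cost: 2(N-1) * ((M/N)/BW + L)."""
    return 2 * (n - 1) * ((message_bytes / n) / bw_bytes_per_s + latency_s)


SLOW_BW = 12.5e9        # 100 Gb/s Ethernet line rate, in bytes per second
FAST_BW = 20 * SLOW_BW  # assumption: the 20x ratio quoted in the text
MESSAGE_BYTES = 10e9    # assumption: illustrative 10 GB gradient payload
LATENCY_S = 5e-6        # assumption: illustrative per-step latency


def fabric_speedup(n):
    """How much faster the all-reduce finishes on the fast fabric."""
    slow = ring_allreduce_seconds(n, MESSAGE_BYTES, SLOW_BW, LATENCY_S)
    fast = ring_allreduce_seconds(n, MESSAGE_BYTES, FAST_BW, LATENCY_S)
    return slow / fast


# GPU counts borrowed from the Production Clusters table.
for n in (64, 256, 1024, 8192):
    print(f"N={n:>5}: all-reduce speedup from faster fabric = {fabric_speedup(n):.2f}x")
```

Under these assumptions the speedup stays below 20x and shrinks as the GPU count grows, because the per-step latency term does not benefit from the extra bandwidth.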
Deployment Tiers
The deployment tiers (Cloud, Edge, Mobile, Tiny) define the resource envelope for different deployment scenarios. Each tier specifies a RAM budget, storage capacity, and latency target — constraints that determine which models are feasible.
Textbook Connection
The Distributed Training and Collective Communication chapters use fleet configurations to analyze scaling efficiency and communication bottlenecks. The Compute Infrastructure chapter covers the hardware composition of these clusters.
Note: For cluster MTBF and bisection bandwidth details, see the Fleet API Reference.