The Fleet Zoo

Vetted System Archetypes and Multi-Node Clusters

The Fleet Zoo defines the Structural Context of ML systems, from single microcontrollers to warehouse-scale supercomputers. A fleet combines a hardware node, a network fabric, and a node count to form a complete system that the DistributedModel can analyze.

Note: Understanding System Hierarchy

A Fleet = Node × Count + Fabric. The Node specifies which accelerator and how many per server box. The Fabric specifies how nodes talk to each other (NVLink for intra-node, InfiniBand for inter-node). Use the 3D Parallelism tutorial to see how these parameters affect training time.
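As a rough sketch, the Node × Count + Fabric composition can be expressed in a few lines of Python (the class and field names here are illustrative, not the actual Fleet Zoo API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    accelerator: str
    accels_per_node: int   # e.g. 8 GPUs per DGX box
    intra_bw_gbs: float    # intra-node bandwidth in GB/s (e.g. NVLink)

@dataclass(frozen=True)
class Fleet:
    node: Node
    node_count: int
    fabric: str            # inter-node fabric, e.g. "IB NDR" or "100GbE"

    @property
    def total_gpus(self) -> int:
        # Fleet = Node x Count: total accelerators across the cluster
        return self.node.accels_per_node * self.node_count

# Example: the Frontier Cluster row from the tables below
dgx_h100 = Node("DGX H100", "NVIDIA H100", 8, 900.0)
frontier = Fleet(dgx_h100, 1024, "IB NDR")
print(frontier.total_gpus)   # -> 8192
```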

Deployment Tiers

| Tier   | RAM     | Storage | Latency Budget |
|--------|---------|---------|----------------|
| Cloud  | 512 GB  | 10 TB   | 200 ms         |
| Edge   | 32 GB   | 1 TB    | 50 ms          |
| Mobile | 8 GB    | 256 GB  | 30 ms          |
| TinyML | 512 KiB | 4 MB    | 100 ms         |
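One natural use of these budgets is as a feasibility filter over candidate models. A minimal sketch, with RAM figures taken from the table above and illustrative function names (a real check would also account for activations and runtime overhead):

```python
# RAM budgets from the deployment-tier table, in bytes
TIER_RAM_BYTES = {
    "Cloud": 512 * 1024**3,
    "Edge": 32 * 1024**3,
    "Mobile": 8 * 1024**3,
    "TinyML": 512 * 1024,   # 512 KiB
}

def feasible_tiers(model_bytes: int) -> list[str]:
    """Return the tiers whose RAM budget can hold the model weights."""
    return [t for t, ram in TIER_RAM_BYTES.items() if model_bytes <= ram]

# A 100 MiB model fits every tier except TinyML
print(feasible_tiers(100 * 1024**2))   # -> ['Cloud', 'Edge', 'Mobile']
```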

Reference Nodes

| Node Name | Accelerator | Count | Intra-Node BW |
|-----------|-------------|-------|---------------|
| DGX A100  | NVIDIA A100 | 8     | 600 GB/s      |
| DGX B200  | NVIDIA B200 | 8     | 1800 GB/s     |
| DGX H100  | NVIDIA H100 | 8     | 900 GB/s      |

Production Clusters

| Cluster Name                   | Node Type | Node Count | Total GPUs | Fabric |
|--------------------------------|-----------|------------|------------|--------|
| Frontier Cluster (8192 GPUs)   | DGX H100  | 1024       | 8192       | IB NDR |
| Mega Cluster (100000 GPUs)     | DGX H100  | 12500      | 100000     | IB NDR |
| Production Cluster (2048 GPUs) | DGX H100  | 256        | 2048       | IB HDR |
| Research Cluster (256 GPUs)    | DGX H100  | 32         | 256        | 100GbE |

How to Read the Fleet Zoo

Why Fleet Size Matters

Distributed training performance is dominated by communication overhead. As you add more nodes, each all-reduce synchronization step must transfer gradient data across the fabric. The DistributedModel models this trade-off using the ring all-reduce formula [1]:

\[T_{\text{dp}} = 2(N-1) \cdot \left(\frac{M/N}{BW} + L\right)\]

where \(N\) is the GPU count, \(M\) is the model (gradient) size in bytes, \(BW\) is the fabric bandwidth in bytes per second, and \(L\) is the per-step communication latency.
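The formula is simple enough to evaluate directly. A sketch, where the example's gradient size, bandwidth, and latency figures are illustrative assumptions rather than Fleet Zoo values:

```python
def ring_allreduce_time(n_gpus: int, model_bytes: float,
                        bw_bytes_per_s: float, latency_s: float) -> float:
    """T_dp = 2(N-1) * (M/N / BW + L), the ring all-reduce estimate above."""
    return 2 * (n_gpus - 1) * (model_bytes / n_gpus / bw_bytes_per_s + latency_s)

# Assumed example: fp16 gradients of a 7B-parameter model (14 GB),
# 8 GPUs sharing a 600 GB/s link with 5 microseconds of per-step latency.
t = ring_allreduce_time(8, 14e9, 600e9, 5e-6)
print(f"{t:.4f} s")   # ~0.041 s per all-reduce
```

Note the two regimes the formula captures: for large messages the bandwidth term \(M/N \cdot 1/BW\) dominates, while at large \(N\) the accumulated latency term \(2(N-1)L\) takes over.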

The Fabric Matters More Than You Think

Compare the Research Cluster and the Frontier Cluster above: both use DGX H100 nodes, but the Research Cluster's fabric is 100 GbE while Frontier's is InfiniBand NDR. The roughly 20x difference in effective fabric bandwidth dramatically changes scaling efficiency. Try both in the Scaling to 1000 GPUs tutorial to see the effect.
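Plugging the ring all-reduce formula into the two fabrics makes the gap concrete. This sketch uses assumed effective bandwidths consistent with the ~20x ratio stated above (12.5 GB/s for 100 GbE, 250 GB/s for IB NDR; actual per-node figures depend on how many NICs each node aggregates):

```python
def ring_allreduce_time(n_gpus, model_bytes, bw_bytes_per_s, latency_s):
    # T_dp = 2(N-1) * (M/N / BW + L)
    return 2 * (n_gpus - 1) * (model_bytes / n_gpus / bw_bytes_per_s + latency_s)

N, M, L = 256, 14e9, 5e-6                      # 256 GPUs, 14 GB of gradients (assumed)
t_eth = ring_allreduce_time(N, M, 12.5e9, L)   # 100 GbE fabric
t_ndr = ring_allreduce_time(N, M, 250e9, L)    # IB NDR fabric (assumed effective BW)
print(f"100GbE: {t_eth:.3f} s, IB NDR: {t_ndr:.3f} s")
```

In the bandwidth-bound regime the per-step time scales almost inversely with fabric bandwidth, which is why fabric choice dominates scaling efficiency at this cluster size.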

Deployment Tiers

The deployment tiers (Cloud, Edge, Mobile, TinyML) define the resource envelope for different deployment scenarios. Each tier specifies a RAM budget, storage capacity, and latency target — constraints that determine which models are feasible.

Textbook Connection

The Distributed Training and Collective Communication chapters use fleet configurations to analyze scaling efficiency and communication bottlenecks. The Compute Infrastructure chapter covers the hardware composition of these clusters.


Note: For cluster MTBF and bisection bandwidth details, see the Fleet API Reference.


References

[1]
J. Dean, G. S. Corrado, R. Monga, et al., “Large scale distributed deep networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.