The Fleet Zoo
Vetted System Archetypes and Multi-Node Clusters
The Fleet Zoo defines the Structural Context of ML systems—from single microcontrollers to warehouse-scale supercomputers. Fleets combine hardware nodes, a network fabric, and a node count to form a complete system that the DistributedModel can analyze.
A Fleet = Node × Count + Fabric. The Node specifies which accelerator is used and how many sit in each server box. The Fabric specifies how nodes talk to each other (NVLink for intra-node, InfiniBand for inter-node). Use the 3D Parallelism tutorial to see how these parameters affect training time.
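The Node × Count + Fabric composition can be sketched as a small data model. This is a minimal illustration, not the library's actual API: the `Node` and `Fleet` class names, field names, and the 100 GbE bandwidth figure are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Node:
    accelerator: str            # e.g. "NVIDIA H100"
    accel_count: int            # accelerators per server box
    intra_node_bw_gbs: float    # NVLink bandwidth, GB/s

@dataclass
class Fleet:
    node: Node
    node_count: int
    fabric_bw_gbs: float        # inter-node fabric bandwidth, GB/s

    @property
    def total_accelerators(self) -> int:
        # Fleet = Node x Count: total GPUs in the cluster
        return self.node.accel_count * self.node_count

# Research-Cluster-style fleet: 32 DGX H100 boxes on 100 GbE (~12.5 GB/s)
dgx_h100 = Node("NVIDIA H100", 8, 900.0)
cluster = Fleet(dgx_h100, 32, 12.5)
print(cluster.total_accelerators)  # → 256
```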
Deployment Tiers
| Tier | RAM | Storage | Latency Budget |
|---|---|---|---|
| Cloud | 512 GB | 10 TB | 200 ms |
| Edge | 32 GB | 1 TB | 50 ms |
| Mobile | 8 GB | 256 GB | 30 ms |
| TinyML | 512 KiB | 4 MB | 100 ms |
Reference Nodes
| Node Name | Accelerator | Count | Intra-Node BW |
|---|---|---|---|
| DGX A100 | NVIDIA A100 | 8 | 600.0 GB/s |
| DGX B200 | NVIDIA B200 | 8 | 1800.0 GB/s |
| DGX H100 | NVIDIA H100 | 8 | 900.0 GB/s |
Production Clusters
| Cluster Name | Node Type | Node Count | Total GPUs | Fabric |
|---|---|---|---|---|
| Frontier Cluster (8192 GPUs) | DGX H100 | 1024 | 8192 | IB NDR |
| Mega Cluster (100000 GPUs) | DGX H100 | 12500 | 100000 | IB NDR |
| Production Cluster (2048 GPUs) | DGX H100 | 256 | 2048 | IB HDR |
| Research Cluster (256 GPUs) | DGX H100 | 32 | 256 | 100GbE |
How to Read the Fleet Zoo
Why Fleet Size Matters
Distributed training performance is dominated by communication overhead. As you add more nodes, each all-reduce synchronization step must transfer gradient data across the fabric. The DistributedModel models this trade-off using the ring all-reduce formula [1]:
\[T_{\text{dp}} = 2(N-1) \cdot \left(\frac{M/N}{BW} + L\right)\]
where \(N\) is the GPU count, \(M\) is the model size in bytes, \(BW\) is the fabric bandwidth, and \(L\) is the per-step communication latency.
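The formula is straightforward to evaluate directly. A quick sketch, using illustrative numbers (8 GPUs, 10 GB of gradients, a 600 GB/s fabric, 2 µs latency) rather than any specific cluster from the tables:

```python
def ring_allreduce_time(n_gpus: int, model_bytes: float,
                        bw_bytes_per_s: float, latency_s: float) -> float:
    """T_dp = 2(N-1) * (M/N / BW + L) — ring all-reduce cost model."""
    return 2 * (n_gpus - 1) * (model_bytes / n_gpus / bw_bytes_per_s + latency_s)

# 8 GPUs syncing 10 GB of gradients over a 600 GB/s fabric, 2 us latency
t = ring_allreduce_time(8, 10e9, 600e9, 2e-6)
print(f"{t * 1e3:.1f} ms")  # → 29.2 ms
```

Note that the bandwidth term \(M/N\) shrinks as \(N\) grows, but the \(2(N-1)\) factor grows, so the latency term eventually dominates at scale.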
The Fabric Matters More Than You Think
Compare two of the clusters above: the Research Cluster uses 100 GbE while the Frontier Cluster uses InfiniBand NDR, even though both are built from DGX H100 nodes. That bandwidth gap dramatically changes scaling efficiency. Try both in the Scaling to 1000 GPUs tutorial to see the effect.
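Plugging both fabrics into the ring all-reduce formula makes the gap concrete. The bandwidth figures here are assumptions for illustration (100 GbE ≈ 12.5 GB/s, IB NDR ≈ 50 GB/s per link), as are the 10 GB gradient size and 2 µs latency:

```python
def allreduce_time(n: int, model_bytes: float, bw: float,
                   latency: float = 2e-6) -> float:
    """Ring all-reduce cost model: T = 2(N-1)(M/N/BW + L)."""
    return 2 * (n - 1) * (model_bytes / n / bw + latency)

M, N = 10e9, 256                       # 10 GB of gradients, 256 GPUs
t_eth = allreduce_time(N, M, 12.5e9)   # assumed 100 GbE bandwidth
t_ndr = allreduce_time(N, M, 50e9)     # assumed IB NDR link bandwidth
print(f"100GbE: {t_eth:.2f} s   IB NDR: {t_ndr:.2f} s")
```

Under these assumptions the Ethernet fabric spends roughly 4× longer per synchronization step, time that is pure overhead repeated every training iteration.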
Deployment Tiers
The deployment tiers (Cloud, Edge, Mobile, TinyML) define the resource envelope for different deployment scenarios. Each tier specifies a RAM budget, storage capacity, and latency target — constraints that determine which models are feasible.
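A tier check reduces to comparing a model's footprint against the RAM budgets in the table above. A minimal sketch — the `feasible_tiers` helper is hypothetical, and it checks only the RAM constraint, not storage or latency:

```python
# RAM budgets in bytes, taken from the Deployment Tiers table
TIERS = {
    "Cloud":  512 * 2**30,   # 512 GB
    "Edge":   32 * 2**30,    # 32 GB
    "Mobile": 8 * 2**30,     # 8 GB
    "TinyML": 512 * 2**10,   # 512 KiB
}

def feasible_tiers(model_bytes: int) -> list[str]:
    """Tiers whose RAM budget can hold the model weights."""
    return [tier for tier, ram in TIERS.items() if model_bytes <= ram]

print(feasible_tiers(3 * 2**30))  # a 3 GB model → ['Cloud', 'Edge', 'Mobile']
```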
Textbook Connection
The Distributed Training and Collective Communication chapters use fleet configurations to analyze scaling efficiency and communication bottlenecks. The Compute Infrastructure chapter covers the hardware composition of these clusters.
Note: For cluster MTBF and bisection bandwidth details, see the Fleet API Reference.