Why Your 1000-GPU Cluster Might Be Slow: Peak FLOPS vs. Delivered Throughput

A buddy of mine was recently weeping into his overpriced craft beer because his team’s shiny new 1,000-GPU cluster was running at a snail’s pace. They spent millions to make Jensen Huang even richer, expecting instant AGI, and instead got a multimillion-dollar space heater and a front-row seat to cluster-wide deadlock.

In the AI infrastructure hype-cycle, we worship big numbers. We read marketing spec sheets promising thousands of TeraFLOPS and immediately assume our training runs will take minutes. We’re wrong.

Peak FLOPS is a theoretical fantasy. Delivered throughput is the cold, hard reality.

If you build your cluster on a house of cards, your expensive GPUs will spend most of their time waiting around. Let’s skip the marketing slides and look at why your multimillion-dollar setup is running like a 2012 Raspberry Pi.

Clearing Up Terminology: FLOP vs. FLOPS

Before we dive in, let’s settle some terminology so you can annoy your teammates on Slack:

FLOP (Floating Point Operation): A single math operation. Adding (a + b) or multiplying (a * b) two floating-point numbers.
FLOPS (Floating Point Operations Per Second): The speed of those operations.
The ‘s’ Confusion: In ML papers, “FLOPs” (lowercase ‘s’) usually means the static count of operations—like, “training this model takes $10^{23}$ FLOPs.” In systems engineering, “FLOPS” (uppercase ‘S’) is how fast we’re actually burning through them. If you mix these up, some compiler engineer who hasn’t shipped a line of production code since 2019 will definitely correct you.
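To make the count-versus-rate distinction concrete, here is a back-of-the-envelope sketch: divide a FLOPs budget (a count) by a sustained FLOPS rate (a speed) and you get wall-clock time. The cluster numbers are illustrative assumptions (1,000 GPUs at a round 1 PFLOPS peak each, 40% of it actually delivered), not measurements.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_flops: float, delivered_flops_per_s: float) -> float:
    """Wall-clock days needed to execute `total_flops` (a count)
    at a sustained rate of `delivered_flops_per_s` (a speed)."""
    return total_flops / delivered_flops_per_s / SECONDS_PER_DAY

# Assumed, illustrative cluster: 1,000 GPUs at a round 1e15 FLOPS peak each.
peak_cluster_flops = 1_000 * 1e15
budget = 1e23  # the "10^23 FLOPs" training budget from the text

print(f"at 100% of peak: {training_days(budget, peak_cluster_flops):.2f} days")
print(f"at  40% of peak: {training_days(budget, 0.40 * peak_cluster_flops):.2f} days")
```

Same FLOPs, different FLOPS, very different calendar.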

Peak FLOPS: Horsepower in a Vacuum

Peak FLOPS is the number Nvidia’s sales team uses to convince your VP of Engineering to sign a cloud lock-in contract. It’s the maximum speed a chip can run if memory latency doesn’t exist, networks never fail, and code actually works.

Take a look at the spec sheet for an NVIDIA H100 SXM5. The peak numbers you get depend entirely on the precision you choose to run:

FP64 (Double Precision): 34 TFLOPS (or 67 TFLOPS using Tensor Cores)
FP32 (Single Precision): 67 TFLOPS
BF16/FP16 (Half Precision): 989 TFLOPS dense on Tensor Cores (or 1,979 TFLOPS with structured sparsity enabled)
FP8 (Quarter Precision): 1,979 TFLOPS dense on Tensor Cores (or 3,958 TFLOPS with structured sparsity enabled)

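The gap is huge because lower-precision math needs far less physical silicon, and because the low-precision formats run on dedicated Tensor Cores. Multiplier area grows roughly with the square of the mantissa width, so an FP8 multiplier takes up a small fraction of the physical space and power of an FP64 multiplier.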

If you’re simulating weather patterns, you need FP64 or your rounding errors will cause a simulated category 5 hurricane in Nebraska. But for AI? Neural networks are dumb and resilient; they don’t care about a little noise. We train models in mixed-precision BF16/FP16 and run inference in FP8 or INT8, unlocking the fastest hardware paths.
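If you want to see the rounding problem with your own eyes, here is a small sketch that simulates keeping a running sum in FP16 using the standard library’s half-precision `struct` format. The helper names are made up for illustration; real mixed-precision training sidesteps this exact failure by accumulating in FP32.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float (FP64) to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def accumulate(step: float, n: int, fp16: bool) -> float:
    """Add `step` to a running total `n` times, optionally rounding the
    total to FP16 after every addition."""
    total = 0.0
    for _ in range(n):
        total = total + step
        if fp16:
            total = to_fp16(total)  # simulate storing the sum in FP16
    return total

fp64_sum = accumulate(0.001, 10_000, fp16=False)
fp16_sum = accumulate(to_fp16(0.001), 10_000, fp16=True)

print(f"FP64 running sum: {fp64_sum}")
print(f"FP16 running sum: {fp16_sum}")
```

The FP16 total stalls well short of 10: once the step is smaller than half the gap between neighboring representable values, every addition rounds straight back to where it started. A weather simulation can’t live with that; a gradient update mostly can, as long as the accumulator is wider.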

The Roofline Model: Why Memory Starves Compute

If your GPUs are sitting at 15% utilization while your cloud bill screams past six figures, the Roofline Model explains why. It bounds attainable performance by Arithmetic Intensity: how many math operations (FLOPs) you execute for every single byte of memory you fetch from HBM.

The model splits your workload into two depressing realities:

Memory-Bandwidth Bound (The Slope): If you’re doing minimal math per byte loaded—like vector addition, layer normalization, or LLM token decoding—your Tensor Cores are sitting idle. You are entirely bottlenecked by HBM bandwidth.
Compute-Bound (The Flat Peak): If you’re crunching heavy matrix math (dense GEMM operations in LLM training), you’re actually keeping the GPU busy.
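The two regimes fall out of a one-line formula: attainable FLOPS = min(peak FLOPS, bandwidth × arithmetic intensity). Here is a minimal sketch of it. The hardware constants are assumptions taken from public H100 SXM figures (989 TFLOPS dense BF16, 3.35 TB/s HBM3), and the GEMM intensity is idealized (each matrix crosses HBM exactly once), so treat the output as an upper bound, not a prediction.

```python
def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, mem_bw * intensity)

PEAK = 989e12     # assumed: dense BF16 Tensor Core peak, FLOP/s
HBM_BW = 3.35e12  # assumed: HBM3 bandwidth, bytes/s

# Ridge point: FLOPs per byte needed before compute becomes the limit.
ridge = PEAK / HBM_BW

# FP32 vector add: 1 FLOP per 12 bytes (two 4-byte reads, one 4-byte write).
vec_add = 1 / 12

# Square BF16 GEMM (n x n): 2*n^3 FLOPs over three n^2 matrices of 2-byte values.
n = 8192
gemm = (2 * n**3) / (3 * n * n * 2)

print(f"ridge point: {ridge:.0f} FLOP/byte")
for name, ai in [("vector add", vec_add), ("GEMM n=8192", gemm)]:
    frac = attainable_flops(ai, PEAK, HBM_BW) / PEAK
    print(f"{name}: {ai:.2f} FLOP/byte -> at most {frac:.2%} of peak")
```

Anything left of the ridge point is on the slope and starving; anything right of it at least has a chance of hitting the flat peak.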

But when you scale to 1,000 GPUs, the bottlenecks shift from local memory to your network fabric and storage layer, which is where things get truly expensive.

Interconnect Fabric: The Multi-GPU Tax

You can’t train a frontier-scale LLM on one GPU, no matter how much you optimize. So you split your model across a cluster using tensor, pipeline, or data parallelism.

But at least once per training step, all those GPUs have to synchronize using collective operations like AllReduce or ReduceScatter (typically handled by NVIDIA’s NCCL library).

When NCCL fires and the communication isn’t overlapped with compute, the computing stops. If you built your cluster using standard Ethernet because you thought you could save money on switches: prepare to cry.

The Network Bound: An H100 node needs 800 Gbps to 3.2 Tbps of non-blocking, bidirectional bandwidth. You must use dedicated fabrics like InfiniBand (Quantum-2) or RoCE v2 (RDMA over Converged Ethernet).
Tail Latency & Packet Loss: Standard TCP/IP drops packets under load and then waits to retransmit them. In a 1,000-GPU cluster, everyone runs at the speed of the slowest link. If a single packet gets lost on a cheap switch port, every GPU in the collective sits idle until it is resent.
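To put a rough number on the multi-GPU tax, here is a sketch of the textbook bandwidth term for a ring AllReduce, where each rank moves 2·(N−1)/N of the payload over its link. It is a simplification, not what NCCL does internally: it ignores latency, tree and hierarchical algorithms, and any overlap with compute. The model size and link speed are hypothetical.

```python
def ring_allreduce_seconds(payload_bytes: float, n_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only cost of a ring AllReduce.

    Each rank sends and receives 2*(N-1)/N of the payload, and the ring
    advances at the pace of its slowest link.
    """
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / link_bytes_per_s

grad_bytes = 7e9 * 2  # hypothetical 7B-parameter model, BF16 gradients
link = 400e9 / 8      # assumed 400 Gbps per GPU = 50 GB/s

ideal = ring_allreduce_seconds(grad_bytes, 1_000, link)
# One link limping at a tenth of its speed gates the entire ring.
degraded = ring_allreduce_seconds(grad_bytes, 1_000, link / 10)

print(f"healthy fabric:    {ideal:.2f} s per AllReduce")
print(f"one degraded link: {degraded:.2f} s per AllReduce")
```

Multiply that difference by every step in a multi-week run and the cheap switches stop looking cheap.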

Storage IOPS and the Data Pipeline

If your dataloader is slow, you are paying Nvidia millions of dollars to run a luxury screensaver. The data pipeline must be flawless:

[ Parallel Storage Array ] 
           │ (Read raw dataset)
           ▼
[ Host CPU RAM / Dataloader ] (Decode, crop, tokenize, batch)
           │ (Host-to-device copy over PCIe; GPUDirect Storage can skip the host bounce buffer)
           ▼
[ GPU HBM Memory ] ──► [ GPU Tensor Cores ]
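The whole point of that pipeline is that the next batch is already waiting when the GPU finishes the current one. Here is a minimal, framework-free sketch of the idea using a bounded queue and a background thread. The `prefetch` helper and the `load_batches` stand-in are hypothetical; real dataloaders add worker processes, pinned memory, and proper error propagation.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream

def prefetch(batches: Iterable[T], depth: int = 4) -> Iterator[T]:
    """Yield items from `batches` while a background thread keeps up to
    `depth` items buffered ahead of the consumer."""
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            # Always signal completion so the consumer cannot hang.
            # Caveat: a loader exception ends the stream silently here.
            buf.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item

def load_batches(n: int) -> Iterator[list]:
    """Stand-in for the decode / crop / tokenize / batch stage."""
    for i in range(n):
        yield [i] * 4

for batch in prefetch(load_batches(3), depth=2):
    print(batch)  # a training step would consume the batch here
```

If the consumer ever blocks on `buf.get()`, the loader is the bottleneck and the GPU is the one waiting.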

Your dataset structure dictates how quickly your storage will fail:

Streaming Bandwidth (Sequential Reads): Packaging datasets into large sequential files (TFRecord or WebDataset) leverages raw streaming speed (GB/s).
Metadata & Random Reads (IOPS): If your loader pulls millions of individual raw files from a parallel filesystem (like Lustre or GPFS), get ready for metadata crashes. When a directory lookup takes a millisecond too long, the pipeline stalls, and the GPU starves.
Checkpointing (Write Bandwidth): Hardware fails constantly. To avoid losing weeks of training, you write checkpoints. Writing the full training state of a 70B parameter model (weights plus optimizer state) means dumping on the order of a terabyte to disk. If the training loop has to freeze while this writes, your throughput dies. This is why we use the IO500 suite to benchmark streaming I/O (IOR) and metadata stress (mdtest).
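The checkpoint arithmetic is worth doing before you size the storage tier. This sketch assumes a common mixed-precision Adam layout of 14 bytes per parameter (BF16 weights, FP32 master weights, two FP32 Adam moments) and a fully synchronous write; both are assumptions, and your framework may save more or less. Under them a 70B model lands near 1 TB, and saving extra state pushes it higher.

```python
def checkpoint_bytes(n_params: float, bytes_per_param: float = 14.0) -> float:
    """Checkpoint size under an assumed layout:
    2 (BF16 weights) + 4 (FP32 master weights) + 8 (two FP32 Adam moments)."""
    return n_params * bytes_per_param

def stall_seconds(size_bytes: float, write_bytes_per_s: float) -> float:
    """How long training freezes if the checkpoint is written synchronously."""
    return size_bytes / write_bytes_per_s

size = checkpoint_bytes(70e9)
print(f"checkpoint size: {size / 1e12:.2f} TB")

# Hypothetical aggregate write bandwidths for the storage tier.
for gb_per_s in (5, 50, 500):
    stall = stall_seconds(size, gb_per_s * 1e9)
    print(f"{gb_per_s:>4} GB/s -> {stall:7.1f} s stall per checkpoint")
```

Writing asynchronously, or sharding the write across ranks, shrinks the stall, but the bytes still have to land somewhere fast.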

And let’s not forget the corporate overhead. Between Lustre split-brains corrupting your files, Kubernetes adding layers of networking abstraction to orchestrate things that should have been simple bare-metal scripts, and meeting government compliance (like the EU Cyber Resilience Act or NIS2 audit theater), the infrastructure is always a house of cards.

The Architect’s Metric: Model FLOPs Utilization (MFU)

How do we actually measure if our investment is a waste? We look at Model FLOPs Utilization (MFU):

$\text{MFU} = \frac{\text{Achieved model FLOP/s}}{\text{Peak hardware FLOP/s}}$

Hardware FLOPs Utilization (HFU): Counts everything the chip did, including wasteful overhead like padding, masking, or recomputing activations because you ran out of memory.
Model FLOPs Utilization (MFU): Only counts the actual math operations required to process the tokens. This is the real efficiency metric.

If your network is misconfigured, your MFU will drop to 15% to 20%, meaning 80% or more of the compute you paid for is going straight to waste. If you tune the stack, optimize kernels, and run on clean RDMA, you might get MFU up to 45% to 55%.
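You can estimate MFU from numbers you already log. The sketch below uses the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (forward plus backward), which ignores attention FLOPs and is an assumption, not something your profiler will confirm exactly. The throughput figure is hypothetical.

```python
def mfu(n_params: float, tokens_per_s: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization, with model FLOPs per token approximated
    as 6 * n_params (forward + backward pass of a dense transformer)."""
    model_flops_per_s = 6 * n_params * tokens_per_s
    return model_flops_per_s / (n_gpus * peak_flops_per_gpu)

# Hypothetical run: 70B parameters, 1,000 GPUs, 1M tokens/s cluster-wide,
# measured against an assumed 989 TFLOPS dense BF16 peak per GPU.
print(f"MFU: {mfu(70e9, 1.0e6, 1_000, 989e12):.1%}")

# Same cluster, sick network: throughput falls to 400k tokens/s.
print(f"MFU: {mfu(70e9, 0.4e6, 1_000, 989e12):.1%}")
```

Track this number per run. A sudden drop usually points at the fabric or the dataloader long before anyone thinks to blame the model code.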

Summary Checklist for Systems Architects

Before you tell your boss the code is “production-ready,” run through this checklist:

Precision Matching: Stop using FP32 where BF16 or FP8 works.
Non-Blocking Fabric: Do not build multi-node setups over standard, oversubscribed TCP networks. Use dedicated RDMA (InfiniBand or RoCE v2).
Pre-fetching and Caching: Build pipelines to pre-fetch into RAM and use GPUDirect Storage (GDS) to bypass the host CPU overhead.
Fast Storage and Metadata: Use parallel filesystems tuned for high IOPS to survive checkpoint dumps and metadata queries without triggering a split-brain.

Wrapping Up

Peak FLOPS is marketing fluff. Throughput is what actually trains your model. When planning AI infrastructure, don’t just stare at the GPU specs. If you don’t engineer the memory, network, and storage, you’re just buying a very expensive way to generate heat.
