ML Versus Classic HPC | Valtteri Kinnunen

A buddy of mine was recently tearing his hair out trying to schedule jobs on a shared cluster. Half the team wanted to run huge climate simulations on legacy code held together by duct tape, and the other half was trying to train a massive language model that probably just generates marketing garbage. At first glance, classic HPC (solving actual physics equations) and modern ML (shuffling statistical weights until the vibes check out) look like completely different beasts.

But here’s the reality: at the infrastructure level, they are knife-fighting in a dark alley for the exact same hardware. Both need absurd parallel compute, low-latency networking, and job schedulers that actually schedule things instead of hanging indefinitely while you rack up cloud bills.

As scientific modeling and deep learning merge, supercomputer architectures are evolving to handle both—mostly by forcing everyone to pay the exorbitant NVIDIA tax and accept total cloud lock-in.

CPU vs. GPU Economics

Remember when classic HPC ran on CPUs? Good times. CPUs are great generalists. They handle complex logic, branching, and legacy code written before your parents were born.

GPUs, on the other hand, are specialized math bricks. They were built to render 3D graphics, which turns out to lean on the same kind of math as training a model to write subpar JavaScript: billions of simple matrix and vector operations done in parallel.

For ML, this parallel structure is a perfect match. Naturally, every developer under the sun assumes their legacy workload is “production-ready” and tries to port it to GPUs. Why? Because an accelerator-packed node gives you way more FLOPS per watt, assuming you can actually write CUDA code that doesn’t trigger a host-to-device bottleneck every three nanoseconds. Of course, you’ll need to sell a kidney to afford the hardware.
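
To see why that host-to-device copy matters so much, here is a back-of-the-envelope sketch. The function name and every number in it are made up for illustration (a kernel that takes 1 s on the CPU, a 20x raw GPU speedup, a 25 GB/s host link). It is a toy cost model, not a benchmark, and it assumes the copies do not overlap with compute.

```python
def offload_speedup(cpu_seconds, gpu_speedup, bytes_moved, link_bytes_per_s):
    """Net speedup of offloading a kernel once host<->device copies are counted.

    Toy model: copy time is simply added to GPU compute time (no overlap).
    """
    gpu_compute = cpu_seconds / gpu_speedup
    transfer = bytes_moved / link_bytes_per_s
    return cpu_seconds / (gpu_compute + transfer)


# Hypothetical kernel: 1 s on CPU, 20x faster on GPU, 25 GB/s host link.
chatty = offload_speedup(1.0, 20, bytes_moved=8e9, link_bytes_per_s=25e9)
frugal = offload_speedup(1.0, 20, bytes_moved=0.5e9, link_bytes_per_s=25e9)

print(f"moving 8 GB per call:   {chatty:.1f}x")   # about 2.7x
print(f"moving 0.5 GB per call: {frugal:.1f}x")   # about 14.3x
```

Same kernel, same GPU. The only difference is how much data crosses the bus on each call, which is why keeping data resident on the device tends to matter more than the raw FLOPS figure.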

Scientific Simulation vs. Statistical Learning

At the end of the day, how these workloads solve problems is fundamentally different:

Classic HPC (Simulation): We write explicit physical laws (think fluid dynamics or thermodynamics) directly into the code, praying the floating-point errors don’t cause the universe to explode.
Machine Learning (Optimization): We don’t program physics. We throw petabytes of data at a neural network, burn a small forest’s worth of electricity, and let it optimize weights and biases until it guesses the answer.

But they overlap heavily: both of these workloads run as tightly coupled parallel jobs. If you don’t have high interconnect bandwidth (InfiniBand or RoCE), your multi-million dollar GPU cluster will just sit idle, generating heat and burning cash. A slow network is just an expensive room heater.
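
How much does the network actually cost you? A rough sketch using the bandwidth term of the textbook ring all-reduce cost model, where each rank moves about 2(N-1)/N times the payload. The payload size, rank count, and link speeds below are hypothetical, and the model ignores latency, overlap with compute, and whatever algorithm your collective library actually picks.

```python
def ring_allreduce_seconds(payload_bytes, n_ranks, link_bytes_per_s):
    """Bandwidth-only estimate for one ring all-reduce.

    Each rank sends and receives roughly 2 * (N - 1) / N * payload bytes.
    Latency and compute/communication overlap are ignored.
    """
    if n_ranks < 2:
        return 0.0
    return 2 * (n_ranks - 1) / n_ranks * payload_bytes / link_bytes_per_s


# Hypothetical job: 14 GB of gradients synced across 64 ranks every step.
fast = ring_allreduce_seconds(14e9, 64, link_bytes_per_s=50e9)    # ~400 Gb/s link
slow = ring_allreduce_seconds(14e9, 64, link_bytes_per_s=12.5e9)  # ~100 Gb/s link

print(f"400 Gb/s: {fast:.2f} s per sync")  # about 0.55 s
print(f"100 Gb/s: {slow:.2f} s per sync")  # about 2.2 s
```

If a training step only takes a second or so of compute, the slower link in this toy example leaves the GPUs waiting most of the time.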

Workload Execution Comparison

Metric	Classic HPC Simulation	Machine Learning Training	Machine Learning Inference
Primary Compute	CPU or GPU	GPU / specialized ASIC (TPU)	GPU / specialized ASIC (LPU)
Logic Type	Physics-based equations	Data optimization	Statistical lookup / generation
Scaling Focus	Inter-node latency (MPI)	Inter-node bandwidth (NCCL)	Memory bandwidth & capacity
Execution Pattern	Batch job	Batch job	Always-on API service

Job Fragility and Checkpointing

Whether you are simulating the climate or training a giant transformer, you are running multi-day batch jobs. You are also one hardware glitch away from total failure.

In both worlds, the entire cluster acts as a single, fragile machine. If just one node drops dead, or a network switch hiccups, the whole run crashes and you lose everything since the last save. Welcome to the nightmare of checkpointing:

Weather Models: Dumping the entire physical state of the grid (wind, temperature, pressure) to disk at regular intervals.
AI Training: Dumping weights and optimizer states to disk.
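
How often should you dump? One classic rule of thumb is Young's approximation: the interval that roughly minimizes lost work is sqrt(2 * checkpoint_cost * MTBF). A minimal sketch follows; the 5-minute write time and 24-hour mean time between failures are hypothetical numbers, not measurements from any real cluster.

```python
import math


def young_interval_seconds(checkpoint_seconds, mtbf_seconds):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2 * checkpoint_seconds * mtbf_seconds)


# Hypothetical cluster: a checkpoint takes 5 minutes to write,
# and something in the job fails about once a day.
interval = young_interval_seconds(checkpoint_seconds=300, mtbf_seconds=24 * 3600)
print(f"checkpoint every {interval / 3600:.1f} hours")  # 2.0 hours
```

The useful intuition: as the cluster grows and the MTBF shrinks, the interval shrinks with it, so checkpoint write speed becomes a bigger share of the total runtime.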

If you are using a shared filesystem like Lustre, prepare for the inevitable split-brain scenarios and metadata crashes that will freeze your jobs. The engineering headache is identical: how to dump terabytes of data to disk without stalling the compute nodes. And of course, good luck explaining to the auditors how this fits into your NIS2 or CRA compliance paperwork—those regulations care about your audit theater, not your actual data integrity.
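
The "dump without stalling" problem is usually attacked by splitting the checkpoint into a fast in-memory snapshot and a slow background write. Here is a minimal standard-library sketch of that idea. The helper name `checkpoint_async` and the toy `state` dict are hypothetical; real frameworks do this with pinned host buffers and sharded writers rather than `pickle`.

```python
import copy
import os
import pickle
import tempfile
import threading


def checkpoint_async(state, path):
    """Snapshot `state` in memory, then write it to `path` in the background.

    The caller only stalls for the in-memory copy. The file appears under its
    final name via an atomic rename, so a crash mid-write never leaves a
    half-written checkpoint at `path`.
    """
    snapshot = copy.deepcopy(state)

    def _write():
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic on the same filesystem

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


# Toy usage: the "training loop" keeps mutating state while the write runs.
with tempfile.TemporaryDirectory() as workdir:
    ckpt_path = os.path.join(workdir, "ckpt.pkl")
    state = {"step": 100, "weights": [0.1, 0.2, 0.3]}

    writer = checkpoint_async(state, ckpt_path)
    state["step"] = 101  # compute moves on immediately
    writer.join()        # only needed before the next checkpoint or at exit

    with open(ckpt_path, "rb") as f:
        restored = pickle.load(f)
    print(restored["step"])  # 100, the state at snapshot time
```

The trade-off is memory: you pay for a second copy of the state while the write is in flight.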

Batch Jobs vs. Live Services

Things take a turn when we move from training to inference.

HPC simulations and ML training are batch throughput workloads. You start the job, go get coffee, and hope it finishes before the budget runs out.

But inference? That’s a live web service. Some product manager wants it wrapped in a bloated Kubernetes (K8s) cluster with service meshes, ingress controllers, and auto-scalers for a task that could have been a simple cron job. Inference cares about low latency and concurrency. While training needs insane inter-node bandwidth to sync weights, inference needs massive local GPU memory bandwidth and capacity (mostly to store the ever-growing KV cache) so it can respond to user prompts without lag.
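
To put a number on "ever-growing", here is the usual back-of-the-envelope KV cache estimate for a decoder-only transformer: one key and one value vector per layer, per KV head, per token. The model shape below is hypothetical and not taken from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size: keys plus values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model: 32 layers, 32 KV heads of dimension 128, fp16 cache.
one_user = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1, bytes_per_elem=2)
many_users = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=32, bytes_per_elem=2)

print(f"1 request:   {one_user / 2**30:.0f} GiB")    # 2 GiB
print(f"32 requests: {many_users / 2**30:.0f} GiB")  # 64 GiB
```

The cache scales linearly with both context length and concurrent requests, which is why inference boxes run out of memory capacity long before they run out of FLOPS.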

Case Study: Weather Archives and Legacy Code

Weather forecasting is a prime example of these worlds colliding.

For decades, centers like the ECMWF have accumulated petabytes of historical weather data on tape archives. Now, developers are using this data to train neural networks that can guess tomorrow’s weather in a fraction of a second.

It works great for normal days. But for rare, catastrophic events (like a volcanic eruption or a chemical leak)—where there is no historical training data—the neural network will just confidently hallucinate a sunny day. Legacy physics models are still our only hope when reality breaks the training distribution.

Porting Legacy Code to GPUs

Unfortunately, those legacy physics models are written in ancient Fortran. To get them running on modern GPU clusters, you have two choices:

Compiler Directives (OpenACC): Slapping comment-like pragmas onto the code and hoping the compiler figures out how to run it on a GPU. It’s quick, but the performance is usually mediocre.
Framework Rewrites (CUDA/Triton): Rewriting the core math kernels from scratch. It’s a grueling engineering headache, but it’s the only way to squeeze real performance out of the hardware.

Ultimately, the line between scientific research and commercial AI is blurring. The future is all about unified infrastructure that keeps the network fast, keeps the compute saturated, and prays the nodes don’t crash.
