
A buddy of mine was recently tearing his hair out trying to schedule jobs on a shared cluster. Half the team wanted to run huge climate simulations on legacy code held together by duct tape, and the other half was trying to train a massive language model that probably just generates marketing garbage. At first glance, classic HPC (solving actual physics equations) and modern ML (shuffling statistical weights until the vibes check out) look like completely different beasts.
But here's the reality: at the infrastructure level, they are knife-fighting in a dark alley for the exact same hardware. Both need absurd parallel compute, low-latency networking, and job schedulers that actually schedule things instead of hanging indefinitely while you rack up cloud bills.
As scientific modeling and deep learning merge, supercomputer architectures are evolving to handle both, mostly by forcing everyone to pay the exorbitant NVIDIA tax and accept total cloud lock-in.
CPU vs. GPU Economics
Remember when classic HPC ran on CPUs? Good times. CPUs are great generalists. They handle complex logic, branching, and legacy code written before your parents were born.
GPUs, on the other hand, are specialized math bricks. They were built to render 3D graphics, which turns out to be much the same math as training a model to write subpar JavaScript: doing billions of simple matrix operations in parallel.
For ML, this parallel structure is a perfect match. Naturally, every developer under the sun assumes their legacy workload is "production-ready" and tries to port it to GPUs. Why? Because an accelerator-packed node gives you way more FLOPS per watt, assuming you can actually write CUDA code that doesn't trigger a host-to-device bottleneck every three nanoseconds. Of course, you'll need to sell a kidney to afford the hardware.
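To put a rough number on that host-to-device bottleneck, here is a back-of-the-envelope sketch in Python. It compares the time to ship data across the host-to-device link with the time to crunch it. The function name `offload_estimate` and both hardware figures are made-up placeholders for illustration, not the specs of any real card, and the model ignores overlap between transfers and compute.

```python
def offload_estimate(bytes_moved, flops, link_bytes_per_s, device_flops_per_s):
    """Return (transfer_seconds, compute_seconds, transfer_bound)."""
    transfer_s = bytes_moved / link_bytes_per_s
    compute_s = flops / device_flops_per_s
    return transfer_s, compute_s, transfer_s > compute_s


# Illustrative placeholders, NOT the specs of any real hardware.
LINK_BYTES_PER_S = 25e9      # assumed host-to-device link bandwidth
DEVICE_FLOPS_PER_S = 10e12   # assumed sustained device throughput
FLOAT32_BYTES = 4

# Case 1: add two vectors of 1e9 float32 values (2 inputs in, 1 result out).
n = 1_000_000_000
vector_add = offload_estimate(
    3 * n * FLOAT32_BYTES, n, LINK_BYTES_PER_S, DEVICE_FLOPS_PER_S
)

# Case 2: multiply two 8192x8192 float32 matrices (about 2 * m^3 FLOPs).
m = 8192
matmul = offload_estimate(
    3 * m * m * FLOAT32_BYTES, 2 * m**3, LINK_BYTES_PER_S, DEVICE_FLOPS_PER_S
)

for name, (t, c, bound) in [("vector add", vector_add), ("matmul", matmul)]:
    verdict = "transfer-bound" if bound else "compute-bound"
    print(f"{name}: transfer {t:.4f}s, compute {c:.4f}s -> {verdict}")
```

Under these assumed numbers the vector add spends almost all its time on the bus, while the matrix multiply does enough math per byte to keep the device busy. That ratio of work to data moved is usually what decides whether a port pays off.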
Scientific Simulation vs. Statistical Learning
At the end of the day, how these workloads solve problems is fundamentally different:
- Classic HPC (Simulation): We write explicit physical laws (think fluid dynamics or thermodynamics) directly into the code, praying the floating-point errors don't cause the universe to explode.
- Machine Learning (Optimization): We don't program physics. We throw petabytes of data at a neural network, burn a small forest's worth of electricity, and let it optimize weights and biases until it guesses the answer.
But they overlap heavily: both of these workloads run as tightly coupled parallel jobs. If you don't have high interconnect bandwidth (InfiniBand or RoCE), your multi-million-dollar GPU cluster will just sit idle, generating heat and burning cash. A slow network is just an expensive room heater.
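For a feel of how much the interconnect matters, here is a small Python estimate of one gradient sync. It assumes a ring all-reduce, where each rank sends and receives 2(N-1)/N times the payload, and it counts bandwidth only: no latency term, no overlap with compute. Real collective libraries may pick other algorithms, and the job size and link speeds below are hypothetical.

```python
def ring_allreduce_seconds(payload_bytes, num_ranks, link_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce.

    Each rank sends (and receives) 2 * (N - 1) / N times the payload.
    """
    if num_ranks < 2:
        return 0.0  # nothing to exchange with a single rank
    traffic = 2 * (num_ranks - 1) / num_ranks * payload_bytes
    return traffic / link_bytes_per_s


# Hypothetical job: 7e9 parameters, 2-byte gradients, 64 ranks.
grad_bytes = 7e9 * 2
fast = ring_allreduce_seconds(grad_bytes, 64, 50e9)    # 400 Gb/s link
slow = ring_allreduce_seconds(grad_bytes, 64, 1.25e9)  # 10 Gb/s link
print(f"400 Gb/s: {fast:.2f} s per sync, 10 Gb/s: {slow:.2f} s per sync")
```

With these assumptions the fast link syncs in roughly half a second and the slow one takes about 22 seconds per step, during which the GPUs mostly wait.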
Workload Execution Comparison
| Metric | Classic HPC Simulation | Machine Learning Training | Machine Learning Inference |
|---|---|---|---|
| Primary Compute | CPU or GPU | GPU / specialized ASIC (TPU) | GPU / specialized ASIC (LPU) |
| Logic Type | Physics-based equations | Data optimization | Statistical lookup / generation |
| Scaling Focus | Inter-node latency (MPI) | Inter-node bandwidth (NCCL) | Memory bandwidth & capacity |
| Execution Pattern | Batch job | Batch job | Always-on API service |
Job Fragility and Checkpointing
Whether you are simulating the climate or training a giant transformer, you are running multi-day batch jobs. You are also one hardware glitch away from total failure.
In both worlds, the entire cluster acts as a single, fragile machine. If one single node drops dead, or a network switch hiccups, the whole run crashes. Welcome to the nightmare of checkpointing:
- Weather Models: Dumping the entire physical state of the grid (wind, temperature, pressure) to disk at regular intervals.
- AI Training: Dumping weights and optimizer states to disk.
If you are using a shared filesystem like Lustre, prepare for the inevitable split-brain scenarios and metadata crashes that will freeze your jobs. The engineering headache is identical: how to dump terabytes of data to disk without stalling the compute nodes. And of course, good luck explaining to the auditors how this fits into your NIS2 or CRA compliance paperwork. Those regulations care about your audit theater, not your actual data integrity.
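One common answer to the "don't stall the compute" problem is asynchronous checkpointing: take a quick in-memory snapshot, then let a background thread do the slow write. The sketch below is a toy version using only the Python standard library. The class name `AsyncCheckpointer` and the file naming scheme are invented for illustration, `copy.deepcopy` stands in for what would be a device-to-host copy in a real training job, and nothing here handles sharding across ranks.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Snapshot state in memory, then write it to disk on a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self._writer = None

    def save(self, state, step):
        # Keep at most one write in flight: wait for the previous one first.
        self.wait()
        # Short blocking phase: copy the state so training can keep mutating it.
        snapshot = copy.deepcopy(state)
        self._writer = threading.Thread(target=self._write, args=(snapshot, step))
        self._writer.start()

    def _write(self, snapshot, step):
        final_path = os.path.join(self.directory, f"ckpt_{step:08d}.pkl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename is atomic on POSIX filesystems, so a crash mid-write
        # never leaves a half-written file under the final name.
        os.replace(tmp_path, final_path)

    def wait(self):
        if self._writer is not None:
            self._writer.join()
            self._writer = None


def run_demo(directory):
    """Pretend to train for three steps, checkpointing after each one."""
    ckpt = AsyncCheckpointer(directory)
    state = {"step": 0, "weights": [0.0] * 1000}
    for step in range(1, 4):
        state["step"] = step
        state["weights"][0] += 1.0  # stand-in for a training update
        ckpt.save(state, step)
    ckpt.wait()
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_demo(d))
```

The write-to-temp-then-rename step is the part worth stealing: a restart only ever sees complete checkpoints. The trade-off is memory, since the snapshot doubles the footprint of the state while the write is in flight.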
Batch Jobs vs. Live Services
Things take a turn when we move from training to inference.
HPC simulations and ML training are batch throughput workloads. You start the job, go get coffee, and hope it finishes before the budget runs out.
But inference? That's a live web service. Some product manager wants it wrapped in a bloated Kubernetes (K8s) cluster with service meshes, ingress controllers, and auto-scalers for a task that could have been a simple cron job. Inference cares about low latency and concurrency. While training needs insane inter-node bandwidth to sync weights, inference needs massive local GPU memory bandwidth and capacity (mostly to store the ever-growing KV cache) so it can respond to user prompts without lag.
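To see why the KV cache eats so much memory, here is a quick Python sizing sketch. It assumes a plain transformer that stores one key and one value vector per KV head, per layer, per token, with no quantization or paging tricks. The model dimensions are hypothetical and the function name `kv_cache_bytes` is just for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Size of the KV cache in bytes; the leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024**3

# Hypothetical model: 32 layers, 32 KV heads of dimension 128, 16-bit values,
# serving 4096-token contexts.
one = kv_cache_bytes(32, 32, 128, 4096, batch_size=1)
many = kv_cache_bytes(32, 32, 128, 4096, batch_size=32)
print(f"1 request: {one / GIB:.1f} GiB, 32 concurrent requests: {many / GIB:.1f} GiB")
```

Under these assumptions a single 4096-token request needs 2 GiB of cache and 32 concurrent ones need 64 GiB, before counting the model weights. The cache grows linearly with both context length and concurrency, which is why capacity becomes the constraint for serving.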
Case Study: Weather Archives and Legacy Code
Weather forecasting is a prime example of these worlds colliding.
For decades, centers like the ECMWF have accumulated petabytes of historical weather data on tape archives. Now, developers are using this data to train neural networks that can guess tomorrow's weather in a fraction of a second.
It works great for normal days. But for rare, catastrophic events (like a volcanic eruption or a chemical leak), where there is no historical training data, the neural network will just confidently hallucinate a sunny day. Legacy physics models are still our only hope when reality breaks the training distribution.
Porting Legacy Code to GPUs
Unfortunately, those legacy physics models are written in ancient Fortran. To get them running on modern GPU clusters, you have two main choices:
- Compiler Directives (OpenACC): Slapping comment-like pragmas onto the code and hoping the compiler figures out how to run it on a GPU. It's quick, but the performance is usually mediocre.
- Framework Rewrites (CUDA/Triton): Rewriting the core math kernels from scratch. It's a grueling engineering headache, but it's the only way to squeeze real performance out of the hardware.
Ultimately, the line between scientific research and commercial AI is blurring. The future is all about unified infrastructure that keeps the network fast, keeps the compute saturated, and prays the nodes don't crash.