The Compliant Node: When Dependency Failures are a Feature

Every infrastructure engineer knows the cold sweat. A management node reboots, the compute fleet drops dead, and you’re suddenly staring down a massive outage because you thought you could run a “highly available” cluster without Pacemaker drinking itself to death. You SSH in, frantically type systemctl status nfs-server --failed, and watch the terminal fill with angry red text.

inactive (dead)
Dependency failed for Cluster Controlled nfs-server

Your immediate reaction is to start aggressively forcing services back up. Don’t. If you’re dealing with a High Availability (HA) cluster managed by the absolute misery of Corosync and Pacemaker, manual intervention is the fastest way to trigger a glorious split-brain.

That wall of red text isn’t a failure. It’s the system working exactly as designed. Your node isn’t broken; it’s just obeying the cluster state machine to prevent a catastrophic storage apocalypse.

The Illusion of Local Failure

On a standalone server, systemd is a simple dictator. A service dies, you restart it, you go drink coffee. In HA land, you have to throw that troubleshooting muscle memory directly in the bin.

When an active/passive cluster node reboots, it wakes up with intentional amnesia. If the cluster already migrated your resources to the peer node, the rebooted node must sit down and shut up.

Sure, we could have over-engineered this with a Kubernetes cluster running on a fleet of multi-million dollar NVIDIA H100s, paying the cloud lock-in tax just to serve a few files. Or we could have filled out endless compliance theater spreadsheets to satisfy NIS2 audits. Instead, we chose the classic, intimate misery of Corosync.

If the node unilaterally started NFS while the other node was also serving it, you’d get a split-brain. That’s a one-way ticket to corrupted storage, lost metadata, and a weekend spent explaining why your shared filesystem looks like it went through a blender.

So the cluster forcefully suppresses local services. The node is perfectly fine; it’s just being a compliant subordinate to prevent a data-destroying mutiny.

Anatomy of a Dependency Chain

Pacemaker builds strict, sequential dependency chains because developers can’t be trusted to write self-healing code. They assume their app is “production-ready” when it’s actually just three shell scripts in a trench coat, built on a house of cards.

A typical clustered NFS stack looks like this:

Storage: Mount the underlying block devices (fs-data-nfs-optcluster).
Network: Bind the Floating Virtual IP (vip-nfs).
Application: Start the NFS Server (service-nfs).

This ordering is absolute. If Step 2 fails, Step 3 is dead in the water. The cluster won’t even think about attempting to start the NFS daemon if the virtual IP isn’t ready.

When you check the logs and see Job nfs-server.service/start failed with result 'dependency', the system is screaming: “I cannot start because the resource that must happen before me is missing!”

Finding the Real Fire

Recently, one of our management nodes (mgmt-01) bit the dust, leaving dozens of compute nodes hanging while they waited for NFS mounts. The rebooted node was throwing dependency errors left and right, but the actual fire was elsewhere.

I ran pcs status to see what kind of cluster disaster was unfolding:

  * Resource Group: nfs:
    * fs-data-nfs-optcluster    (ocf::heartbeat:Filesystem):     Started mgmt-02
    * vip-nfs   (ocf::heartbeat:IPaddr2):        Stopped
    * service-nfs       (systemd:nfs-server):    Stopped

Look at that closely. During the failover chaos, the storage successfully migrated to mgmt-02—which, if you’ve ever dealt with Lustre split-brains or metadata crashes, is basically a mathematical miracle. But the network stack blinked. The Virtual IP (vip-nfs) failed to bind on the new host.

Since the VIP failed to start, the entire dependency chain snapped. The active node refused to start the NFS service, and the passive node was ordered to stay down. The entire setup was held hostage by one tiny networking hiccup.

Trust the State Machine

The fix wasn’t rebuilding the NFS stack or editing local configs. In fact, it was just a single, simple cluster command:

pcs resource cleanup vip-nfs

I ran that on the active node, which cleared the transient error state and forced the cluster to try binding the IP again. As soon as the IP came online, the dependency was satisfied, and NFS spun right up.

The big takeaway? When you’re dealing with HA clusters, local logs will straight-up lie to you. The next time you see a wall of ‘Dependency Failed’ errors after a reboot, take a breath. Don’t fight systemd.

Zoom out, query the cluster state machine, and find where the chain is actually broken. Those scary red errors are just the sound of your architecture saving your data—and sparing you from a NIS2 regulatory compliance audit nightmare.

It’s almost comforting when the system is smart enough to fail on purpose.

ML Versus Classic HPC

The "Virtual Supercomputer": Why the Slurm Operator Pattern Bridges the Gap Between Research and Cloud Ops