A simple script for probing status from SLURM cluster nodes.

A common problem that appears once in a while on our cluster at CENAPAD-UFC is that not all processes are terminated on the nodes when a job is aborted. This behavior is incorrect, and we haven't been able to fix it within SLURM. The workaround I found is to periodically check the nodes to see whether any of them has a load average greater than the number of available cores. If an overloaded node is found, the administrator can log in and check for zombie processes.
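For reference, the same criterion can be checked locally on a single node with Python's standard library; this is just a sketch of the idea, separate from the cluster-wide script below:

```python
import os

# 1-, 5- and 15-minute load averages of the local machine
load1, load5, load15 = os.getloadavg()
cores = os.cpu_count()

# Flag the node if the 1-minute load exceeds the core count
if load1 > cores:
    print("Possible leftover processes: load %.2f > %d cores" % (load1, cores))
else:
    print("Load %.2f is within the %d-core budget" % (load1, cores))
```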

If you have a SLURM cluster, the best way to monitor the nodes is to install Ganglia. But if for some reason you can't install Ganglia, another option is to use pdsh and a little bit of Python scripting. The only customization needed is changing the value of numberOfCoresPerNode according to your cluster configuration. Also, partition must be the name of a partition containing all the nodes you want to monitor.

# coding=UTF-8
import os
from collections import OrderedDict

numberOfCoresPerNode = 12
partition = "superlong"

# Retrieve the list of available (allocated or idle) nodes in the partition
stream = os.popen("sinfo -p " + partition + " -t ALLOC,IDLE -o %N | tail -1")
nodeStr = stream.readline().strip()

print("Available nodes are: " + nodeStr)

# Retrieve the load of each node; pdsh prefixes every line with "<node>: "
stream = os.popen("pdsh -w " + nodeStr + " cat /proc/loadavg")
output = {}
for line in stream.readlines():
    key = line.split(":")[0]
    value = line.split(":")[1].split()[0]  # 1-minute load average
    output[key] = value

# Sort the nodes by their numeric suffix (assumes a fixed 6-character prefix)
orderedOutput = OrderedDict(sorted(output.items(), key=lambda item: int(item[0][6:])))

# Print information about each node, flagging the overloaded ones
print("Node load:")
for key in orderedOutput.keys():
    if float(orderedOutput[key]) > numberOfCoresPerNode + 4:
        print("OVERLOAD>>> " + key + ": " + orderedOutput[key])
    else:
        print(key + ": " + orderedOutput[key])
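The sort in the script assumes every node name shares a fixed 6-character prefix (that's what item[0][6:] slices off). If your node names don't fit that pattern, a regex that extracts the trailing digits is more robust; a sketch, using made-up node names:

```python
import re

def node_index(name):
    """Return the trailing number of a node name, e.g. 'node12' -> 12."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else 0

loads = {"node10": "0.12", "node2": "11.30", "node1": "0.05"}
for name in sorted(loads, key=node_index):
    print(name, loads[name])  # prints node1, node2, node10 in that order
```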

We consider a node overloaded if its load is greater than the number of cores plus 4. That threshold can be changed in the final for loop.
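To see the parsing and the threshold on a concrete line (the node name and load values below are made up), the logic boils down to:

```python
numberOfCoresPerNode = 12

# A made-up line in the format pdsh produces: "<node>: <contents of /proc/loadavg>"
line = "node07: 17.52 16.98 16.40 3/321 12345\n"

key = line.split(":")[0]               # -> "node07"
value = line.split(":")[1].split()[0]  # 1-minute load -> "17.52"

overloaded = float(value) > numberOfCoresPerNode + 4
print(key, value, "OVERLOAD" if overloaded else "OK")  # prints: node07 17.52 OVERLOAD
```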