A simple script for probing the status of SLURM cluster nodes.

A problem that shows up once in a while on our cluster at CENAPAD-UFC is that not all processes are terminated on the nodes when a job is aborted. This is incorrect behavior, and we haven’t been able to fix it within SLURM. The workaround I found is to regularly check the nodes to see whether any of them has a load average greater than the number of cores available. If an overloaded node is found, the administrator can log in and check for zombie processes.

If you have a SLURM cluster, the best way to monitor the nodes is to install Ganglia. But if by any chance you can’t install Ganglia, another solution is to use pdsh and a little bit of scripting in Python. The only customization needed is to change the value of numberOfCoresPerNode according to your cluster configuration. Also, partition must be the name of a partition that contains all the nodes you want to monitor.

# coding=UTF-8
import os
from collections import OrderedDict

numberOfCoresPerNode = 12
partition = "superlong"

# Retrieve the list of available nodes.
# "tail -1" keeps only the last line of the sinfo output.
stream = os.popen("sinfo -p " + partition + " -t ALLOC,IDLE -o %N | tail -1")
nodeStr = stream.readline().strip()

print("Available nodes are: " + nodeStr)

# Retrieve the load of each node. pdsh prefixes every line with "<node>: ".
stream = os.popen("pdsh -w " + nodeStr + " cat /proc/loadavg")
output = {}
for line in stream.readlines():
    key = line.split(":")[0]
    # The first field of /proc/loadavg is the 1-minute load average.
    value = line.split(":")[1].split(" ")[1]
    output[key] = value

# Sort the output dictionary by the numeric suffix of the node names.
# The [6:] slice assumes a six-character name prefix; adjust it for your cluster.
orderedOutput = OrderedDict(sorted(output.items(), key=lambda item: int(item[0][6:])))

# Print information about each node.
# An overloaded node gets an extra flagged line before its regular entry.
print("Node load: ")
for key in orderedOutput.keys():
    if float(orderedOutput[key]) > numberOfCoresPerNode + 4:
        print("OVERLOAD>>>" + key + ":" + orderedOutput[key])
    print(key + ":" + orderedOutput[key])
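
The [6:] slice in the sort key only works when every node name has a prefix of exactly six characters. If your node names don't follow that pattern, a natural-sort key is one possible alternative. The sketch below is not part of the original script: node_sort_key is a hypothetical helper, and the sample node names are made up for illustration.

```python
import re


def node_sort_key(name):
    """Split a node name into text and integer chunks for natural ordering.

    Hypothetical helper: "node10" becomes ["node", 10, ""], so numeric
    parts are compared as numbers instead of as strings.
    """
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdigit() else p for p in parts]


# Made-up node names, only to show the resulting order.
nodes = ["node10", "node2", "gpu-node1", "node1"]
print(sorted(nodes, key=node_sort_key))
# prints ['gpu-node1', 'node1', 'node2', 'node10']
```

In the script, this key could replace the lambda by sorting with key=lambda item: node_sort_key(item[0]).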

We consider a node overloaded if its load is greater than the number of cores plus 4. That threshold can be changed in the final for loop.
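
To make that threshold easier to adjust, the check can be pulled into a small function. This is a minimal sketch, not the script itself: parse_loadavg, overloaded_nodes and the margin parameter are names introduced here for illustration, and the sample lines only imitate what pdsh prints for cat /proc/loadavg.

```python
def parse_loadavg(lines):
    """Map node name -> 1-minute load from "<node>: l1 l5 l15 ..." lines."""
    loads = {}
    for line in lines:
        node, _, rest = line.partition(":")
        fields = rest.split()
        if fields:  # skip empty or malformed lines
            loads[node.strip()] = float(fields[0])
    return loads


def overloaded_nodes(loads, cores_per_node, margin=4):
    """Return the nodes whose load is greater than cores_per_node + margin."""
    limit = cores_per_node + margin
    return sorted(node for node, load in loads.items() if load > limit)


# Made-up sample in the format pdsh produces for /proc/loadavg.
sample = [
    "node01: 0.52 0.48 0.45 1/312 20481\n",
    "node02: 17.20 16.90 16.10 18/340 20977\n",
]
loads = parse_loadavg(sample)
print(overloaded_nodes(loads, cores_per_node=12))
# prints ['node02']
```

With 12 cores and the default margin of 4, only loads above 16 are reported; passing a different margin changes the tolerance without touching the loop.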

