[claix18-slurm-pilot] Semi-repeatable low CPU usage

12 Mar 2019

Hi,

At the moment we mainly use the system for benchmarking, that is, we
measure how long it takes a solver to run on a particular input. I'm not
so much interested in the result of the solver as in its runtime.
I honestly don't care about a difference of a second, or whether we
measure wall-clock or CPU time. After our solver has loaded the input
file (only a few KB) there is no further IO and the whole process is
entirely CPU bound -- so we assume 100% CPU load and that CPU time and
wall-clock time are essentially the same. All tasks are run with a
timeout (here: two minutes, plus 3 seconds of grace time to account for
CPU vs. wall-clock differences etc., measured with date) and a memout
(here: 8GB).

The corresponding part of the script looks like this (with $cmd being
the command that is run):

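# Wall-clock timestamps in milliseconds (%3N requires GNU date)
start=$(date +"%s%3N")
# No core dumps; soft limits: 8 GiB virtual memory (in KiB), 123 s CPU time
ulimit -c 0 && ulimit -S -v 8388608 && ulimit -S -t 123 && time $cmd
end=$(date +"%s%3N")
echo "time: $(( end - start ))"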

However, I observe that from time to time a task takes far longer than
it should, i.e. the measured time is well beyond 120 seconds. I
currently have an example of more than 5 minutes and have already seen
instances of almost 10 minutes.

About every second run or so (one run being an array job with 1000
individual jobs running 12 tasks each) I hit a case where one individual
task takes far longer. The output of time then looks like this:
122.24s user 0.53s system 40% cpu 5:05.67 total
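
For reference, the 40% figure follows directly from the other numbers on
that line: CPU utilisation is (user + system) divided by the elapsed
time. A minimal sketch of the arithmetic, with the values copied by hand
from the line above (the variable names are only illustrative):

```shell
# Values copied from the time output above.
user=122.24     # CPU seconds spent in user mode
system=0.53     # CPU seconds spent in kernel mode
wall=305.67     # 5:05.67 elapsed, converted to seconds

# CPU utilisation = (user + system) / wall-clock time, in percent
cpu_pct=$(LC_ALL=C awk -v u="$user" -v s="$system" -v w="$wall" \
    'BEGIN { printf "%.1f", 100 * (u + s) / w }')

# Elapsed time during which the task was not running on a CPU
idle=$(LC_ALL=C awk -v u="$user" -v s="$system" -v w="$wall" \
    'BEGIN { printf "%.1f", w - (u + s) }')

echo "cpu: ${cpu_pct}%  not on CPU: ${idle}s"
```

So the task did accumulate its roughly 123 seconds of CPU time, but
spent about three more minutes not running on a CPU at all.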

Unfortunately I cannot really reproduce it: it happens with seemingly
random inputs and only once or twice per run. It does, however, happen
rather consistently every second run or so.
Running this particular input on the login node works fine: it runs
with 100% CPU load and is stopped by ulimit after 123 seconds.
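
Note that ulimit -t limits CPU time, not elapsed time, so a task that
only gets 40% of a core is not stopped until it has accumulated 123 CPU
seconds, however long that takes. A possible safeguard is sketched
below. It assumes that GNU coreutils timeout is available on the compute
nodes, and run_limited is a hypothetical helper name, not part of the
script above:

```shell
# Run a command under both a CPU-time limit and a wall-clock limit.
# Usage: run_limited <seconds> <command> [args...]
# (hypothetical helper; assumes GNU coreutils `timeout` is installed)
run_limited() {
    limit=$1
    shift
    (
        # CPU-time limit, as in the original script; the subshell keeps
        # the limit from leaking into the rest of the job script.
        ulimit -S -t "$limit"
        # Wall-clock limit: timeout sends TERM after <limit> seconds
        # and then exits with status 124.
        exec timeout "$limit" "$@"
    )
}
```

With this, something like "run_limited 123 $cmd" would end a stalled
task after 123 seconds of elapsed time, and an exit status of 124 would
mark the cases where the wall-clock limit rather than the CPU limit was
hit.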

As I run multiple tasks within one array job this also leads to this
array job being canceled (as I compute the overall time limit from the
timeouts and assume that every task actually finishes within its
timeout), for example:

slurmstepd: error: *** JOB 491231 ON nihm017 CANCELLED AT 2019-03-11T23:57:48 DUE TO TIME LIMIT ***

(The issue happened about 6-8 minutes earlier than this message)
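
To illustrate how tight that overall limit is, here is a sketch of the
computation. It assumes the 12 tasks of one job run one after another
and that no extra slack is added; the exact formula in the real script
may differ, and the variable names are only illustrative:

```shell
tasks_per_job=12    # tasks in one array job (assumed to run sequentially)
timeout_s=120       # per-task timeout in seconds
grace_s=3           # per-task grace time in seconds

# Worst case if every task stops exactly at its limit
limit_s=$(( tasks_per_job * (timeout_s + grace_s) ))

# Format as HH:MM:SS, one of the forms sbatch accepts for --time
limit_hms=$(printf '%02d:%02d:%02d' \
    $(( limit_s / 3600 )) $(( limit_s % 3600 / 60 )) $(( limit_s % 60 )))

echo "#SBATCH --time=${limit_hms}"
```

Under these assumptions the job gets 1476 seconds, so a single task that
runs for 5 minutes instead of 2 is enough to push the whole job over its
limit.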

Can you trace this to something happening on the nodes? Or do I simply
have to rerun things until it no longer happens?

Best,
Gereon

-- 
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243