Hi,

as of now we mainly use the system for benchmarking, that is, we measure how long it takes a solver to run on a particular input. I'm not so much interested in the result of the solver, but rather want to know the runtime. I honestly don't care about a difference of a second, or whether we use wall-clock or CPU time. After our solver has loaded the input file (only a few KB) there is no further IO and the whole process is entirely CPU bound, so we assume 100% CPU load and that CPU time and wall-clock time are essentially the same.

All tasks are run with a timeout (here: two minutes + 3 seconds grace time, accounting for CPU vs. wall clock etc., measured with date) and a memout (here: 8GB). The corresponding part of the script looks like this (with $cmd being the command that is run):

    # timestamp in milliseconds (GNU date)
    start=$(date +"%s%3N")
    # no core dumps, 8388608 KB of virtual memory, 123 seconds of CPU time
    ulimit -c 0 && ulimit -S -v 8388608 && ulimit -S -t 123 && time $cmd
    end=$(date +"%s%3N")
    # elapsed wall-clock time in milliseconds
    echo "time: $(( end - start ))"

However, I observe that from time to time a task takes way longer than it should, i.e. the wall-clock time reported is way beyond 120 seconds. I currently have an example with more than 5 minutes and have already seen instances with almost 10 minutes. About every second run (one run being an array job with 1000 individual jobs running 12 tasks each) I hit a case where one individual task takes way longer. The output of time then looks like this:

    122.24s user 0.53s system 40% cpu 5:05.67 total

Unfortunately I cannot really reproduce it: it happens with seemingly random inputs and only once or twice per run, but it does happen rather consistently every second run or so. Running this particular input on the login node is just fine: 100% CPU load, and it is stopped by ulimit after 123 seconds.

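To make the mismatch in the time line quoted above explicit, the CPU share can be recomputed from its own numbers. This is only a minimal sketch; the helper name cpu_share is made up for illustration and is not part of our job script:

```shell
# cpu_share USER_S SYSTEM_S WALL_S
# Prints the percentage of wall-clock time the task actually spent on a CPU.
# (Hypothetical helper for illustration only, not part of the job script.)
cpu_share() {
    awk -v u="$1" -v s="$2" -v w="$3" \
        'BEGIN { printf "%.0f\n", 100 * (u + s) / w }'
}

# Values from the slow task: 5:05.67 total = 5 * 60 + 5.67 = 305.67 s
cpu_share 122.24 0.53 305.67   # prints 40
```

Since ulimit -t limits CPU time rather than wall-clock time, this suggests that the task did use up its roughly 123 seconds of CPU time, but was only actually running for about 40% of the 305.67 seconds it was alive.
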
As I run multiple tasks within one array job, this also leads to the array job being canceled (I compute the overall time limit from the timeouts and assume that every task actually finishes within its timeout), for example:

    slurmstepd: error: *** JOB 491231 ON nihm017 CANCELLED AT 2019-03-11T23:57:48 DUE TO TIME LIMIT ***

(The issue itself happened about 6-8 minutes earlier than this message.)

Can you trace this to something happening on the nodes? Or do I simply have to rerun things until it no longer happens?

Best,
Gereon

-- 
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243