Hi,
a quick update on this, as I have narrowed it down somewhat.
It seems to come down to I/O: simply opening a small file (in this case
less than 4 KB) is sometimes incredibly slow. I observe the following
piece of code (multiple functions squashed together) taking up to 8
seconds, while usually it takes about 1 ms:
print_current_time("parsing input");
if (pathToInputFile == "-") { /* not executed */ }
std::ifstream infile(pathToInputFile);
if (!infile.good()) { /* not executed */ }
print_current_time("parse");
For reference:
slurmstepd: error: *** JOB 520597 ON nihm019 CANCELLED AT
2019-03-12T23:04:13 DUE TO TIME LIMIT ***
And this particular event happened some 14-17 minutes before that.
Best,
Gereon
On 3/12/19 9:16 AM, Gereon Kremer wrote:
Hi,
(as of now) we mainly use the system for benchmarking, that is, we
measure how long it takes a solver to run on a particular input. I'm not
so much interested in the result of the solver, but rather in its
runtime.
I honestly don't care about a difference of a second, or whether we use
wall-clock or CPU time. Once our solver has loaded the input file (only a
few KB) there is no further I/O and the whole process is entirely CPU
bound -- so we assume 100% CPU load and treat CPU time and wall-clock
time as essentially the same. All tasks are run with a timeout (here: two
minutes plus 3 seconds grace time to account for CPU vs. wall-clock
differences etc., measured with date) and a memout (here: 8 GB).
The corresponding part of the script looks like this (with $cmd being
the command that is run):
start=`date +"%s%3N"`
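# no core dumps; 8388608 KB = 8 GiB of address space; 123 s = 120 s timeout + 3 s grace (soft CPU-time limit)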
ulimit -c 0 && ulimit -S -v 8388608 && ulimit -S -t 123 && time $cmd
end=`date +"%s%3N"`
echo "time: $(( end - start ))"
However, I observe that from time to time a task takes way longer than
it should, i.e. the time that is output is way beyond 120. I currently
have an example of above 5 minutes and have already seen instances of
almost 10 minutes.
About every second run or so (one run being an array job with 1000
individual jobs running 12 tasks each) I hit a case where one individual
task takes way longer. The time output would then look like this:
122.24s user 0.53s system 40% cpu 5:05.67 total
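That is, CPU time is right where the 123 s ulimit kicks in, but the
wall-clock total of 5:05.67 (about 306 s) means the process spent roughly
three minutes not running at all: 122.77 s of CPU over 305.67 s of wall
clock is the 40% reported above.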
Unfortunately, I cannot really reproduce it: it happens with seemingly
random inputs and only once or twice per run. It does, however, happen
rather consistently every second run or so.
Running this particular input (on the login node) works just fine: 100%
CPU load, and it is stopped by ulimit after 123 seconds.
As I run multiple tasks within one array job, this also leads to the
array job being canceled (since I compute the overall time limit from the
per-task timeouts and assume that every task actually finishes within its
timeout), for example:
slurmstepd: error: *** JOB 491231 ON nihm017 CANCELLED AT
2019-03-11T23:57:48 DUE TO TIME LIMIT ***
(The issue happened about 6-8 minutes earlier than this message)
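For scale: one individual job runs 12 tasks of at most 123 s each, so
(assuming the tasks run one after another) it should finish within
roughly 12 * 123 s, i.e. about 25 minutes; a single task overrunning by
several minutes is enough to push it past the time limit I request.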
Can you trace this to something happening on the nodes? Or do I simply
have to rerun things until it no longer happens?
Best,
Gereon
_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de