Hi Gereon,

which filesystem do you use?

Best
Marcus

On 3/13/19 10:25 AM, Gereon Kremer wrote:
Hi,
a quick update on this, as I have somewhat narrowed it down. It seems to come down to IO: simply opening a small file (in this case less than 4K) is sometimes incredibly slow. I observe the following piece of code (multiple functions squashed together) taking up to 8 seconds, where it usually takes about 1 ms:
print_current_time("parsing input"); if (pathToInputFile == "-") { /* not executed */ } std::ifstream infile(pathToInputFile); if (!infile.good()) { /* not executed */ } print_current_time("parse");
For reference:

slurmstepd: error: *** JOB 520597 ON nihm019 CANCELLED AT 2019-03-12T23:04:13 DUE TO TIME LIMIT ***

This particular event happened some 14-17 minutes before that.
Best, Gereon
On 3/12/19 9:16 AM, Gereon Kremer wrote:
Hi,
(as of now) we mainly use the system for benchmarking, that is, we measure how long a solver takes to run on a particular input. I'm not so much interested in the result of the solver, but rather in the runtime. I honestly don't care much about a difference of a second, or whether we use wall-clock or CPU time. After our solver has loaded the input file (of only a few KB) there is no further IO and the whole process is entirely CPU bound -- so we assume 100% CPU load and treat CPU time and wall-clock time as essentially the same. All tasks are run with a timeout (here: two minutes, plus 3 seconds grace time to account for CPU vs. wall-clock differences etc.; measured with date) and a memout (here: 8GB).
The corresponding part of the script looks like this (with $cmd being the command that is run):
start=`date +"%s%3N"` ulimit -c 0 && ulimit -S -v 8388608 && ulimit -S -t 123 && time $cmd end=`date +"%s%3N"` echo "time: $(( end - start ))"
However, I observe that from time to time a task takes far longer than it should, i.e. the reported time is way beyond the 120-second limit. I currently have an example of more than 5 minutes and have already seen instances of almost 10 minutes.
About every second run or so (one run being an array job with 1000 individual jobs, each running 12 tasks), I hit a case where one individual task takes way longer. The time output then looks like this:

122.24s user 0.53s system 40% cpu 5:05.67 total

Note the 40% CPU: the process consumed only about 122 seconds of CPU time (so the CPU-time ulimit could not fire earlier), yet the wall-clock time grew to over five minutes.
Unfortunately I cannot really reproduce it: it happens with seemingly random inputs and only once or twice per run. It does, however, happen rather consistently every second run or so. Running the affected input on the login node is just fine: 100% CPU load, stopped by ulimit after 123 seconds.
As I run multiple tasks within one array job, this also causes the array job itself to be canceled, because I compute the overall time limit from the per-task timeouts and assume that every task actually finishes within its timeout (see the sketch below the log excerpt). For example:
slurmstepd: error: *** JOB 491231 ON nihm017 CANCELLED AT 2019-03-11T23:57:48 DUE TO TIME LIMIT ***
(The issue happened about 6-8 minutes before this message.)
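For reference, the overall limit is derived roughly like this (a sketch with illustrative variable names; the slack value is made up):

tasks_per_job=12        # tasks run within one individual job
per_task_limit=123      # seconds per task: 120 s timeout + 3 s grace
slack=60                # illustrative extra seconds for startup etc.
total=$(( tasks_per_job * per_task_limit + slack ))
# 12 * 123 + 60 = 1536 s, i.e. about 26 minutes per individual job
printf -- "--time=%02d:%02d:%02d\n" $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))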
Can you trace this to something happening on the nodes? Or do I simply have to rerun things until the problem no longer occurs?
Best, Gereon
_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de
--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de