May 2019 - claix18-slurm-pilot - lists.rwth-aachen.de

Jobs occasionally failing
by Johannes Sauer 27 May '19

27 May '19

Hi all, some of my jobs are failing. It happens very rarely and with no apparent reason. Log says it got SIGKILL, although sacct just says COMPLETED. I had a job this week with this problem and it ran without issue after restarting it. This is particularly annoying since my jobs usually take > 1 day. I'm not exceeding my requested runtime or memory limits. I had just another one like it. I restarted it and believe it will run through without issue. I attached what sacct reported. It failed on ncm0217. Anyone had issues like this? Best Johannes -- M.Sc. Johannes Sauer Researcher Institut fuer Nachrichtentechnik RWTH Aachen University Melatener Str. 23 52074 Aachen Tel +49 241 80-27678 Fax +49 241 80-22196 sauer(a)ient.rwth-aachen.de http://www.ient.rwth-aachen.de

2 1

Single node jobs and wall time limit
by Mathias Schumacher 26 May '19

26 May '19

Dears all, I have a rather simple question (and maybe this has already been asked too many times here). Is it still the case on both the CLAIX-2018 and CLAIX-2016 machines managed via SLURM that a single node calculation can be run with 120 h max. wall time instead of the standard 24 h wall time? Submitting such a job gives JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 2245709 c16m dgrid_Ag ms368752 PD 0:00 1 (AssocMaxWallDurationPerJobLimit) when I set, e.g., #SBATCH --time 48:00:00 in the corresponding job script. Many thanks in advance, Mathias Schumacher

1 0

Long Queue
by simon＠isf.rwth-aachen.de 23 May '19

23 May '19

Hi everybody, It seems like in the past two weeks something happened and now the queue is extremely long. I submitted a job yesterday afternoon and only today at 2pm ist started (and immediately exited as it wasnt debugged yet, but this is not the issue). Before that it usually started within moments (<5min). It seems like an enormous ammount of jobs is being started at once. Is this purposefully? Is there any chance to restrict that (like # of jobs per user)? Or is it just, that my project has such a low priority? Best Regards, Marek

4 4

Compiling C/C++ into a 32bit binary
by Svejda, Jan 23 May '19

23 May '19

Dear all, I need to compile C/C++ sources into a 32bit binary, though 32bit libraries are unavailable on the cluster. I have searched the available modules, none of them seem to be dedicated to 32bit. These are the compilation errors I get upon compilation: /rwthfs/rz/SW/gcc/CENTOS-7.3/7.3.0/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1plus: error while loading shared libraries: libmpfr.so.6: cannot open shared object file: No such file or directory make[2]: *** [CMakeFiles/cwvalidator.dir/cwvalidator.cpp.o] Error 1 make[2]: *** Waiting for unfinished jobs.... /rwthfs/rz/SW/gcc/CENTOS-7.3/7.3.0/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1: error while loading shared libraries: libmpfr.so.6: cannot open shared object file: No such file or directory make[2]: *** [CMakeFiles/cwvalidator.dir/picoc/platform.c.o] Error 1 make[1]: *** [CMakeFiles/cwvalidator.dir/all] Error 2 make: *** [all] Error 2 Would it be possible to add also 32-bit libraries onto the cluster system? I am using the GCC 7+ compiler, so it would be enough to install these libraries: gcc-multilib, g++-multilib. I am looking forward to hearing from you. Yours faithfully, Jan Svejda

2 1

No space left on device
by simon＠isf.rwth-aachen.de 23 May '19

23 May '19

Dear all, After my simulations were finally started from the queue, they exited almost immediately with simmilar errors: #1------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------+ | An error has occurred in cfx5solve: | | | | Cannot create working file def (linking from | | /rwthfs/rz/cluster/work/ms958471/Ansys_Lichtbogencluster/IIW2019/- | | 2019-05-20_Cathode_Mok_EM_Gauss2000W_withDrop_1.5mm.def): | | | | No space left on device | +--------------------------------------------------------------------+ This run of the ANSYS CFX Solver has finished. #2--------------------------------------------------------------------------------------------------------- +--------------------------------------------------------------------+ | ERROR #001100279 has occurred in subroutine ErrAction. | | Message: | | copy_dataset: write data block failed: No space left on device | | | | | | | | | | | +--------------------------------------------------------------------+ #3--------------------------------------------------------------------- +--------------------------------------------------------------------+ | An error has occurred in cfx5solve: | | | | Error reported by IO module: recreate_indextable: warning: file | | was not closed correctly, data may be inconsistent | +--------------------------------------------------------------------+ +--------------------------------------------------------------------+ | An error has occurred in cfx5solve: | | | | Error reported by IO module: write_index: fwrite failed writing | | format: Bad file descriptor | +--------------------------------------------------------------------+ +--------------------------------------------------------------------+ | An error has occurred in cfx5solve: | | | | Error reported by IO module: iif_flush: write_index failed | +--------------------------------------------------------------------+ +--------------------------------------------------------------------+ | An error has occurred in cfx5solve: | | | | Error reported by IO module: cfxwriteString: (fputs failed) | | syserr:: Bad file descriptor | +--------------------------------------------------------------------+ These are problems on top of the extra long queue that have happend today for the first time. 2 or 3 weeks ago, these simulations would run flawlessly. I suspect that something strange is happening. Best Regards, Marek

2 1

interactive jobs exit immediately
by Pavel Ondračka 17 May '19

17 May '19

Hi, I'm trying to start an interactive job with: srun --nodes=1 --ntasks-per-node=48 --mem-per-cpu=3600MB -- time=02:00:00 --pty /bin/zsh I can get a node: srun: [I] No output file given, set to: output_%j.txt srun: job 2054322 queued and waiting for resources srun: job 2054322 has been allocated resources However after a moment the job ends: srun: First task exited 5s ago srun: step:2054322.0 task 0: running srun: step:2054322.0 tasks 1-47: exited srun: Terminating job step 2054322.0 srun: Job step aborted: Waiting up to 62 seconds for job step to finish. srun: error: ncm0552: task 0: Killed This used to work before! Best regards Pavel

2 1

Correction: LS Dyna
by simon＠isf.rwth-aachen.de 09 May '19

09 May '19

Sorry, my original script was: #!/usr/local_rwth/bin/zsh ### Job name #SBATCH -J LSDYNA_OMP ### File / path where output will be written, the %J is the job id #SBATCH -o LSDYNA_OpenMPI.%J ### Request the time you need for execution in minutes ### The format is: [hour:]minute, for 80 minutes you can use: 1:20 #SBATCH -t 120:00:00 ### Request memory you need for your job in MB #SBATCH --mem-per-cpu=2000M #SBATCH --nodes=1 ulimit -s 600000 ### Request the number of compute slots you want to use #SBATCH --ntasks=12 #SBATCH --mail-type=end #SBATCH --mail-user=sim(a)isf.rwth-aachen.de #SBATCH --account=rwth0398 ### load modules module load TECHNICS module load intelmpi module load lsdyna cd $WORK/LSDYNA # start non-interactive batch job $MPIEXEC --propagate=STACK $FLAGS_MPI_BATCH ls-dyna_mpp_intel i=sFSWmodel.k Without $ before STACK, just as in the documentation!

2 1

LS Dyna
by simon＠isf.rwth-aachen.de 09 May '19

09 May '19

Hello everybody, I want to run LSDYNA with intelmpi and im trying the script (Distributed Memory (Multi-Node, MPI) Parallel Job) as documented here: https://doc.itc.rwth-aachen.de/display/CC/lsdyna However i get this fail-message: (OK) Loading TECHNICS environment (EE) intelmpi/2018.4.274 already loaded, try unloading it first. (!!) Please notice: Using lsdyna requires payment. (!!) If in doubt, please contact your institute's IT-administrator or servicedesk(a)itc.rwth-aachen.de. (OK) Loading lsdyna R9.1.0 (!!) hybrid parallelised versions for intelmpi only (!!) MPI parallelised versions for intelmpi or openmpi/1.8.4 /var/spool/slurm/job1878073/slurm_script:33: command not found: --propagate=STACK Is there maybe something wrong with the script given on the documentation? the variable STACK seems to be undefined, or is it? My job script looks like this: --------------------------------------------------------------------------------------------------------------------------------------- #!/usr/local_rwth/bin/zsh ### Job name #SBATCH -J LSDYNA_OMP ### File / path where output will be written, the %J is the job id #SBATCH -o LSDYNA_OpenMPI.%J ### Request the time you need for execution in minutes ### The format is: [hour:]minute, for 80 minutes you can use: 1:20 #SBATCH -t 120:00:00 ### Request memory you need for your job in MB #SBATCH --mem-per-cpu=2000M #SBATCH --nodes=1 ulimit -s 600000 ### Request the number of compute slots you want to use #SBATCH --ntasks=12 #SBATCH --mail-type=end #SBATCH --mail-user=sim(a)isf.rwth-aachen.de #SBATCH --account=rwth0398 ### load modules module load TECHNICS module load intelmpi module load lsdyna cd $WORK/LSDYNA # start non-interactive batch job $MPIEXEC --propagate=$STACK $FLAGS_MPI_BATCH ls-dyna_mpp_intel i=sFSWmodel.k -------------------------------------------------------------------------------------------------------------------------- Best Regards, Marek

1 0

Low number of running jobs
by Gereon Kremer 08 May '19

08 May '19

Hi, I'm running a rather large job array on the integrated hosting part (in the moves account). In our understanding the whole hardware we contributed to the IH should be split among all jobs of this account, however way less (array) jobs are running than I would expect. Right now there is only a single job array running for this account. The job array has 6000 individual jobs, each needs a single core (I don't set any arguments affecting core selection) and is running for up to four minutes. Hence slurm should have a rather easy job to keep every core busy. Given that we should have 7 nodes with 48 cores each, I expect the number of running jobs to be at least 200-300 or so. (Depending on how many jobs terminate very quickly and how long slurm takes to start new ones). However I see from `squeue -A moves -t R` that the number ob jobs is usually around 20-30, sometimes below 10 and never seems to exceed 50. Are there any limits on how many jobs are run concurrently? If yes: What are these? Please increase them appropriately, at least for IH accounts, so that we can actually use our hardware... If no: What is going on here? I don't set any particular options in the job, constraints are -C hpcwork -C skx8160. sinfo tells me that the respective nodes are all available (mix or idle). Best, Gereon -- Gereon Kremer Lehr- und Forschungsgebiet Theorie Hybrider Systeme RWTH Aachen Tel: +49 241 80 21243

1 1

srun: step creation temporarily disabled, retrying
by Pavel Ondračka 07 May '19

07 May '19

Hi, I've lately noticed some of my jobs failing (timing out) with: srun: Job 1692770 step creation temporarily disabled, retrying srun: error: Unable to create step for job 1692770: Unable to contact slurm controller (connect failure) Any ideas what could be going wrong? I've been running similar jobs for a long time and this type of failures seem quite recent... Best regards Pavel

1 0