Hi all,
some of my jobs are failing. It happens very rarely and for no apparent
reason. The log says the job got SIGKILL, although sacct just reports
COMPLETED. I had a job with this problem this week, and it ran without
issue after I restarted it. This is particularly annoying since my jobs
usually take > 1 day. I am not exceeding my requested runtime or memory
limits.
I just had another one like it. I restarted it and expect it to run
through without issue. I have attached what sacct reported; the job
failed on ncm0217.
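For reference, this is roughly the sacct query I used (job ID hypothetical):
sacct -j 1234567 --format=JobID,JobName,State,ExitCode,DerivedExitCode,Elapsed,MaxRSS
The State column reports COMPLETED even though the log shows the SIGKILL.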
Has anyone had issues like this?
Best
Johannes
--
M.Sc. Johannes Sauer
Researcher
Institut fuer Nachrichtentechnik
RWTH Aachen University
Melatener Str. 23
52074 Aachen
Tel +49 241 80-27678
Fax +49 241 80-22196
sauer(a)ient.rwth-aachen.de
http://www.ient.rwth-aachen.de
Dear all,
I have a rather simple question (and maybe this has already been asked
too many times here).
Is it still the case on both the CLAIX-2018 and CLAIX-2016 machines
managed via SLURM that a single-node calculation can be run with a
maximum wall time of 120 h instead of the standard 24 h?
Submitting such a job gives
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2245709 c16m dgrid_Ag ms368752 PD 0:00 1 (AssocMaxWallDurationPerJobLimit)
when I set, e.g.,
#SBATCH --time 48:00:00
in the corresponding job script.
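For what it's worth, I also tried to check the limit on my association myself with something along these lines (I may be looking in the wrong place):
sacctmgr show associations user=$USER format=Cluster,Account,User,Partition,MaxWall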
Many thanks in advance,
Mathias Schumacher
Hi everybody,
It seems like something happened in the past two weeks, and now the queue is extremely long. I submitted a job yesterday afternoon, and it only started today at 2 pm (and immediately exited because it wasn't debugged yet, but that is not the issue). Before that, my jobs usually started within moments (< 5 min).
It seems like an enormous number of jobs is being started at once. Is this intentional? Is there any way to restrict that (e.g., a limit on the number of jobs per user)?
Or is it just that my project has such a low priority?
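In case it is relevant: is sprio the right tool to check this? I.e., something like
sprio -l -u $USER
to see the priority factors of my pending jobs.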
Best Regards,
Marek
Dear all,
I need to compile C/C++ sources into a 32-bit binary, but 32-bit libraries are unavailable on the cluster. I have searched the available modules; none of them seem to provide 32-bit support.
These are the errors I get during compilation:
/rwthfs/rz/SW/gcc/CENTOS-7.3/7.3.0/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1plus: error while loading shared libraries: libmpfr.so.6: cannot open shared object file: No such file or directory
make[2]: *** [CMakeFiles/cwvalidator.dir/cwvalidator.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/rwthfs/rz/SW/gcc/CENTOS-7.3/7.3.0/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.3.0/cc1: error while loading shared libraries: libmpfr.so.6: cannot open shared object file: No such file or directory
make[2]: *** [CMakeFiles/cwvalidator.dir/picoc/platform.c.o] Error 1
make[1]: *** [CMakeFiles/cwvalidator.dir/all] Error 2
make: *** [all] Error 2
Would it be possible to also add 32-bit libraries to the cluster system? I am using GCC 7+, so it would be enough to install these packages: gcc-multilib and g++-multilib.
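For reference, with multilib installed, a minimal 32-bit build like the following should work (hypothetical source file):
echo 'int main(void) { return 0; }' > hello.c
gcc -m32 -o hello32 hello.c
file hello32   # should report: ELF 32-bit LSB executable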
I am looking forward to hearing from you.
Yours faithfully,
Jan Svejda
Dear all,
After my simulations finally started from the queue, they exited almost immediately with similar errors:
#1-------------------------------------------------------------------------------------------------------
+--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Cannot create working file def (linking from |
| /rwthfs/rz/cluster/work/ms958471/Ansys_Lichtbogencluster/IIW2019/- |
| 2019-05-20_Cathode_Mok_EM_Gauss2000W_withDrop_1.5mm.def): |
| |
| No space left on device |
+--------------------------------------------------------------------+
This run of the ANSYS CFX Solver has finished.
#2---------------------------------------------------------------------------------------------------------
+--------------------------------------------------------------------+
| ERROR #001100279 has occurred in subroutine ErrAction. |
| Message: |
| copy_dataset: write data block failed: No space left on device |
| |
| |
| |
| |
| |
+--------------------------------------------------------------------+
#3---------------------------------------------------------------------
+--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Error reported by IO module: recreate_indextable: warning: file |
| was not closed correctly, data may be inconsistent |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Error reported by IO module: write_index: fwrite failed writing |
| format: Bad file descriptor |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Error reported by IO module: iif_flush: write_index failed |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Error reported by IO module: cfxwriteString: (fputs failed) |
| syserr:: Bad file descriptor |
+--------------------------------------------------------------------+
These problems, on top of the extra-long queue, happened today for the first time. Two or three weeks ago, these simulations ran flawlessly. I suspect that something strange is going on.
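One thing I am not sure about: as far as I know, "No space left on device" can also mean exhausted inodes rather than exhausted disk space, so something like
df -h $WORK
df -i $WORK
might show where the problem lies. I have not verified this myself.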
Best Regards,
Marek
Hi,
I'm trying to start an interactive job with:
srun --nodes=1 --ntasks-per-node=48 --mem-per-cpu=3600MB --time=02:00:00 --pty /bin/zsh
I can get a node:
srun: [I] No output file given, set to: output_%j.txt
srun: job 2054322 queued and waiting for resources
srun: job 2054322 has been allocated resources
However after a moment the job ends:
srun: First task exited 5s ago
srun: step:2054322.0 task 0: running
srun: step:2054322.0 tasks 1-47: exited
srun: Terminating job step 2054322.0
srun: Job step aborted: Waiting up to 62 seconds for job step to
finish.
srun: error: ncm0552: task 0: Killed
This used to work before!
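A variant I may try next, in case the 48 tasks per node each spawn their own zsh and the extra ones exit immediately (just a guess on my part):
srun --nodes=1 --ntasks=1 --cpus-per-task=48 --mem-per-cpu=3600MB --time=02:00:00 --pty /bin/zsh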
Best regards
Pavel
Sorry,
my original script was:
#!/usr/local_rwth/bin/zsh
### Job name
#SBATCH -J LSDYNA_OMP
### File / path where output will be written, the %J is the job id
#SBATCH -o LSDYNA_OpenMPI.%J
### Request the time you need for execution in minutes
### The format is: [hour:]minute, for 80 minutes you can use: 1:20
#SBATCH -t 120:00:00
### Request memory you need for your job in MB
#SBATCH --mem-per-cpu=2000M
#SBATCH --nodes=1
ulimit -s 600000
### Request the number of compute slots you want to use
#SBATCH --ntasks=12
#SBATCH --mail-type=end
#SBATCH --mail-user=sim(a)isf.rwth-aachen.de
#SBATCH --account=rwth0398
### load modules
module load TECHNICS
module load intelmpi
module load lsdyna
cd $WORK/LSDYNA
# start non-interactive batch job
$MPIEXEC --propagate=STACK $FLAGS_MPI_BATCH ls-dyna_mpp_intel i=sFSWmodel.k
Without $ before STACK, just as in the documentation!
Hello everybody,
I want to run LSDYNA with intelmpi, and I'm trying the script (Distributed Memory (Multi-Node, MPI) Parallel Job) as documented here:
https://doc.itc.rwth-aachen.de/display/CC/lsdyna
However, I get this failure message:
(OK) Loading TECHNICS environment
(EE) intelmpi/2018.4.274 already loaded, try unloading it first.
(!!) Please notice: Using lsdyna requires payment.
(!!) If in doubt, please contact your institute's IT-administrator or servicedesk(a)itc.rwth-aachen.de.
(OK) Loading lsdyna R9.1.0
(!!) hybrid parallelised versions for intelmpi only
(!!) MPI parallelised versions for intelmpi or openmpi/1.8.4
/var/spool/slurm/job1878073/slurm_script:33: command not found: --propagate=STACK
Is there maybe something wrong with the script given in the documentation? The variable STACK seems to be undefined, or is it?
My job script looks like this:
---------------------------------------------------------------------------------------------------------------------------------------
#!/usr/local_rwth/bin/zsh
### Job name
#SBATCH -J LSDYNA_OMP
### File / path where output will be written, the %J is the job id
#SBATCH -o LSDYNA_OpenMPI.%J
### Request the time you need for execution in minutes
### The format is: [hour:]minute, for 80 minutes you can use: 1:20
#SBATCH -t 120:00:00
### Request memory you need for your job in MB
#SBATCH --mem-per-cpu=2000M
#SBATCH --nodes=1
ulimit -s 600000
### Request the number of compute slots you want to use
#SBATCH --ntasks=12
#SBATCH --mail-type=end
#SBATCH --mail-user=sim(a)isf.rwth-aachen.de
#SBATCH --account=rwth0398
### load modules
module load TECHNICS
module load intelmpi
module load lsdyna
cd $WORK/LSDYNA
# start non-interactive batch job
$MPIEXEC --propagate=$STACK $FLAGS_MPI_BATCH ls-dyna_mpp_intel i=sFSWmodel.k
--------------------------------------------------------------------------------------------------------------------------
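As a debugging step of my own (not from the documentation), I will probably add an echo right before the solver call to see what the modules actually set:
echo "MPIEXEC='$MPIEXEC' FLAGS_MPI_BATCH='$FLAGS_MPI_BATCH' STACK='$STACK'"
If $MPIEXEC came out empty, that would explain why the shell tried to execute --propagate=... as a command.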
Best Regards,
Marek
Hi,
I'm running a rather large job array on the integrated hosting part of
the cluster (in the moves account). Our understanding is that all the
hardware we contributed to the IH should be shared among the jobs of
this account; however, far fewer (array) jobs are running than I would
expect. Right now there is only a single job array running for this
account.
The job array has 6000 individual jobs; each needs a single core (I
don't set any arguments affecting core selection) and runs for up to
four minutes. Hence Slurm should have a rather easy time keeping every
core busy. Given that we should have 7 nodes with 48 cores each, I
expect the number of running jobs to be at least 200-300 or so
(depending on how many jobs terminate very quickly and how long Slurm
takes to start new ones).
However, I see from `squeue -A moves -t R` that the number of jobs is
usually around 20-30, sometimes below 10, and it never seems to exceed 50.
Are there any limits on how many jobs are run concurrently?
If yes: what are they? Please increase them appropriately, at least for
IH accounts, so that we can actually use our hardware...
If no: what is going on here? I don't set any particular options in the
job; the constraints are -C hpcwork -C skx8160. sinfo tells me that the
respective nodes are all available (mix or idle).
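For reference, the submission looks roughly like this (script name hypothetical):
sbatch --array=0-5999 -C hpcwork -C skx8160 run_one.sh
I am aware of the --array=0-5999%N throttle syntax for capping concurrency, but I do not use it, so nothing on my side should limit the number of concurrent array tasks.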
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Hi,
I've lately noticed some of my jobs failing (timing out) with:
srun: Job 1692770 step creation temporarily disabled, retrying
srun: error: Unable to create step for job 1692770: Unable to contact
slurm controller (connect failure)
Any ideas what could be going wrong? I've been running similar jobs for
a long time, and this type of failure seems quite recent...
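Next time it happens I will try to check the controller from within the job, e.g. with
scontrol ping
but so far I only have the srun messages above.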
Best regards
Pavel