Dear users,
thanks to several reports we have discovered a problem when trying to
submit multi-node jobs that request more than 24 tasks per node. In
general a resource request looking like this should work perfectly fine:
(...)
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=48
(...)
Theoretically this would allow you to make full use of 5 nodes.
Currently, however, sbatch rejects such job scripts, claiming that no
hosts are suitable for dispatch. Despite this, the following
request
(...)
#SBATCH --ntasks=240
(...)
works as intended while being semantically equivalent in this scenario.
We are not sure exactly what is causing this problem, but we suspect a
bug in Slurm, possibly in conjunction with the Skylake-SP CPUs. If you
are affected, we recommend using --ntasks only for the time being. We
will update the documentation accordingly so that you can build your job
scripts on correct templates. The problem has been relayed to the
developers; we will have to wait for their assessment.
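For reference, a minimal job script using the workaround could look like
this (the application name and walltime below are placeholders for
illustration, not part of the original report):

```shell
#!/usr/bin/env bash
#SBATCH --ntasks=240          # 5 nodes x 48 tasks, expressed as a total task count
#SBATCH --time=01:00:00       # placeholder walltime
srun ./my_app                 # my_app is a hypothetical MPI binary
```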
Please excuse any inconvenience.
Best,
Sven
--
Sven Hansen
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen (Germany)
Tel.: + 49 241 80-29114
s.hansen(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Hello Marcus,
unfortunately I get essentially the same problem. srun spawns #cores instances of the CFX solver, each of which tries to access all cores.
Since they still try to communicate with the other node over ssh, the result is the same error as below.
Regards,
Thomas
From: Marcus Wagner [mailto:wagner@itc.rwth-aachen.de]
Sent: Tuesday, 12 February 2019 15:21
To: claix18-slurm-pilot(a)lists.rwth-aachen.de
Subject: [claix18-slurm-pilot] Re: Multi-Node ANSYS simulations
Dear Thomas,
could you please test the following:
srun cfx5solve -batch -parallel -partition $SLURM_NTASKS -def job.def -par-dist "$CFXHOSTS" -start-method "Intel MPI Distributed Parallel"
Best
Marcus
On 2/12/19 11:10 AM, Gier, Thomas wrote:
Hello,
I'm having issues running ANSYS CFX calculations across multiple nodes.
Single-node simulations run fine, but multi-node configurations crash because ssh connections are being denied:
" +--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Remote connection to ncm0791.hpc.itc.rwthaachen.de |
| (ncm0791.hpc.itc.rwth-aachen.de) could not be started, or exited |
| with return code 255. It gave the following output: |
| |
| Permission denied (publickey,gssapi-keyex,gssapi-with-mic,pass- |
| word,hostbased). |
| |
| Check that you have typed the hostname correctly, and that you |
| have an account "tg084461" on the specified host with access |
| permission from this host. You can use the following command to |
| check the connection to a UNIX machine: |
| |
| ssh ncm0791.hpc.itc.rwth-aachen.de uname |
+--------------------------------------------------------------------+"
Am I missing something in my submission script, or is this a cluster config issue?
Regards,
Thomas Gier
_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot(a)lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave(a)lists.rwth-aachen.de
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Hi,
a few questions/points from my side:
- Claix 18 is using Intel OPA. What network topology is used? I guess it
is a Fat tree. Is it blocking or non-blocking? Is it 1:2 blocking as on
Claix 16?
- Have you configured topology-aware resource allocation within the
SLURM scheduler? In other words, does the scheduler know the topology
and try to minimize the hop count?
- I assume TurboBoost is enabled by default? Is it possible (or will it
be possible in the future) to include an option to switch TurboBoost
off? E.g. on JURECA it is possible to disable TurboBoost with `#SBATCH
--disable-turbomode` for measurements.
Alternatively, is it possible to set frequencies with likwid?
- I tried to run jobs with 256 nodes. I am getting an MPI error (cf. job
92414):
nrm008.hpc.itc.rwth-aachen.de.233340PSM2 no hfi units are active (err=23)
[245] MPI startup(): tmi fabric is not available and fallback fabric is
not enabled
Any ideas where this is coming from? Should I manually adjust
I_MPI_FABRICS? I don't want to set the fallback fabric, since this would
be TCP and would significantly impact performance. Affected jobs are not
canceled but are running into the time limit.
These errors occur only on the nrm nodes, not on the ncm nodes.
Could there be a problem with the nrm nodes? Currently I am using the
partition c18m, which contains both nodes types. How can I only select
ncm nodes?
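If manually pinning the fabric turns out to be the right direction, a
sketch might be as follows (the variable values are assumptions based on
Intel MPI conventions of that generation, not verified on CLAIX18):

```shell
# Force the OPA fabric (PSM2 via tmi) and disable the TCP fallback
export I_MPI_FABRICS=shm:tmi
export I_MPI_FALLBACK=0   # fail fast instead of silently falling back to TCP
```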
- Historic jobs cannot be viewed with scontrol or sstat; sacct, on the
other hand, works. For example:
$ scontrol show job 90949
slurm_load_jobs error: Invalid job id specified
$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 90945
AveCPU AvePages AveRSS AveVMSize JobID
---------- ---------- ---------- ---------- ------------
sstat: error: couldn't get steps for job 90945
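For completed jobs, sacct can report similar per-step averages; a
sketch using standard sacct format fields:

```shell
# sstat only works for running jobs; sacct queries the accounting database
sacct -j 90945 --format=JobID,State,Elapsed,AveCPU,AveRSS,AveVMSize
```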
- Some of my jobs don't appear in the queue and are not scheduled, even
though sbatch returned `Submitted batch job 92365`.
- You are using a hand-built module system. One issue with this approach
is that dependencies are not resolved properly. For example, loading the
module python does something unexpected. A short example:
$ module load intel; ldd main.x | grep mkl
intel/19.0 already loaded, doing
nothing [ WARNING ]
libmkl_intel_lp64.so =>
/opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_lp64.so
(0x00002ac58b9ee000)
libmkl_core.so =>
/opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_core.so
(0x00002ac58c53c000)
libmkl_intel_thread.so =>
/opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_thread.so
(0x00002ac5906c8000)
$ module load python; ldd main.x | grep mkl
Loading python 2.7.12 [ OK ]
The SciPy Stack available: http://www.scipy.org/stackspec.html
Build with GCC compilers.
libmkl_intel_lp64.so =>
/usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_lp64.so
(0x00002abea70ad000)
libmkl_core.so =>
/usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_core.so
(0x00002abea7bcb000)
libmkl_intel_thread.so =>
/usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_thread.so
(0x00002abea96ba000)
Using mkl_get_version_string() shows that the MKL shipped with the
python module is version 2017.0.0 instead of the expected version
2019.0.1 that should be loaded.
A different approach to a hand-built module system would be to use
EasyBuild to create the module system. This would avoid such issues.
The blueprint paper can be found here:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7830454
JSC has published their EasyBuild configuration on GitHub:
https://github.com/easybuilders/JSC
The config files from HPC-UGent are also publicly available.
Best,
Sebastian
Hi all,
I have looked into the network performance on CLAIX18. I measured
latency and bandwidth for intra- and inter-node communication, using
the Intel IMB PingPong benchmark compiled with the modules intel/19.0
and intelmpi/2019. To get sufficient statistics I submitted 64 jobs
with 1 node using 2 tasks, and 64 jobs with 2 nodes using 1 task each.
The scheduler started the jobs on different sets of nodes. I have
attached the results, showing the configuration and the average, min,
and max of the measurements.
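For reproduction, the setup described above could be submitted roughly
like this (the module names are taken from the text; the IMB-MPI1
binary name and the use of --wrap are my assumptions):

```shell
module load intel/19.0 intelmpi/2019
# inter-node case: 2 nodes, 1 task each
sbatch --nodes=2 --ntasks-per-node=1 --wrap "srun IMB-MPI1 PingPong"
# intra-node case: 1 node, 2 tasks
sbatch --nodes=1 --ntasks-per-node=2 --wrap "srun IMB-MPI1 PingPong"
```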
Let's first look at the inter-node communication: I measured an average
latency of 2.12 usec; the best case was 1 usec and the worst 7.1 usec.
The bandwidth is on average 6488 Mbytes/sec, with a maximum of 11995
Mbytes/sec and a minimum of 2483 Mbytes/sec.
The latency for intra-node communication looks okay, but the bandwidth
shows variation.
On average these results don't match the values advertised by Intel.
Either I have done something wrong, or I haven't understood the
topology, or there is a problem with the machine.
Have you run such a benchmark as well? Can you observe something similar?
@Marcus: To get a better understanding of the machine, could you please
share a bit more information on the network topology:
- How many levels does the tree have?
- On which level is the tree pruned?
- Could you send me the connectivity file / connection map, e.g. a list
of cables connecting the nodes, edge and core switches? I would like to
add the hop count information into my result. (I have a script for
computing the hop count from a connection map. Depending on the format I
just need to adjust the reading routine)
Cheers,
Sebastian
Dear list,
apparently the Clang installation is built against the headers of the
default GCC 4.8.5, which limits its understanding of C++ to C++11. Would
it be possible to at least target the headers of GCC 5? That would at
least support C++14. Of course, with GCC 8.2.0 being available as well,
C++17 should also be possible ;)
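As a possible interim workaround, Clang can be pointed at a newer GCC
installation explicitly via its --gcc-toolchain option (the toolchain
path below is a placeholder, not the actual install location):

```shell
# Use the GCC 8.2.0 headers/libstdc++ instead of the default GCC 4.8.5 ones
clang++ --gcc-toolchain=/path/to/gcc-8.2.0 -std=c++17 main.cpp -o main
```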
Kind regards,
Philipp
--
Philipp Berger https://moves.rwth-aachen.de/people/berger/
Software Modeling and Verification Group
RWTH Aachen University Phone +49/241/80-21206
Ahornstraße 55, 52056 Aachen, Germany
Dear all,
I made a mistake yesterday and misinterpreted the TIME column of
'squeue': for PENDING jobs it shows the requested time limit, for
RUNNING jobs it shows the time elapsed since job start.
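To display both values unambiguously, squeue's standard format codes can
help (%M is the elapsed time, %l the time limit):

```shell
squeue -u $USER -o "%.12i %.9T %.10M %.10l"   # JobID, State, Elapsed, TimeLimit
```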
So the following jobs would hit the 9 o'clock maintenance mark, and I
also had to requeue them:
111600,116276,116231,116287,119637,119647,118940,119631
Best
Marcus
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Dear all,
we will have to do another maintenance. It will begin tomorrow at 9:00
o'clock.
Sorry for the short notice.
Longer-running jobs had to be requeued. These are the following JobIDs:
117826, 116379
Best
Marcus
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de