Hi,
a few questions/points from my side:
- Claix 18 is using Intel OPA. What network topology is used? I guess it is a Fat tree. Is it blocking or non-blocking? Is it 1:2 blocking as on Claix 16?
- Have you configured topology-aware resource allocation within the SLURM scheduler, in other words does the scheduler know the topology and try to minimize the hop count?
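For context, roughly what I mean on the configuration side, as a sketch (assuming the topology/tree plugin; the switch and node names are made up, not the actual Claix 18 layout):
# slurm.conf
TopologyPlugin=topology/tree
# topology.conf: leaf switches with their nodes, plus one top-level switch
SwitchName=leaf1 Nodes=ncm[0001-0024]
SwitchName=leaf2 Nodes=ncm[0025-0048]
SwitchName=top Switches=leaf[1-2]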
- I assume TurboBoost is enabled by default? Is it possible (or
will it be possible in the future) to include an option to switch
TurboBoost off? E.g. on JURECA it is possible to disable
TurboBoost with `#SBATCH --disable-turbomode` for measurements.
Alternatively, is it possible to set frequencies with likwid?
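What I mean by the likwid route, as a sketch (assuming likwid and its frequency daemon are set up on the compute nodes; the exact flags depend on the likwid version):
$ likwid-setFrequencies -p      # print current core frequencies and governors
$ likwid-setFrequencies -f 2.1  # pin min/max core frequency to 2.1 GHz for reproducible measurements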
- I tried to run jobs with 256 nodes. I am getting an MPI error
(cf. job 92414):
nrm008.hpc.itc.rwth-aachen.de.233340PSM2 no hfi units are active
(err=23)
[245] MPI startup(): tmi fabric is not available and fallback
fabric is not enabled
Any ideas where this is coming from? Should I manually adjust I_MPI_FABRICS? I don't want to set the fallback fabric, since that would be TCP and would significantly impact performance. The affected jobs are not canceled but run into the time limit.
These errors only occur on the nrm nodes, not on the ncm nodes.
Could there be a problem with the nrm nodes? Currently I am using
the partition c18m, which contains both node types. How can I only
select ncm nodes? Below is a sketch of the workaround I have in mind.
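The workaround, as a sketch (the explicit fabric setting, the node range, and the ncm feature name are guesses on my side, not verified):
# force the tmi fabric explicitly instead of relying on the default selection
$ export I_MPI_FABRICS=shm:tmi
# keep the job off the nrm nodes, either by excluding them (hypothetical node
# range) or by requesting a node feature, if one is defined for the ncm nodes
$ sbatch --exclude=nrm[001-064] job.sh
$ sbatch --constraint=ncm job.sh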
- Historic jobs cannot be viewed with scontrol or sstat; sacct,
on the other hand, works. For example:
$ scontrol show job 90949
slurm_load_jobs error: Invalid job id specified
$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 90945
AveCPU AvePages AveRSS AveVMSize JobID
---------- ---------- ---------- ---------- ------------
sstat: error: couldn't get steps for job 90945
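For completeness, what does work for me, as a sketch (the format fields are just the ones I happen to use):
$ sacct -j 90949 --format=JobID,JobName,State,Elapsed,MaxRSS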
- Some of my jobs do not appear in the queue and are not scheduled, even though sbatch returns `Submitted batch job 92365`.
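How I check such a job, as a sketch (92365 is one of the affected job IDs):
$ squeue -j 92365                      # the job does not show up here
$ sacct -j 92365 --format=JobID,State  # to see whether SLURM knows about it at all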
- You are using a hand-built module system. One issue with this approach is that dependencies are not resolved properly. For example, loading the python module does something unexpected. A short example:
$ module load intel; ldd main.x | grep mkl
intel/19.0 already loaded, doing nothing [ WARNING ]
libmkl_intel_lp64.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00002ac58b9ee000)
libmkl_core.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_core.so (0x00002ac58c53c000)
libmkl_intel_thread.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00002ac5906c8000)
$ module load python; ldd main.x | grep mkl
Loading python 2.7.12 [ OK ]
The SciPy Stack available: http://www.scipy.org/stackspec.html
Build with GCC compilers.
libmkl_intel_lp64.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_lp64.so (0x00002abea70ad000)
libmkl_core.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_core.so (0x00002abea7bcb000)
libmkl_intel_thread.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_thread.so (0x00002abea96ba000)
Using mkl_get_version_string() shows that the MKL picked up via the python module is version 2017.0.0 instead of the expected 2019.0.1 that should come with the loaded intel module.
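My guess is that the python module prepends its own MKL copy to LD_LIBRARY_PATH, shadowing the one from the intel module. A quick way to check, as a sketch:
$ module load intel python
$ echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -i -E 'mkl|python'
# whichever MKL directory appears first is the one the runtime linker picks up
# (unless an RPATH in the binary overrides it)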
A different approach to a hand-built module system would be to use
EasyBuild to generate the module tree; this would avoid such issues.
The blueprint paper can be found here:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7830454
JSC has published their EasyBuild configuration on GitHub:
https://github.com/easybuilders/JSC
The config files from HPC-UGent are also publicly available.
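To illustrate the dependency handling, a sketch (the easyconfig name is only an example, not necessarily one that exists in exactly this form):
$ eb Python-2.7.14-intel-2018a.eb --robot --dry-run
The idea is that --robot makes EasyBuild resolve and build the dependency chain itself, so a python built with the intel toolchain would be linked against that toolchain's MKL instead of shipping its own copy.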
Best,
Sebastian