January 2019 - claix18-slurm-pilot - lists.rwth-aachen.de

General questions and problems with 256 node jobs
by Sebastian Achilles 05 Feb '19

05 Feb '19

Hi, a few questions/points form my side: - Claix 18 is using Intel OPA. What network topology is used? I guess it is a Fat tree. Is it blocking or non-blocking? Is it 1:2 blocking as on Claix 16? - Have you configured topology-aware resource allocation within the SLURM scheduler, in other words does the scheduler knows the topology and tries to minimize the hop-count? - I assume TurboBoost is enabled by default? Is it possible (or will it be possible in the future) to include an option to switch TurboBoost off? E.g. on JURECA it is possible to disable TurboBoost with `#SBATCH --disable-turbomode` for measurements. Otherwise it is possible to set frequencies with likwid? - I tried to run jobs with 256 nodes. I am getting an MPI error (cf. job 92414): nrm008.hpc.itc.rwth-aachen.de.233340PSM2 no hfi units are active (err=23) [245] MPI startup(): tmi fabric is not available and fallback fabric is not enabled Any ideas where this is coming from? Should I manually adjust I_MPI_FABRICS? I don't want to set the fallback fabric, since this would be TCP and would significantly impact performance. Affected jobs are not canceled but are running into the time limit. Theses errors are only occurring on the nrm nodes, not on the ncm nodes. Could there be a problem with the nrm nodes? Currently I am using the partition c18m, which contains both nodes types. How can I only select ncm nodes? - Historic jobs can not be viewed with scontrol or sstat. sacct on the other hand works. For example: $ scontrol show job 90949 slurm_load_jobs error: Invalid job id specified $ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 90945 AveCPU AvePages AveRSS AveVMSize JobID ---------- ---------- ---------- ---------- ------------ sstat: error: couldn't get steps for job 90945 - Some of my jobs don't appear in the queue and are not scheduled, even if sbatch returns `Submitted batch job 92365` - You are using a hand-build module system. One issue with this approach is that dependencies are not dissolved properly. For example loading the module python does something unexpected. A short example: $ module load intel; ldd main.x | grep mkl intel/19.0 already loaded, doing nothing [ WARNING ] libmkl_intel_lp64.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00002ac58b9ee000) libmkl_core.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_core.so (0x00002ac58c53c000) libmkl_intel_thread.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00002ac5906c8000) $ module load python; ldd main.x | grep mkl Loading python 2.7.12 [ OK ] The SciPy Stack available: http://www.scipy.org/stackspec.html Build with GCC compilers. libmkl_intel_lp64.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_lp64.so (0x00002abea70ad000) libmkl_core.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_core.so (0x00002abea7bcb000) libmkl_intel_thread.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_thread.so (0x00002abea96ba000) Using mkl_get_version_string() shows that the python mkl is version 2017.0.0 instead of the expected 2019.0.1 version that should be loaded. A different approach to a hand-build module system would be using easybuild to create the module system. This would avoid such issues. The blue print paper can be found here: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7830454 JSC has published there easybuild configuration on github: https://github.com/easybuilders/JSC The config files from HPC-UGent are also publicly available. Best, Sebastian

3 6

Network Performance
by Sebastian Achilles 05 Feb '19

05 Feb '19

Hi all, I have looked into the Network Performance on CLAIX18. I have measured latency and bandwidth for intra and inter node communication. I have used the Intel IMB PingPong Benchmark complied with the modules intel/19.0 and intelmpi/2019. To get a sufficient statistic I have submitted 64 jobs with 1 node using 2 tasks and 64 jobs with 2 nodes using 1 task each, respectively. The scheduler has started the jobs on different sets of nodes. I have attached the results, showing the configuration and the average, min and max of the measurement. Let's first look at the inter node communication: I have measured an average latency of 2.12 usec. In the best case I measured 1 usec and in the maximum is 7.1 usec. The bandwidth is on average 6488 Mbytes/sec. The maximum is 11995 Mbytes/sec and minimum is 2483 Mbytes/sec. The latency for intra node communication looks okay, however the bandwidth shows variation. On average theses results don't correspond with the advertised values from Intel. Either I have done something wrong or I haven't understood the topology or there is a problem with the machine. Have you run such a benchmark as well? Can you observe something similar? @Marcus: To get a better understanding of the machine, could you please share a bit more information on the network topology: - How many levels does the tree have? - On which level is the tree pruned? - Could you send me the connectivity file / connection map, e.g. a list of cables connecting the nodes, edge and core switches? I would like to add the hop count information into my result. (I have a script for computing the hop count from a connection map. Depending on the format I just need to adjust the reading routine) Cheers, Sebastian

2 1

Clang & C++14
by Philipp Berger 05 Feb '19

05 Feb '19

Dear list, apparently the Clang installation is build against the headers of the default GCC 4.8.5, which limits its understanding of C++ to C++11. Would it be possible to at least target the headers of GCC5? That would at least support C++14. Of course, with GCC8.2.0 being available as well, C++17 should also be possible ;) Kind regards, Philipp -- Philipp Berger https://moves.rwth-aachen.de/people/berger/ Software Modeling and Verification Group RWTH Aachen University Phone +49/241/80-21206 Ahornstraße 55, 52056 Aachen, Germany

4 5

interactive sessions, X11 forwarding
by Alekseeva, Uliana 31 Jan '19

31 Jan '19

Hi all, I have a question: how to do the X11 forwarding during interactive sessions? What I do: srun --nodes=4 --account=jara0172 --time=06:00:00 --exclusive --x11 --pty /usr/local_rwth/bin/zsh But if I try to run xclock, I get an error: X11 connection rejected because of wrong authentication. Error: Can't open display: localhost:99.0 Best regards, Uliana --------- Dr. Uliana Alekseeva Forschungszentrum Jülich Institute for Advanced Simulation D-52425 Jülich Phone: (49) 2461 61 3429 IT Center RWTH Aachen University HPC Group Seffenter Weg 23 52074 Aachen Phone: (49) 241 80 29711

2 1

missing modules on backend
by Jonas Becker 25 Jan '19

25 Jan '19

Hi, today the jobs I submit to CLAIX-2018 exit within one second. The new modules, like intelmpi/2019 and openmpi/3.1.3, do not exist on the back end host executing the batch script, nrm020.hpc.itc.rwth-aachen.de . +(0):ERROR:0: Unable to locate a modulefile for 'openmpi/3.1.3' Instead, the module list on the back end seems to be the one of the old LFS cluster parts. (the Claix2018 login node does still offer all new modules) Fixing the --partition=c18m flag does not change this for me. Example Job with error: Job ID 90942 Best, Jonas

4 4

SLURM Array Jobs
by Philipp Berger 24 Jan '19

24 Jan '19

Dear list, I am looking to run large numbers of jobs as an Array Job using SLURM. I created a job file "jobs.txt", containing one configuration per line. Using SLURM_ARRAY_TASK_ID I select the appropriate line for the current task and execute the corresponding configuration in the batch script. My test file currently holds 1308 configurations, which I was unable to submit using sbatch, as the MaxArrayTask variable seems to be set to 1001. What is the optimal/proposed way of scheduling large numbers of configurations? I need one core (or thread?!) per configuration (#SBATCH --cpus-per-task 1 and #SBATCH --ntasks-per-core 1 (are both necessary?)), 20G of memory per configuration (#SBATCH --mem-per-cpu=20480M). My hope was to build these large configuration lists and then later submit them to our own batch system so that it would use all available nodes to crunch all configurations blockwise serialised as fast as possible. Is that not what the SLURM Array Jobs are supposed to do? Kind regards, Philipp -- Philipp Berger https://moves.rwth-aachen.de/people/berger/ Software Modeling and Verification Group RWTH Aachen University Phone +49/241/80-21206 Ahornstraße 55, 52056 Aachen, Germany

2 1

Fwd: IntelMPI/2019 problems with our code
by Marcus Wagner 18 Jan '19

18 Jan '19

-------- Forwarded Message -------- Subject: IntelMPI/2019 problems with our code Date: Fri, 18 Jan 2019 11:44:57 +0100 From: Jonas Becker <jonas.becker2(a)rwth-aachen.de> To: Marcus Wagner <wagner(a)itc.rwth-aachen.de> Hi, our simulation aborts on CLAIX-2018 when using the intelmpi/2019 or intelmpi/2018 module. The new OpenMPI module works for us. The master rank waits for data from the worker ranks in a busy loop: while(!flag) { if (is_chkpt_time()) M_write(); MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &stat); } The problem occurs pretty much reproducibly ~9 minutes after start of the batch job. It occurs even at 2 mpi tasks (one master and one worker process). During the first 9 minutes, the simulation works flawlessly. All calls are in the MPI_COMM_WORLD communicator and work up to this point. Then the simulation aborts: Abort(635909) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe: Invalid communicator, error stack: PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, flag=0x7ffe77c98c30, status=0x7ffe77c98c1c) failed PMPI_Iprobe(90).: Invalid communicator Do some of you successfully use intelmpi2018/2019 on the Claix-2018 cluster? Our MPI code may contain a few mistakes that throw off intelmpi but not openmpi. If simulations >10minutes are possible for everyone else, we will look into finding and fixing it in our own code. Another thing is that every time intelmpi is run in claix2018 batch mode, it throws this warning: MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT environment variable is not supported. MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started. Best, Jonas On 1/17/19 10:04 AM, Marcus Wagner wrote: > Dear all, > > > we canceled the actual maintenance, but made a new one for tomorrow 9 > o'clock. So all Jobs, which finish before that time, will run now. > > > Btw. > > this list was not intended as a announcement list from our side. > Weren't there any problems? Is everything clear regarding SLURM and > CLAIX18 for you? > > If that is the case, we are really happy, but I barely can believe this. > > > > Best > Marcus > >

1 1

Maintenance ended
by Marcus Wagner 18 Jan '19

18 Jan '19

Best Marcus -- Marcus Wagner, Dipl.-Inf. IT Center Abteilung: Systeme und Betrieb RWTH Aachen University Seffenter Weg 23 52074 Aachen Tel: +49 241 80-24383 Fax: +49 241 80-624383 wagner(a)itc.rwth-aachen.de www.itc.rwth-aachen.de

1 0

Maintenance interruption
by Marcus Wagner 17 Jan '19

17 Jan '19

Dear all, we canceled the actual maintenance, but made a new one for tomorrow 9 o'clock. So all Jobs, which finish before that time, will run now. Btw. this list was not intended as a announcement list from our side. Weren't there any problems? Is everything clear regarding SLURM and CLAIX18 for you? If that is the case, we are really happy, but I barely can believe this. Best Marcus -- Marcus Wagner, Dipl.-Inf. IT Center Abteilung: Systeme und Betrieb RWTH Aachen University Seffenter Weg 23 52074 Aachen Tel: +49 241 80-24383 Fax: +49 241 80-624383 wagner(a)itc.rwth-aachen.de www.itc.rwth-aachen.de

2 3

Maintenance of CLAIX18
by Marcus Wagner 16 Jan '19

16 Jan '19

Good morning everyone, we have to do a maintenance now, so no new jobs will be scheduled onto the hosts. The running jobs will not be affected. With kind regards Marcus Wagner -- Marcus Wagner, Dipl.-Inf. IT Center Abteilung: Systeme und Betrieb RWTH Aachen University Seffenter Weg 23 52074 Aachen Tel: +49 241 80-24383 Fax: +49 241 80-624383 wagner(a)itc.rwth-aachen.de www.itc.rwth-aachen.de

1 0