Hello,
following the discussion at the end of today's workshop I tried out how
the scheduler behaves when issuing a larger number of jobs (Marcus
essentially told me I could use approach 3 as detailed below). To frame
my question, here is what I want to do and how I try to do it (numbers
are just to give the magnitude):
# Problem
10 binaries, 10k input files. Run every binary on every input file and
collect all the results (= parse stdout).
It seems array jobs are the tool for that; however, the size of an array
job is capped at 1000, apparently because larger arrays make the
scheduler slow.
# Approach 1
- Create one file with 10*10k lines (./binary input-file)
- Create a single array job with 1000 tasks
- Let ID be the index of the current array task
- Identify the slice (10*10k) / 1000 * ID .. (10*10k) / 1000 * (ID + 1)
- Execute all lines from the slice sequentially (see the sketch after
this list)
- Pro: only one job, no scheduling hassle on the user side
- Con: weird script logic, 100 individual tasks per scheduled array
task, sometimes bad load balancing (i.e. one task takes way longer than
the others)
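The slice logic, roughly (the file name commands.txt is made up):

#!/usr/bin/env zsh
#SBATCH --array=0-999
# each array task executes its 100-line slice of commands.txt
TOTAL=100000
PER_TASK=$(( TOTAL / 1000 ))                      # 100 lines per task
START=$(( SLURM_ARRAY_TASK_ID * PER_TASK + 1 ))   # sed counts from 1
END=$(( START + PER_TASK - 1 ))
sed -n "${START},${END}p" commands.txt | while read -r line; do
    eval "$line"
done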
# Approach 2
- Create (10*10k)/1000 files, each containing 1000 lines
- Create as many array jobs, one per file
- Each array task loads the ID'th line from its file and executes it
- Push all these jobs to the scheduler
- Pro: easier logic in each script (see the sketch after this list)
- Con: multiple jobs; I have to take care of submitting them and
waiting for the results in parallel
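Each per-file job script then shrinks to roughly this (names made up;
the chunk file is passed as an argument to sbatch, e.g.
`sbatch run_chunk.sh chunk-001.txt`):

#!/usr/bin/env zsh
#SBATCH --array=1-1000
# run exactly the ID'th line of the chunk file given as $1
sed -n "${SLURM_ARRAY_TASK_ID}p" "$1" | bash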
# Approach 3
- Create 10*10k jobs, let the scheduler deal with it
- Every job executes one task (./binary input-file)
- Pro: very simple jobs and scripts
- Con: huge number of jobs; can the scheduler handle that?
I'm using approach 1 already and it works reasonably well. That being
said, the script logic is rather involved and the load balancing is not
great: I routinely have a handful of tasks at the end that run for 10
minutes or so longer than all the others, even though a single task is
capped at one minute. This is pretty annoying. Also, we are exploring
what the best practice should be here...
I just tried approach 2 and it did not go too well, even for only about
12k tasks. To test the scaling I made every array job 100 tasks in
size, so I tried to schedule about 120 jobs.
While it went well for about 75 jobs, sbatch started to come back with
the following afterwards:
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying
and quickly afterwards:
sbatch: error: Batch job submission failed: Resource temporarily unavailable
I then tried to "relax" a bit and added a one-second delay between the
calls to sbatch (see the loop below)... and it does not change anything.
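For reference, the submission loop is essentially:

for chunk in chunks/*.txt; do
    sbatch run_chunk.sh "$chunk"
    sleep 1    # the added delay; it did not help
done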
Thus I don't have a lot of hope for approach 3...
Any comments or ideas?
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Hello,
Is it possible to activate X11 forwarding for a job submitted via SLURM?
LSF had the option
#BSUB -XF
but I cannot find anything equivalent for SLURM in the documentation.
For some commercial applications like Wolfram Mathematica or Abaqus I
need the GUI interface to run my jobs.
I should also add that I would like to run these applications in
parallel with OpenMP, which was possible with LSF.
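For illustration, this is the kind of invocation I am after (hypothetical
on my part; from the upstream Slurm docs it looks like newer versions
have a native srun option for this, but I don't know whether it is
available or configured here):

srun --x11 --ntasks=1 --cpus-per-task=12 --pty mathematica
# --cpus-per-task would cover the OpenMP threads; the thread count and
# the application name are just examples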
Or is it simply not planned to supply this feature?
Best Regards,
Marek
Dear admins and users,
I'm using the Wien2k DFT code, which is basically a suite of various
subprograms, each one doing one specific thing, glued together by
various C shell scripts. These are parallelized at a hybrid MPI/OpenMP
level (plus some parts just spawn multiple non-MPI processes with
OpenMP threads). The problem is that the various subtasks need
different numbers of MPI processes / OpenMP threads for optimal speed.
I call the software via the csh script run_lapw, which in turn calls
programs like lapw0, lapw1, lapw2 and mixer.
The Wien2k C shell glue does the parallel dispatch of the subprograms
based on some internal configuration files, which I generate on the fly
from the actual allocated nodes as obtained from SLURM. It sets
OMP_NUM_THREADS properly and calls mpirun with the correct number of
processes and a proper -machinefile <file> generated from the SLURM
input and the optimal configuration for each subprogram.
Some specific examples of optimal parallelization on the 48-CPU nodes
for different subprograms called from inside the run_lapw script:
- 8 MPI processes with 12 OpenMP threads each
- 32 non-communicating processes with 2 OpenMP threads each
- 49 MPI processes without threading
- a single process with 48 OpenMP threads
However, what I sometimes see is that the mpirun calls (Intel MPI) from
the C shell scripts are intercepted and modified, resulting in multiple
processes (threads) being bound to a single CPU and suboptimal
performance.
So basically I need a way to tell SLURM to just allocate me a full node
and not mess with the mpirun calls from inside the csh scripts, or at
least not do any CPU pinning... Something like the sketch below.
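What I am hoping for, as a sketch (I_MPI_PIN is from the Intel MPI
documentation; whether SLURM_CPU_BIND is honored here is exactly what I
don't know):

#!/usr/bin/env zsh
#SBATCH --nodes=1
#SBATCH --exclusive
# hand over the full node and disable pinning on both layers
export I_MPI_PIN=off          # Intel MPI: do not pin ranks
export SLURM_CPU_BIND=none    # Slurm: do not bind tasks in job steps
run_lapw -p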
Any advice would be appreciated.
Best regards
Pavel Ondračka
Hi,
for our simulations we have a simulation manager. For LSF this used to
issue a bsub command for each simulation. We're not using array jobs
for this as it has some additional requirements.
According to
https://doc.itc.rwth-aachen.de/download/attachments/39160017/Slurm%20and%20…
I cannot simply replace bsub with srun; I need to use sbatch.
I believe this should work: run the simulation manager with sbatch; it
should then be able to call srun for the different simulations.
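Roughly what I have in mind (all names made up for illustration):

#!/usr/bin/env zsh
#SBATCH --ntasks=8
# the manager runs inside the allocation and starts each simulation
# as a job step
./simulation_manager sims.cfg
# ...where the manager internally issues something like:
#   srun --ntasks=1 --exclusive ./simulate input_0001 &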
Will this work?
Best
Johannes
--
M.Sc. Johannes Sauer
Researcher
Institut fuer Nachrichtentechnik
RWTH Aachen University
Melatener Str. 23
52074 Aachen
Tel +49 241 80-27678
Fax +49 241 80-22196
sauer(a)ient.rwth-aachen.de
http://www.ient.rwth-aachen.de
Dear Users,
there will now be a short period during which you cannot submit jobs
and no jobs will be scheduled.
With kind regards
Marcus Wagner
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Dear users,
thanks to several reports we have discovered a problem when trying to
submit multi-node jobs that request more than 24 tasks per node. In
general a resource request looking like this should work perfectly fine:
(...)
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=48
(...)
Theoretically this would allow you to make full use of 5 nodes.
Currently, however, sbatch rejects such job scripts claiming that there
were no hosts suitable for dispatchment. Despite this, the following
request
(...)
#SBATCH --ntasks=240
(...)
works as intended while being semantically equivalent in this scenario
(5 x 48 = 240 tasks). We are not sure exactly what is causing this
problem, but we suspect a bug in Slurm, possibly in conjunction with
the Skylake-SP CPUs. If you are affected, we recommend using only
--ntasks for the time being. We will update the documentation
accordingly so that you can build your job scripts on correct
templates. The problem has been relayed to the developers; we will have
to wait for their assessment.
Please excuse any inconvenience.
Best,
Sven
--
Sven Hansen
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen (Germany)
Tel.: + 49 241 80-29114
s.hansen(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Hello Marcus,
unfortunately I get essentially the same problem: srun spawns #cores instances of the CFX solver, each of which tries to access all cores.
Since they still try to communicate with the other node over ssh, the result is the same error as below.
Regards,
Thomas
From: Marcus Wagner [mailto:wagner@itc.rwth-aachen.de]
Sent: Tuesday, 12 February 2019 15:21
To: claix18-slurm-pilot(a)lists.rwth-aachen.de
Subject: [claix18-slurm-pilot] Re: Multi-Node ANSYS simulations
Dear Thomas,
could you please test the following:
srun cfx5solve -batch -parallel -partition $SLURM_NTASKS -def job.def -par-dist "$CFXHOSTS" -start-method "Intel MPI Distributed Parallel"
Best
Marcus
On 2/12/19 11:10 AM, Gier, Thomas wrote:
Hello,
I'm having issues running ANSYS CFX calculations across multiple nodes.
Single-node simulations run fine, but multi-node configurations crash because ssh connections are being denied:
" +--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| Remote connection to ncm0791.hpc.itc.rwthaachen.de |
| (ncm0791.hpc.itc.rwth-aachen.de) could not be started, or exited |
| with return code 255. It gave the following output: |
| |
| Permission denied (publickey,gssapi-keyex,gssapi-with-mic,pass- |
| word,hostbased). |
| |
| Check that you have typed the hostname correctly, and that you |
| have an account "tg084461" on the specified host with access |
| permission from this host. You can use the following command to |
| check the connection to a UNIX machine: |
| |
| ssh ncm0791.hpc.itc.rwth-aachen.de uname |
+--------------------------------------------------------------------+"
Am I missing something in my submission script, or is this a cluster config issue?
Regards,
Thomas Gier
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Hi,
a few questions/points from my side:
- Claix 18 uses Intel OPA. What network topology is used? I guess it is
a fat tree. Is it blocking or non-blocking? Is it 1:2 blocking as on
Claix 16?
- Have you configured topology-aware resource allocation within the
SLURM scheduler? In other words, does the scheduler know the topology
and try to minimize the hop count?
- I assume TurboBoost is enabled by default? Is it possible (or will it
be possible in the future) to include an option to switch TurboBoost
off? E.g. on JURECA it is possible to disable TurboBoost with `#SBATCH
--disable-turbomode` for measurements. Otherwise, would it be possible
to set frequencies with likwid? (See also the sketch right below.)
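For reference, plain Slurm seems to offer a frequency request as well;
whether it is wired up on this cluster is an assumption on my part:

#SBATCH --cpu-freq=2100000      # in kHz, i.e. pin to 2.1 GHz
# or request a governor instead, e.g.:
#SBATCH --cpu-freq=Performance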
- I tried to run jobs with 256 nodes. I am getting an MPI error (cf. job
92414):
nrm008.hpc.itc.rwth-aachen.de.233340PSM2 no hfi units are active (err=23)
[245] MPI startup(): tmi fabric is not available and fallback fabric is
not enabled
Any ideas where this is coming from? Should I manually adjust
I_MPI_FABRICS? I don't want to set the fallback fabric, since this would
be TCP and would significantly impact performance. Affected jobs are not
canceled but are running into the time limit.
Theses errors are only occurring on the nrm nodes, not on the ncm nodes.
Could there be a problem with the nrm nodes? Currently I am using the
partition c18m, which contains both nodes types. How can I only select
ncm nodes?
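My guess: if the ncm nodes carry a node feature, something like the
first line could work (the feature name, and whether features are
defined at all, are pure assumptions on my part):

#SBATCH --constraint=ncm
# or, knowing the node names, exclude the others:
#SBATCH --exclude=nrm[001-064]   # hypothetical node range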
- Historic jobs cannot be viewed with scontrol or sstat; sacct, on the
other hand, works. For example:
$ scontrol show job 90949
slurm_load_jobs error: Invalid job id specified
$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 90945
AveCPU AvePages AveRSS AveVMSize JobID
---------- ---------- ---------- ---------- ------------
sstat: error: couldn't get steps for job 90945
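A working sacct call, for contrast (format fields chosen just for
illustration):
$ sacct -j 90945 --format=JobID,State,Elapsed,MaxRSS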
- Some of my jobs don't appear in the queue and are not scheduled, even
though sbatch returns `Submitted batch job 92365`
- You are using a hand-built module system. One issue with this
approach is that dependencies are not resolved properly. For example,
loading the python module does something unexpected. A short example:
$ module load intel; ldd main.x | grep mkl
intel/19.0 already loaded, doing nothing                     [ WARNING ]
libmkl_intel_lp64.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00002ac58b9ee000)
libmkl_core.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_core.so (0x00002ac58c53c000)
libmkl_intel_thread.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00002ac5906c8000)
$ module load python; ldd main.x | grep mkl
Loading python 2.7.12                                        [ OK ]
The SciPy Stack available: http://www.scipy.org/stackspec.html
Build with GCC compilers.
libmkl_intel_lp64.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_lp64.so (0x00002abea70ad000)
libmkl_core.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_core.so (0x00002abea7bcb000)
libmkl_intel_thread.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_thread.so (0x00002abea96ba000)
Using mkl_get_version_string() shows that the python MKL is version
2017.0.0 instead of the expected 2019.0.1 that should be loaded.
An alternative to a hand-built module system would be to use EasyBuild
to generate the module tree; this would avoid such issues.
The blueprint paper can be found here:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7830454
JSC has published their EasyBuild configuration on GitHub:
https://github.com/easybuilders/JSC
The config files from HPC-UGent are also publicly available.
Best,
Sebastian
Hi all,
I have looked into the network performance on CLAIX18. I have measured
latency and bandwidth for intra- and inter-node communication, using
the Intel IMB PingPong benchmark compiled with the modules intel/19.0
and intelmpi/2019. To get sufficient statistics I submitted 64 jobs
with 1 node using 2 tasks and 64 jobs with 2 nodes using 1 task each,
respectively. The scheduler started the jobs on different sets of
nodes. I have attached the results, showing the configuration and the
average, min and max of the measurements. (The submission is sketched
below.)
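Roughly how each inter-node measurement was submitted (a sketch; the
benchmark binary may need a full path on this system):

#!/usr/bin/env zsh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
module load intel/19.0 intelmpi/2019
srun IMB-MPI1 PingPong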
Let's first look at the inter-node communication: I measured an average
latency of 2.12 usec; the best case was 1 usec, the worst 7.1 usec. The
bandwidth is 6488 Mbytes/sec on average; the maximum is 11995
Mbytes/sec and the minimum 2483 Mbytes/sec.
The latency for intra-node communication looks okay, but the bandwidth
shows variation.
On average these results don't match the values advertised by Intel.
Either I have done something wrong, or I haven't understood the
topology, or there is a problem with the machine.
Have you run such a benchmark as well? Do you observe something similar?
@Marcus: To get a better understanding of the machine, could you please
share a bit more information on the network topology:
- How many levels does the tree have?
- On which level is the tree pruned?
- Could you send me the connectivity file / connection map, e.g. a list
of cables connecting the nodes, edge and core switches? I would like to
add the hop-count information to my results. (I have a script for
computing the hop count from a connection map; depending on the format
I just need to adjust the reading routine.)
Cheers,
Sebastian