Hi,
I observe some weird behaviour when using different paths to a binary
and an input file.
As we know, $HOME and $WORK resolve to /home/.../ and /work/..., though
/home/ and /work/ are symlinks into /rwthfs/rz/cluster/...
So it should not make a difference, right?
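For reference, this is how the two spellings can be compared (realpath
and readlink should be available on the login nodes):

% realpath $WORK
% readlink -f /home/gk809425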
I have a (statically linked) binary that behaves in a certain way (that
I want to debug...). If I call it with the canonical paths I get:
% time /rwthfs/rz/cluster/home/gk809425/smtrat_aklima/build/smtrat_2 /rwthfs/rz/cluster/work/gk809425/benchmarks/QF_NRA/hycomp/ball_count_2d_hill.01.seq_lazy_linear_enc_lemmas_global_4.smt2
(error "expected sat, but returned unsat")
/rwthfs/rz/cluster/home/gk809425/smtrat_aklima/build/smtrat_2 64.78s user 0.14s system 99% cpu 1:05.08 total
So it terminates after about 65 seconds (reproducibly).
Now I use the non-canonical paths:
% pwd
/home/gk809425/smtrat_aklima/build
% time ./smtrat_2 $WORK/benchmarks/QF_NRA/hycomp/ball_count_2d_hill.01.seq_lazy_linear_enc_lemmas_global_4.smt2
This one does not terminate even after more than four minutes...
It is also CPU-bound, so it does not seem to be waiting for IO.
Just to be sure: both commands were executed in the same session, so
the environment (loaded modules, env variables, etc.) is the same.
Can anyone guess what is going on here?
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Hello all,
I received the following new error. Apparently the run was using a host
specified in $CFXHOSTS to which I have no access.
This is the first time I have encountered this error, but I have only
been using the RWTH cluster for calculations for the past few weeks. Did
something change recently in $CFXHOSTS, or is $CFXHOSTS not yet updated
for the new machines?
Best Regards,
Marek
--------------------------------------------------------------------------------------------------------------------
An error has occurred in cfx5solve:
Remote connection to ncm0394 could not be started, or exited with return
code 255. It gave the following output:
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
Check that you have typed the hostname correctly, and that you have an
account "ms958471" on the specified host with access permission from this
host. You can use the following command to check the connection to a UNIX
machine:
ssh ncm0394 uname
or the following command if it is a Windows machine:
ssh ncm0394 cmd /c echo working
An error has occurred in cfx5solve:
The architecture string for host ncm0394 could not be determined.
cfx5solve -def 2019-02-27_Cathode_Mok_init.def -par-dist "$CFXHOSTS" -ccl 2.97s user 0.23s system 56% cpu 5.655 total
------------------------------------------------------------------------------------
When checking my access using my HPC password:
ms958471@login18-x-1:~[505]$ ssh ncm0394 uname
ms958471@ncm0394's password:
Permission denied, please try again.
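To check all hosts at once (assuming $CFXHOSTS expands to a
whitespace-separated host list; BatchMode=yes makes ssh fail instead of
prompting for a password):

% for h in $CFXHOSTS; do echo -n "$h: "; ssh -o BatchMode=yes $h uname; done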
Hi,
I'm seeing extremely slow IO on $HPCWORK on login18-1 right now. I'm
writing two files there (one only a few KB, the other one somewhat
larger...).
Right now my process is stuck writing the first (small) file, with 146
bytes written so far. Meanwhile the process is at 100% CPU (and will be
killed at some point).
I can confidently say that my process is not actually CPU-bound at this
point in the code and is really just trying to write stuff to this file...
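One way to double-check that it is stuck in the write path (assuming
strace is available; <PID> is the process id; -T shows the time spent
in each syscall):

% strace -p <PID> -e trace=write -T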
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Of course you can expect the scheduler to be fast at processing a very
large number of jobs, but the reality seems to be that it is not. It is
like spawning threads in a multi-threaded program: it works fine up to a
certain number of threads per second, but if you try to spawn too many
you will reach a point where the overhead of launching a thread eats up
its benefits in terms of parallelization. The solution to this problem
is to use thread pools or to make the chunks of work larger. The way I
see it, you have reached the point where spawning a large number of jobs
causes noticeable overhead and is thus limited by the scheduler's
configuration. Maybe the admins can tune the configuration to increase
the maximum allowed number of jobs, but if the system is already at its
limit I think you need to consider other options.
Greetings,
Eugen
On Mon, Feb 25, 2019 at 11:16 PM Philipp Berger
<berger(a)cs.rwth-aachen.de> wrote:
>
> Dear Eugen,
>
> while this would potentially solve our problem, we do _not_ want to
> write our own scheduler!
> This is what SLURM should do. We are still a bit puzzled as to why our
> use case is so outlandish - our initial expectation was to find
> matrix-job support in SLURM.
> Our Array-Job is already the result of us projecting our matrix job
> (solvers x problems x configurations) down into a single-column vector.
> Ideally, that would not be necessary. But okay, this we can deal with.
> This whole striping & scheduling business on the other hand... In my
> mind, "hiding" jobs (or rather, granularity) from the scheduler can only
> lead to problems -- and adds complexity to the user side which, again,
> can only lead to problems and sub-par performance.
>
> Kind regards,
> Philipp
Hi Gereon,
if you worry about load balancing in scenario 1, what you could do is
use a central synchronization tool like a DB, where submitted jobs fetch
one task atomically and execute it. Once there are no more tasks to
fetch from the DB, the job ends. But I'm not sure which network requests
the cluster's firewall allows. And it would be more difficult to set up.
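If a DB is not an option, the same idea also works on the shared
filesystem with flock. A rough sketch (the file names tasks.txt and
tasks.lock are made up; tasks.txt holds one "./binary input-file"
command per line):

while true; do
  task=$(flock tasks.lock sh -c '
    head -n 1 tasks.txt            # take the first remaining task...
    sed -i 1d tasks.txt            # ...and drop it, all under the lock
  ')
  [ -z "$task" ] && break          # no tasks left: the job ends
  eval "$task" < /dev/null         # run the task
done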
Greetings,
Eugen
Hello,
following the discussion at the end of today's workshop I tried out how
the scheduler behaves when issuing a larger number of jobs (Marcus
essentially told me I could use approach 3 as detailed below). To frame
my question, here is what I want to do and how I try to do it (numbers
just to convey the magnitude):
# Problem
10 binaries, 10k input files. Run every binary on every input file and
collect all the results (= parse stdout).
It seems array jobs are the tool for that; however, the size of an
array job is capped at 1000, apparently because larger jobs make the
scheduler slow.
# Approach 1
- Create one file with 10*10k lines (./binary input-file)
- Create one job with 1000 array jobs
- Let ID be the id of the current array job
- Identify the slice (10*10k) / 1000 * ID .. (10*10k) / 1000 * (ID + 1)
- Execute all lines from the slice sequentially
- Pro: Only one job, no scheduling hassle on the user side.
- Con: weird script logic (see the sketch below), 100 individual tasks
in one scheduled array job, sometimes bad load balancing (i.e. one job
takes way longer than the others)
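A minimal sketch of this slicing logic as an sbatch script (the task
file name all_tasks.txt is a placeholder; SLURM_ARRAY_TASK_ID is set by
Slurm for each array element):

#!/usr/bin/env bash
#SBATCH --array=0-999            # 1000 array elements
CHUNK=100                        # (10*10k tasks) / 1000 elements
START=$(( SLURM_ARRAY_TASK_ID * CHUNK + 1 ))
END=$(( START + CHUNK - 1 ))
# execute lines START..END of the task file sequentially
sed -n "${START},${END}p" all_tasks.txt | while read -r line; do
  eval "$line" < /dev/null       # keep the task from eating the pipe
done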
# Approach 2
- Create (10*10k)/1000 files, each containing 1000 lines (e.g. with
split, see the sketch below)
- Create as many jobs, one for each file
- Load the ID'th line from the respective file and execute it
- Push all these jobs to the scheduler
- Pro: Easier logic in each script
- Con: Multiple jobs; I have to take care of submitting them and
waiting for the results in parallel.
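The chunk files can come from coreutils split; a one-liner sketch (the
directory name chunks/ is made up):

% mkdir -p chunks
% split -d -l 1000 all_tasks.txt chunks/part-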
# Approach 3
- Create 10*10k jobs, let the scheduler deal with it
- Every job executes one task (./binary input-file)
- Pro: very simple jobs and scripts
- Con: huge number of jobs, can the scheduler handle that?
I'm using approach 1 already and it works somewhat fine. That being
said, the script logic is rather involved and load balancing is not that
great. I routinely have a handful of jobs at the end that run for 10
minutes or so longer than all the others, even though a single task is
capped at one minute. This is pretty annoying. Also, we are exploring
what the best practice should be here...
I just tried approach 2 and it did not go too well, even for only about
12k tasks. To test the scaling I made every array job 100 elements in
size, so I tried to schedule about 120 jobs.
While it went well for about 75 jobs, sbatch then started to come back
with:
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying
and quickly afterwards:
sbatch: error: Batch job submission failed: Resource temporarily unavailable
I then tried to "relax" a bit and added a one-second delay between the
calls to sbatch... and it did not change anything.
Thus I don't have a lot of hope for approach 3...
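What I would try next, as a sketch: back off on failures instead of
using a fixed delay (the chunk files and run_chunk.sh are placeholders;
sbatch returns a nonzero exit code when submission fails):

for f in chunks/*.txt; do
  until sbatch --export=TASKFILE="$f" run_chunk.sh; do
    echo "sbatch refused $f, backing off" >&2
    sleep 30                     # much longer than 1 s; tune as needed
  done
done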
Any comments or ideas?
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Hello,
Is it possible to activate X11 forwarding for a job submitted via
SLURM? LSF had this option:
#BSUB -XF
But I cannot find anything equivalent for SLURM in the documentation.
For some commercial applications like Wolfram Mathematica or Abaqus I
would need the GUI interface to run my jobs.
I should also add that I would like to run these applications in
parallel with OpenMP, which was possible with LSF.
Is it not planned to supply this feature?
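From what I can find, recent Slurm versions have native X11 support if
it was enabled in the site's build/configuration (whether that holds
here, I don't know). Would something like

% srun --x11 --pty --cpus-per-task=8 bash

with --cpus-per-task covering the OpenMP threads, be expected to work?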
Best Regards,
Marek
Dear admins and users,
I'm using the Wien2k DFT code, which is basically a suite of various
subprograms, each one doing one specific thing, glued together by
various C shell scripts. These are parallelized at a hybrid MPI/OpenMP
level (plus some parts just spawn multiple non-MPI processes with OpenMP
threads). The problem is that the various subtasks need different
numbers of MPI processes/OpenMP threads for optimal speed.
I call the software via the csh script run_lapw, which will in turn
call programs like lapw0, lapw1, lapw2 and mixer.
The Wien2k C shell glue dispatches and parallelizes the subprograms
based on some internal configuration files, which I generate on the fly
from the actually allocated nodes as reported by SLURM. It sets
OMP_NUM_THREADS properly and calls mpirun with the correct number of
processes and a proper -machinefile <file>, generated from the SLURM
input and the optimal configuration for each subprogram.
Some specific examples of optimal parallelization on a 48-CPU node for
different subprograms called from inside the run_lapw script:
- 8 MPI processes with 12 OpenMP threads each
- 32 non-communicating processes with 2 OpenMP threads each
- 49 MPI processes without threading
- a single process with 48 OpenMP threads
However, what I sometimes see is that the mpirun calls (Intel MPI) from
the C shell scripts are intercepted and modified, resulting in multiple
processes (threads) being bound to a single CPU and suboptimal
performance.
So basically I need a way to tell SLURM to just allocate me a full node
and either not mess with the mpirun calls from inside the csh scripts or
not do any CPU pinning...
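To make it concrete, these are the knobs I would try, though I'm not
sure they are the right ones for this site's setup (--exclusive asks for
the whole node, SLURM_CPU_BIND=none asks Slurm not to bind, and
I_MPI_PIN=off disables Intel MPI's own pinning):

#SBATCH --nodes=1
#SBATCH --exclusive              # whole node, all 48 CPUs
export SLURM_CPU_BIND=none       # no Slurm-side CPU binding
export I_MPI_PIN=off             # no Intel MPI pinning either
run_lapw                         # the csh glue handles the rest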
Any advice would be appreciated.
Best regards
Pavel Ondračka
Hi,
for our simulations we have a simulation manager. With LSF it used to
issue a bsub command for each simulation. (We're not using array jobs
for this, as it has some additional requirements.)
According to
https://doc.itc.rwth-aachen.de/download/attachments/39160017/Slurm%20and%20…
I cannot simply replace bsub with srun; I need to use sbatch.
I believe this should work: run the simulation manager with sbatch, and
it should then be able to do srun for the different simulations.
Will this work?
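A minimal sketch of what I have in mind, with made-up names (the
manager runs as the batch job and launches each simulation as a job
step; submit with "sbatch manager.sh"):

#!/usr/bin/env bash
#SBATCH --ntasks=4               # room for 4 concurrent simulations
for sim in sim_a sim_b sim_c sim_d; do
  # each srun is a job step inside this allocation; "&" plus the
  # final "wait" runs the steps concurrently, one task each
  srun --ntasks=1 --exclusive ./simulate "$sim" &
done
wait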
Best
Johannes
--
M.Sc. Johannes Sauer
Researcher
Institut fuer Nachrichtentechnik
RWTH Aachen University
Melatener Str. 23
52074 Aachen
Tel +49 241 80-27678
Fax +49 241 80-22196
sauer(a)ient.rwth-aachen.de
http://www.ient.rwth-aachen.de
Dear Users,
there will now be a short period during which you cannot submit jobs
and no jobs get scheduled.
With kind regards
Marcus Wagner
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de