Hi all,
apparently we have a limit of 100 concurrent jobs per user in Slurm.
This seems reasonable for the (shared) main cluster, as we wouldn't get
more than that scheduled anyway while other users want to use the system as
well.
The situation is somewhat different for the integrated hosting part,
however (though this comes with a few questions from my side):
My understanding is that we have exclusive access to our hardware. (Is
this the case? Or do we only have "prioritized" access and the hardware
is used by others as well if idle?)
In any case, we would expect that a user (from our ih project) can use all our
hardware, provided that no other user (from our ih project) is using it
at the same time.
If we do the math, however: we provide more than 300 cores, but only 100
jobs get scheduled. At the same time we will probably have only one or
two users using our partition frequently, so compute time is essentially wasted...
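For what it's worth, I assume the limit sits in the QOS settings; if so,
something like the following should display it (untested guess on my side):
sacctmgr show qos format=Name,MaxJobsPU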
Long story short:
Could we increase this limit (at least for ih partitions)?
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Hi,
I've recently noticed that some of my jobs hang (or at least take many hours to finish, while normally they run for around 3 hours). The problematic part seems to be the rsync calls.
Since I have had bad experiences with the network disks' throughput and stability, I do something like this in my jobfiles:
CASE=$(basename "$SLURM_SUBMIT_DIR")
TMP=/w0/tmp/slurm_$(whoami).$SLURM_JOB_ID/
rsync -a "$SLURM_SUBMIT_DIR" "$TMP"   # stage input to the node-local disk
cd "$TMP/$CASE"
# DO WORK
rsync -a --exclude '*.dayfile' "$TMP/$CASE/" "$SLURM_SUBMIT_DIR"   # copy results back
scp "$TMP/$CASE/"*.dayfile "$SLURM_SUBMIT_DIR/"
rm -rf "$TMP/$CASE/"
i.e., I copy the data to the local drive to speed the calculations up. Mostly it works OK; however, sometimes the job hangs. Connecting to the node with "srun --jobid <jobid> --pty /bin/zsh" and running ps ux shows:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
sl119982 212008 0.0 0.0 124916 2032 ? S 11:56 0:00 /bin/zsh /var/spool/slurm/job437663/slurm_script
sl119982 218531 0.0 0.0 118248 1532 ? S 12:01 0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982 218532 0.0 0.0 117928 876 ? S 12:01 0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982 218533 0.0 0.0 118188 768 ? D 12:01 0:00 rsync -a --exclude *.dayfile /w0/tmp/slurm_sl119982.437663//6-4/ /rwthfs/rz/cluster/work/sl119982/TiAlON/x_0.6667_y_0.0000_g_0.0625/6-4
sl119982 223991 0.0 0.0 127240 2440 pts/0 Ss 12:13 0:00 /bin/zsh
sl119982 224467 0.0 0.0 115588 2220 pts/0 S 12:13 0:00 bash
sl119982 233247 0.0 0.0 155380 1928 pts/0 R+ 12:28 0:00 ps ux
There are 3?!? rsync processes running and all are sleeping? I have no idea what is going on. I attached to the rsync process with gdb to see what it is doing:
gdb attach 218533
bt
#0 0x00002b2909102620 in __close_nocancel () from /lib64/libc.so.6
#1 0x0000564544f79de6 in recv_files ()
#2 0x0000564544f84161 in do_recv ()
#3 0x0000564544f849ac in start_server ()
#4 0x0000564544f84af5 in child_main ()
#5 0x0000564544fa3ce9 in local_child ()
#6 0x0000564544f67e9b in main ()
And actually, after detaching from the process, it somehow got going again, switched back to the running state, and everything finished.
Any ideas? The jobid of the last stuck job was 437663 if anyone wants to investigate. I'll send more jobids when I see this again...
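In the meantime I might guard the copy-back with a timeout, roughly like
this (untested sketch on my side), so a stall at least fails loudly instead
of eating the whole walltime:
if ! timeout 30m rsync -a --exclude '*.dayfile' "$TMP/$CASE/" "$SLURM_SUBMIT_DIR"; then
    echo "rsync stalled, retrying verbosely" >&2
    timeout 30m rsync -av --exclude '*.dayfile' "$TMP/$CASE/" "$SLURM_SUBMIT_DIR"
fi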
Best regards
Pavel
Hi,
I have a question about hyperthreading. Previously, when I wanted to
allocate a full node with hyperthreading, I did this:
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-core=2
#SBATCH --ntasks-per-node=96
#SBATCH --mem=180G
This no longer works, i.e., setting --ntasks-per-node to anything higher
than 48 yields
sbatch: error: Batch job submission failed: Requested node
configuration is not available
even when --ntasks-per-core=2 is set. Any ideas?
How can I allocate a full node with hyperthreading?
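In case it helps with the diagnosis, this is the variant I would try next
(untested guess on my side, assuming the nodes have 48 physical cores with
2 hardware threads each):
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=2
#SBATCH --threads-per-core=2
#SBATCH --mem=180G
i.e., one task per physical core with both hardware threads assigned to it,
which should still use all 96 threads of the node.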
Best regards
Pavel Ondračka
Hi,
I used --gres=gpu:1 to ask for a node with a GPU available, and then I got
the following message:
sbatch: error: Batch job submission failed: Requested node configuration is
not available
Does this mean that the GPU cluster is not ready?
Is this temporary, or is there a plan to bring the GPU cluster back?
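For what it's worth, a way to check which partitions advertise GPUs at all
might be (untested guess at the right query on my side):
sinfo -o "%P %D %t %G"
The %G column should list the GRES (e.g. gpu:2) of the nodes, if any GPU
nodes are up.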
Best wishes,
Li
______________________________
Zhijian Li
Institute for Computational Genomics
RWTH Aachen University
Pauwelsstrasse 19
52074 Aachen, Germany
Hi,
I observe a weird behaviour when using different paths to a binary and
an input file.
As we know, $HOME and $WORK resolve to /home/.../ and /work/..., though
/home/ and /work/ are symlinks into /rwthfs/rz/cluster/...
So it should not make a difference, right?
I have a (statically linked) binary that behaves in a certain way (which
I want to debug...). If I call it with the canonical paths, I get:
% time /rwthfs/rz/cluster/home/gk809425/smtrat_aklima/build/smtrat_2 /rwthfs/rz/cluster/work/gk809425/benchmarks/QF_NRA/hycomp/ball_count_2d_hill.01.seq_lazy_linear_enc_lemmas_global_4.smt2
(error "expected sat, but returned unsat")
/rwthfs/rz/cluster/home/gk809425/smtrat_aklima/build/smtrat_2  64.78s user 0.14s system 99% cpu 1:05.08 total
So it terminates after about 65 seconds. (repeatably)
Now I use the non-canonical paths:
% pwd
/home/gk809425/smtrat_aklima/build
% time ./smtrat_2 $WORK/benchmarks/QF_NRA/hycomp/ball_count_2d_hill.01.seq_lazy_linear_enc_lemmas_global_4.smt2
This one does not terminate even after more than four minutes...
Also, it is CPU-bound, so it does not seem to be waiting for IO.
Just to be sure: those commands were executed in the same session, so it
is the same environment in terms of loaded modules, env variables, etc.
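To double-check that both spellings really refer to the same files, something
like this should help (sketch, not run yet):
% readlink -f ./smtrat_2
% readlink -f /rwthfs/rz/cluster/home/gk809425/smtrat_aklima/build/smtrat_2
Both should print the same canonical path, and likewise for the input file.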
Can anyone guess what is going on here?
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Dear all,
sorry for the short notice.
There will be a maintenance tomorrow starting at 9 o'clock. It might last
until the afternoon, but I expect it to end earlier.
Sorry for the inconvenience.
Best
Marcus
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner(a)itc.rwth-aachen.de
www.itc.rwth-aachen.de
Hello all,
I received the following new error. Apparently the run was using a host specified in $CFXHOSTS to which I have no access.
It is the first time I have encountered this error, but I have only been using the RWTH cluster for calculations for the past few weeks. Did something change recently in $CFXHOSTS, or is $CFXHOSTS not yet updated for the new machines?
Best Regards,
Marek
--------------------------------------------------------------------------------------------------------------------
An error has occurred in cfx5solve:
Remote connection to ncm0394 could not be started, or exited with return
code 255. It gave the following output:
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
Check that you have typed the hostname correctly, and that you have an
account "ms958471" on the specified host with access permission from this
host. You can use the following command to check the connection to a UNIX
machine:
ssh ncm0394 uname
or the following command if it is a Windows machine:
ssh ncm0394 cmd /c echo working
An error has occurred in cfx5solve:
The architecture string for host ncm0394 could not be determined.
cfx5solve -def 2019-02-27_Cathode_Mok_init.def -par-dist "$CFXHOSTS" -ccl 2.97s user 0.23s system 56% cpu 5.655 total
------------------------------------------------------------------------------------
When checking my access using my HPC password:
ms958471@login18-x-1:~[505]$ ssh ncm0394 uname
ms958471@ncm0394's password:
Permission denied, please try again.
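In case it is useful, a loop like this should show which hosts from $CFXHOSTS
are reachable at all (untested sketch; it assumes the usual host*n,host*n
format of $CFXHOSTS):
for h in $(echo "$CFXHOSTS" | tr ',' '\n' | cut -d'*' -f1); do
    ssh -o BatchMode=yes "$h" uname || echo "no access to $h"
done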
Hi,
I'm seeing issues with extremely slow IO on $HPCWORK on login18-1 right
now. I'm writing two files there (one only a few K, the other one
somewhat larger...)
Right now my process is stuck writing the first (small) file, with 146
bytes written so far. Meanwhile the process sits at 100% CPU (and will be
killed at some point).
I can confidently say that my process is not actually CPU-bound at this
point in the code and is really just trying to write stuff to this file...
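As a crude probe of the write latency, something like this (untested sketch)
should return almost instantly on a healthy filesystem:
time dd if=/dev/zero of=$HPCWORK/iotest bs=4k count=1 oflag=sync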
Best,
Gereon
--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
Of course you can expect the scheduler to be fast at processing a very
large number of jobs, but the reality seems to be that it is not. Like
spawning threads in a multi-threaded program: it works fine up to a
certain number of threads per second, but if you try to spawn too many you
will hit a point where the overhead of launching a thread eats
up its benefits in terms of parallelization. The solution to this
problem is to use thread pools or to make the chunks of work larger. The
way I see it, you have reached the point where spawning a large
number of jobs causes noticeable overhead and is thus limited by the
scheduler's configuration. Maybe the admins can tune the configuration
to increase the maximum allowed number of jobs, but if the system is
already at its limit, I think you need to consider other options.
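One Slurm-native knob along the "larger chunks" line that might be worth a
try (untested suggestion on my part): array jobs accept a % throttle that
caps how many array tasks may run at once, which keeps the per-task scripts
simple while going easier on the scheduler:
#SBATCH --array=0-999%50
i.e., at most 50 of the 1000 array tasks are eligible to run concurrently.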
Greetings,
Eugen
On Mon, Feb 25, 2019 at 11:16 PM Philipp Berger
<berger(a)cs.rwth-aachen.de> wrote:
>
> Dear Eugen,
>
> while this would potentially solve our problem, we do *not* want to
> write our own scheduler!
> This is what SLURM should do. We are still a bit puzzled as to why our
> use case is so outlandish - our initial expectation was to find a
> matrix-job support in SLURM.
> Our Array-Job is already the result of us projecting our matrix job
> (solvers x problems x configurations) down into a single-column vector.
> Ideally, that would not be necessary. But okay, this we can deal with.
> This whole striping & scheduling business on the other hand... In my
> mind, "hiding" jobs (or rather, granularity) from the scheduler can only
> lead to problems -- and adds complexity to the user side which, again,
> can only lead to problems and sub-par performance.
>
> Kind regards,
> Philipp
>
> Am 25.02.2019 um 18:25 schrieb Eugen Beck:
> > Hi Gereon,
> >
> > if you worry about load balancing in scenario 1, what you could do is
> > use a central synchronization tool like a DB where submitted jobs can
> > fetch one task atomically and execute it. Once there are no more tasks
> > to fetch from the DB, the job ends. But I'm not sure what network
> > requests the cluster's firewall allows. And it would be more difficult
> > to set up.
> >
> > Greetings,
> > Eugen
> >
> > On Mon, Feb 25, 2019 at 6:14 PM Gereon Kremer
> > <gereon.kremer(a)cs.rwth-aachen.de> wrote:
> >> Hello,
> >>
> >> following the discussion at the end of today's workshop I tried how the
> >> scheduler behaves when issuing a larger amount of jobs (Marcus
> >> essentially told me I could use approach 3 as detailed below). To frame
> >> my question, here is what I want to do and how I try to do it (numbers
> >> just to get the magnitude):
> >>
> >> # Problem
> >> 10 Binaries, 10k input files. Run every binary on every input file, and
> >> collect all the results (= parse stdout).
> >>
> >> It seems array jobs are the tool for that, however the size of an array
> >> job is capped at 1000, apparently because larger jobs make the scheduler
> >> slow.
> >>
> >> # Approach 1
> >> - Create one file with 10*10k lines (./binary input-file)
> >> - Create one array job with 1000 tasks
> >> - Let ID be the id of the current array job
> >> - Identify the slice (10*10k) / 1000 * ID .. (10*10k) / 1000 * (ID + 1)
> >> - Execute all lines from the slice sequentially
> >> - Pro: Only one job, no scheduling hassle on the user side.
> >> - Con: weird script logic, 100 individual tasks in one scheduled array
> >> job, sometimes bad load balancing (i.e. one job takes way longer than
> >> the others)
> >>
> >> # Approach 2
> >> - Create (10*10k)/1000 files, each containing 1000 lines
> >> - Create as many jobs, one for each file
> >> - Load the ID'th line from the respective file and execute it
> >> - Push all these jobs to the scheduler
> >> - Pro: Easier logic in each script
> >> - Con: Multiple jobs; I have to take care of submitting and waiting for
> >> the results in parallel.
> >>
> >> # Approach 3
> >> - Create 10*10k jobs, let the scheduler deal with it
> >> - Every job executes one task (./binary input-file)
> >> - Pro: very simple jobs and scripts
> >> - Con: huge amount of jobs, can the scheduler handle that?
> >>
> >>
> >> I'm using approach 1 already and it works somewhat fine. That being said
> >> the script logic is rather involved and load balancing is not that
> >> great. I routinely have a handful of jobs at the end that run for 10
> >> minutes or so longer than all the others, even though a single task is
> >> capped at one minute. This is pretty annoying. Also, we are exploring what the
> >> best-practice should be here...
> >>
> >> I just tried approach 2 and it did not go too well, even for only about
> >> 12k tasks. To try the scaling I made every array job 100 in size, so I
> >> tried to schedule about 120 jobs.
> >> While it went well for about 75 jobs, sbatch started to come back with
> >> the following afterwards:
> >>
> >> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying
> >>
> >> and quickly afterwards:
> >>
> >> sbatch: error: Batch job submission failed: Resource temporarily unavailable
> >>
> >>
> >> I then tried to "relax" a bit and added a one second delay between the
> >> calls to sbatch... and it did not change anything.
> >> Thus I don't have a lot of hope for approach 3...
> >>
> >>
> >> Any comments or ideas?
> >>
> >> Best,
> >> Gereon
> >>
> >>
> >> --
> >> Gereon Kremer
> >> Lehr- und Forschungsgebiet Theorie Hybrider Systeme
> >> RWTH Aachen
> >> Tel: +49 241 80 21243
> >>
Hi Gereon,
if you worry about load balancing in scenario 1, what you could do is
use a central synchronization tool like a DB where submitted jobs can
fetch one task atomically and execute it. Once there are no more tasks
to fetch from the DB, the job ends. But I'm not sure what network
requests the cluster's firewall allows. And it would be more difficult
to set up.
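A very rough sketch of that idea without a real DB, using a lock file on the
shared filesystem instead (untested; it assumes all jobs see the same $WORK):
QUEUE=$WORK/tasks.txt      # one "./binary input-file" per line
CURSOR=$WORK/tasks.cursor  # number of the next line to execute
next_task() {
    # pop one line atomically, guarded by flock(1)
    (
        flock -x 9
        n=$(cat "$CURSOR" 2>/dev/null || echo 1)
        sed -n "${n}p" "$QUEUE"
        echo $((n + 1)) > "$CURSOR"
    ) 9> "$CURSOR.lock"
}
while task=$(next_task); [ -n "$task" ]; do
    eval "$task"
done
Every job runs this same loop and exits once the queue is empty, so the load
balancing comes for free.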
Greetings,
Eugen
On Mon, Feb 25, 2019 at 6:14 PM Gereon Kremer
<gereon.kremer(a)cs.rwth-aachen.de> wrote:
>
> Hello,
>
> following the discussion at the end of today's workshop I tried how the
> scheduler behaves when issuing a larger amount of jobs (Marcus
> essentially told me I could use approach 3 as detailed below). To frame
> my question, here is what I want to do and how I try to do it (numbers
> just to get the magnitude):
>
> # Problem
> 10 Binaries, 10k input files. Run every binary on every input file, and
> collect all the results (= parse stdout).
>
> It seems array jobs are the tool for that, however the size of an array
> job is capped at 1000, apparently because larger jobs make the scheduler
> slow.
>
> # Approach 1
> - Create one file with 10*10k lines (./binary input-file)
> - Create one array job with 1000 tasks
> - Let ID be the id of the current array job
> - Identify the slice (10*10k) / 1000 * ID .. (10*10k) / 1000 * (ID + 1)
> - Execute all lines from the slice sequentially
> - Pro: Only one job, no scheduling hassle on the user side.
> - Con: weird script logic, 100 individual tasks in one scheduled array
> job, sometimes bad load balancing (i.e. one job takes way longer than
> the others)
>
> # Approach 2
> - Create (10*10k)/1000 files, each containing 1000 lines
> - Create as many jobs, one for each file
> - Load the ID'th line from the respective file and execute it
> - Push all these jobs to the scheduler
> - Pro: Easier logic in each script
> - Con: Multiple jobs; I have to take care of submitting and waiting for
> the results in parallel.
>
> # Approach 3
> - Create 10*10k jobs, let the scheduler deal with it
> - Every job executes one task (./binary input-file)
> - Pro: very simple jobs and scripts
> - Con: huge amount of jobs, can the scheduler handle that?
>
>
> I'm using approach 1 already and it works somewhat fine. That being said
> the script logic is rather involved and load balancing is not that
> great. I routinely have a handful of jobs at the end that run for 10
> minutes or so longer than all the others, even though a single task is
> capped at one minute. This is pretty annoying. Also, we are exploring what the
> best-practice should be here...
>
> I just tried approach 2 and it did not go too well, even for only about
> 12k tasks. To try the scaling I made every array job 100 in size, so I
> tried to schedule about 120 jobs.
> While it went well for about 75 jobs, sbatch started to come back with
> the following afterwards:
>
> sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying
>
> and quickly afterwards:
>
> sbatch: error: Batch job submission failed: Resource temporarily unavailable
>
>
> I then tried to "relax" a bit and added a one second delay between the
> calls to sbatch... and it did not change anything.
> Thus I don't have a lot of hope for approach 3...
>
>
> Any comments or ideas?
>
> Best,
> Gereon
>
>
> --
> Gereon Kremer
> Lehr- und Forschungsgebiet Theorie Hybrider Systeme
> RWTH Aachen
> Tel: +49 241 80 21243