Hello,

following the discussion at the end of today's workshop I tried out how the scheduler behaves when issuing a larger number of jobs (Marcus essentially told me I could use approach 3 as detailed below). To frame my question, here is what I want to do and how I try to do it (numbers are just to give the magnitude):

# Problem

10 binaries, 10k input files. Run every binary on every input file and collect all the results (= parse stdout). It seems array jobs are the tool for that; however, the size of an array job is capped at 1000, apparently because larger jobs make the scheduler slow.

# Approach 1

- Create one file with 10*10k lines (./binary input-file)
- Create one array job with 1000 tasks
- Let ID be the index of the current array task
- Identify the slice (10*10k)/1000 * ID .. (10*10k)/1000 * (ID + 1)
- Execute all lines from the slice sequentially
- Pro: only one job, no scheduling hassle on the user side.
- Con: weird script logic, 100 individual tasks squeezed into each scheduled array task, sometimes bad load balancing (i.e. one task takes way longer than the others)

# Approach 2

- Create (10*10k)/1000 files, each containing 1000 lines
- Create as many array jobs, one for each file
- Load the ID'th line from the respective file and execute it
- Push all these jobs to the scheduler
- Pro: easier logic in each script
- Con: multiple jobs, so I have to take care of submitting them and waiting for the results in parallel.

# Approach 3

- Create 10*10k jobs and let the scheduler deal with it
- Every job executes one task (./binary input-file)
- Pro: very simple jobs and scripts
- Con: huge number of jobs; can the scheduler handle that?

I'm using approach 1 already and it works somewhat fine. That being said, the script logic is rather involved (I have put a rough sketch at the end of this mail) and the load balancing is not that great: I routinely have a handful of tasks at the end that run 10 minutes or so longer than all the others, even though a single task is capped at one minute. This is pretty annoying. Also, we are exploring what the best practice should be here...

I just tried approach 2 and it did not go too well, even for only about 12k tasks. To test the scaling I made every array job 100 in size, so I tried to schedule about 120 jobs. While it went well for about 75 jobs, sbatch started to come back with the following afterwards:

sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying

and quickly afterwards:

sbatch: error: Batch job submission failed: Resource temporarily unavailable

I then tried to "relax" a bit and added a one-second delay between the calls to sbatch... and it does not change anything. Thus I don't have a lot of hope for approach 3...

Any comments or ideas?

Best,
Gereon

--
Gereon Kremer
Lehr- und Forschungsgebiet Theorie Hybrider Systeme
RWTH Aachen
Tel: +49 241 80 21243
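P.S. In case it helps the discussion, here is a minimal sketch of what the approach-1 array script does. The file name tasks.txt is a placeholder and I have left out the rest of the #SBATCH header; the slicing follows the logic described under approach 1.

    #!/bin/bash
    #SBATCH --array=0-999
    # (other #SBATCH options omitted)

    # tasks.txt holds all 10*10k lines of the form "./binary input-file"
    TOTAL=$(wc -l < tasks.txt)
    PER_TASK=$(( TOTAL / 1000 ))

    # slice handled by this array task (sed line numbers are 1-based)
    START=$(( SLURM_ARRAY_TASK_ID * PER_TASK + 1 ))
    END=$(( START + PER_TASK - 1 ))

    # run the lines of this slice sequentially; stdin is redirected
    # so a task cannot accidentally swallow the rest of the list
    sed -n "${START},${END}p" tasks.txt | while read -r cmd; do
        eval "$cmd" < /dev/null
    done

And this is roughly how I submitted the jobs for approach 2, i.e. the throttled loop that still ran into the errors quoted above (chunks/ and run_chunk.sh are again made-up names):

    # one array job per chunk file, with a one-second pause between submissions
    for f in chunks/*; do
        sbatch run_chunk.sh "$f"
        sleep 1
    done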