Hi Gereon,

I was able to modify our simulation controller to work with Slurm. It worked as expected.

Initially, the controller generates a list of all configured simulations. Then it calls sbatch for each of them, setting a different job name for each simulation and passing itself as the batch script. When the controller starts inside Slurm, it detects that and reads the job it has to do from the environment variables. Then it just runs as before, doing the different steps, collecting results and so on ...

Best
Johannes

On 2/15/19 3:53 PM, Gereon Kremer wrote:
Hi,
as far as I understand your scenario, it seems somewhat similar to what I have been working on... We essentially have a long list of commands (different binaries run with different arguments) that we need to run and collect the outputs of. Our main restriction is that the array jobs only allow for 1000 jobs.
What we do is the following:
- Create a file of all the commands, one command per line
- Create an array job that executes all commands in slices
- Collect the results from the outputs
Our batch file roughly looks like this:
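    min=$SLURM_ARRAY_TASK_MIN
    max=$SLURM_ARRAY_TASK_MAX
    cur=$SLURM_ARRAY_TASK_ID
    # Read from stdin so that wc prints only the count, not the file name
    tasks=$(wc -l < joblist)
    jobcount=$(( max - min + 1 ))
    # Commands per array task, rounded up
    slicesize=$(( (tasks + jobcount - 1) / jobcount ))
    # First and last joblist line (1-based) handled by this array task
    start=$(( (cur - min) * slicesize + 1 ))
    end=$(( start + slicesize - 1 ))
    for i in $(seq "${start}" "${end}"); do
        # Fetch line i of joblist; empty if i is past the end of the file
        cmd=$(sed -n "${i}p" < joblist)
        echo "Executing $cmd"
        echo "# START ${i} #"
        # No core dumps, soft CPU-time limit of 120 seconds
        ulimit -c 0 && ulimit -S -t 120 && $cmd ; rc=$?
        echo "# END ${i} #"
    done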
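To make the slicing arithmetic concrete, here is a minimal sketch that computes the line range for one array task. It assumes ceiling division for the slice size and clamps the last slice to the length of the job list; the function name slice_bounds and the numbers (2500 commands, array indices 1 to 1000) are made up for illustration.

```shell
#!/bin/sh
# slice_bounds TASKS MIN MAX CUR
# Prints "START END": the 1-based joblist line range for array task CUR,
# where MIN..MAX are the array indices and TASKS is the number of commands.
slice_bounds() {
    tasks=$1; min=$2; max=$3; cur=$4
    jobcount=$(( max - min + 1 ))
    # Ceiling division, so that jobcount slices cover all tasks
    slicesize=$(( (tasks + jobcount - 1) / jobcount ))
    start=$(( (cur - min) * slicesize + 1 ))
    end=$(( start + slicesize - 1 ))
    # The last slice may be shorter than the others
    if [ "$end" -gt "$tasks" ]; then end=$tasks; fi
    echo "$start $end"
}

# Hypothetical numbers: 2500 commands, array indices 1..1000
slice_bounds 2500 1 1000 1      # prints "1 3"
slice_bounds 2500 1 1000 834    # prints "2500 2500"
```

With these numbers each task gets three commands, so only the first 834 array tasks have work to do; later tasks get an empty range (start greater than end).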
Note that time limits must be implemented manually (here via ulimit). We then submit this file with --wait.
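Once the submission returns, the START/END markers in the job output can be used to pull out the output of a single command. The following is only a sketch of that collection step: the function name extract_output and the sample log content are hypothetical, and it assumes a command never prints a marker line itself.

```shell
#!/bin/sh
# extract_output INDEX LOGFILE
# Prints the lines between "# START INDEX #" and "# END INDEX #".
extract_output() {
    awk -v s="# START $1 #" -v e="# END $1 #" '
        $0 == e { inside = 0 }   # stop before printing the END marker
        inside  { print }
        $0 == s { inside = 1 }   # start after the START marker
    ' "$2"
}

# Hypothetical log, shaped like the output of the batch file
log=$(mktemp)
printf '%s\n' \
    'Executing ./solver a.smt2' '# START 1 #' 'sat'   '# END 1 #' \
    'Executing ./solver b.smt2' '# START 2 #' 'unsat' '# END 2 #' > "$log"
extract_output 2 "$log"    # prints "unsat"
rm -f "$log"
```

The markers are compared as whole lines, so index 1 does not accidentally match index 10.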
Does this help? (and cover your use case?)
Best,
Gereon
On 2/15/19 3:45 PM, Johannes Sauer wrote:
I looked further. I think srun cannot be used like this, as I believe it also blocks when used inside sbatch.
But I think I can just let the simulation controller run sbatch directly for each simulation and use only a single batch script, which calls the controller again. Mapping to the correct simulation is then done via the job name or other parameters. This is also very similar to how it currently works for LSF.
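A rough sketch of that idea, not the actual controller: the script submits itself once per simulation and, when it finds itself inside a Slurm job, runs the simulation named by the job name. The names list_simulations, run_simulation, submit and DRY_RUN are hypothetical placeholders, and using SLURM_JOB_ID to detect a batch job is an assumption (it is also set in interactive allocations). The sketch defaults to a dry run that only prints the sbatch calls.

```shell
#!/bin/sh
# Sketch of a self-submitting controller. All names here are hypothetical.

DRY_RUN=${DRY_RUN:-1}   # 1: only print the sbatch calls; 0: really submit

list_simulations() {    # placeholder for the configured simulations
    printf '%s\n' sim_a sim_b sim_c
}

run_simulation() {      # placeholder for the actual simulation steps
    echo "running simulation $1"
}

submit() {
    if [ "$DRY_RUN" = 1 ]; then
        echo sbatch "$@"
    else
        # stdin from /dev/null so sbatch cannot consume the loop's input
        sbatch "$@" < /dev/null
    fi
}

controller() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Inside a batch job: the job name set at submission selects the work
        run_simulation "$SLURM_JOB_NAME"
    else
        # Outside Slurm: one sbatch call per simulation, with this script
        # as the batch script and the simulation encoded in the job name
        list_simulations | while read -r sim; do
            submit --job-name="$sim" "$0"
        done
    fi
}

controller
```

Run outside Slurm, this prints one sbatch line per simulation; setting DRY_RUN=0 would submit them for real.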
On 2/15/19 2:54 PM, Johannes Sauer wrote:
Hi,
for our simulations we have a simulation manager. For LSF this used to issue a bsub command for each simulation. We're not using array jobs for this as it has some more requirements.
I cannot simply replace bsub with srun; I need to use sbatch.
I believe this should work: Run the simulation manager with sbatch, then it should be able to do srun for the different simulations.
Will this work?
Best
Johannes
_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de

--
M.Sc. Johannes Sauer
Researcher
Institut fuer Nachrichtentechnik
RWTH Aachen University
Melatener Str. 23
52074 Aachen
Tel +49 241 80-27678
Fax +49 241 80-22196
sauer@ient.rwth-aachen.de
http://www.ient.rwth-aachen.de