Hi, I've lately noticed some of my jobs failing (timing out) with:
srun: Job 1692770 step creation temporarily disabled, retrying
srun: error: Unable to create step for job 1692770: Unable to contact slurm controller (connect failure)
Any ideas what could be going wrong? I've been running similar jobs for a long time, and this type of failure seems quite recent.
Best regards,
Pavel
claix18-slurm-pilot@lists.rwth-aachen.de