Dear all,
from time to time I get errors similar to this one when submitting jobs:
RuntimeError: Execution of 'sbatch -t 3600 --mem-per-cpu=10G --account=rwth0333 --job-name ChairliftRide_8192x4096_QP22_FTBE0to32 -o log/ChairliftRide_8192x4096_QP22_FTBE0to32.queue_out.log -e log/ChairliftRide_8192x4096_QP22_FTBE0to32.queue_out.log rz_start_anysim.sh' exited with status != 0 (1): sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Is anyone else having this problem? Submitting the same job again works fine. Could it be that the controller cannot handle the load of submissions?
Best
Johannes
Hi Johannes,
yes, we are also seeing these messages in the log files, but we have not been able to solve the issue so far.
Best
Marcus
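Since resubmitting the same job reportedly works, a retry wrapper around the sbatch call might serve as a stopgap until the underlying issue is solved. The following is only a minimal sketch, not a tested fix: the function name submit_with_retry, the number of attempts, and the delays are assumptions for illustration, and it retries only when stderr contains the timeout message quoted in the error above. One caveat I am not certain about: a submission that timed out on the client side may still have been accepted by the controller, so a blind retry could queue the same job twice. Checking squeue for the job name before retrying would be the safer variant.

```python
import subprocess
import time

# Error text taken from the sbatch message reported in this thread.
TRANSIENT_MARKER = "Socket timed out on send/recv operation"


def submit_with_retry(cmd, max_attempts=5, base_delay=5.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run `cmd` and retry with exponential backoff on the transient timeout.

    `cmd` is the argument list, e.g. ["sbatch", "-t", "3600", "job.sh"].
    `run` and `sleep` are injectable so the logic can be tested without Slurm.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            # On success sbatch prints a line such as "Submitted batch job <id>".
            return result.stdout.strip()
        transient = TRANSIENT_MARKER in result.stderr
        if not transient or attempt == max_attempts:
            # Any other error, or the last failed attempt, is passed on.
            raise RuntimeError(
                f"Execution of {cmd!r} exited with status "
                f"{result.returncode}: {result.stderr.strip()}"
            )
        # Wait 5 s, 10 s, 20 s, ... to avoid adding load to the controller.
        sleep(base_delay * 2 ** (attempt - 1))
```

A call could then look like submit_with_retry(["sbatch", "-t", "3600", "rz_start_anysim.sh"]). The backoff is deliberate: if the controller really is overloaded, immediate retries from many users would likely make things worse.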
On 5/3/19 10:49 AM, Johannes Sauer wrote:
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de