Submission rejections when using ntasks-per-node
Dear users,

thanks to several reports, we have discovered a problem when submitting multi-node jobs that request more than 24 tasks per node. In general, a resource request like the following should work perfectly fine:

(...)
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=48
(...)

In theory, this would allow you to make full use of 5 nodes. Currently, however, sbatch rejects such job scripts, claiming that no hosts are suitable for dispatch. Despite this, the following request

(...)
#SBATCH --ntasks=240
(...)

works as intended while being semantically equivalent in this scenario. We are not sure exactly what is causing this problem, but we suspect a bug in Slurm, possibly in conjunction with the Skylake-SP CPUs. If you are affected, we recommend using --ntasks alone for the time being. We will update the documentation accordingly so that you can base your job scripts on correct templates. The problem has been relayed to the developers; we will have to wait for their assessment.

Please excuse any inconvenience.

Best,
Sven

--
Sven Hansen
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen (Germany)
Tel.: +49 241 80-29114
s.hansen@itc.rwth-aachen.de
www.itc.rwth-aachen.de
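For reference, a complete job script built on the --ntasks workaround might look like the following sketch; the job name, time limit, and executable are placeholders, not site defaults:

    #!/bin/bash
    ### Sketch of the --ntasks workaround: request the total task count
    ### instead of tasks per node. Job name, time limit, and the program
    ### name below are placeholders.
    #SBATCH --job-name=mpi_example
    #SBATCH --ntasks=240          # 5 nodes x 48 tasks, requested as a total
    #SBATCH --time=01:00:00

    ### srun inherits the allocation and starts one process per task
    srun ./my_mpi_program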
Dear users,

due to an inquiry I realized that I left out an important detail in my last mail. Using --ntasks on its own will not necessarily ensure that your job is spread across exclusive nodes. To fix this, you can flag the job as exclusive:

#SBATCH --exclusive

Similarly, you can pair the --nodes argument with --ntasks to control the placement: adding --nodes=5 to the request from my last mail forces Slurm to pack the 240 tasks onto 5 exclusive hosts, as shown in the sketch below.

Best,
Sven
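Concretely, a request along these lines (again a sketch; the program name is a placeholder) would pin 240 tasks to exactly 5 nodes that are not shared with other jobs:

    #!/bin/bash
    ### Sketch: pair --nodes with --ntasks and flag the job as exclusive.
    ### The program name is a placeholder.
    #SBATCH --nodes=5             # exactly 5 hosts ...
    #SBATCH --ntasks=240          # ... for 240 tasks in total (48 per node)
    #SBATCH --exclusive           # do not share these nodes with other jobs

    srun ./my_mpi_program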