-------- Forwarded Message --------
Subject: IntelMPI/2019 problems with our code
Date: Fri, 18 Jan 2019 11:44:57 +0100
From: Jonas Becker <jonas.becker2@rwth-aachen.de>
To: Marcus Wagner <wagner@itc.rwth-aachen.de>

Hi,

our simulation aborts on CLAIX-2018 when using the intelmpi/2019 or intelmpi/2018 module. The new OpenMPI module works for us.

The master rank waits for data from the worker ranks in a busy loop:

while (!flag) {
    if (is_chkpt_time())
        M_write();
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &stat);
}

The problem occurs reproducibly about nine minutes after the batch job starts, even with only two MPI tasks (one master and one worker process). During the first nine minutes the simulation works flawlessly; all calls use the MPI_COMM_WORLD communicator and succeed up to that point. Then the simulation aborts:

Abort(635909) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe: Invalid communicator, error stack:
PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, flag=0x7ffe77c98c30, status=0x7ffe77c98c1c) failed
PMPI_Iprobe(90).: Invalid communicator

Does anyone here successfully use intelmpi/2018 or intelmpi/2019 on the CLAIX-2018 cluster? Our MPI code may contain a few mistakes that trip up Intel MPI but not Open MPI. If simulations longer than ten minutes work for everyone else, we will look for the problem in our own code and fix it.

Another issue: every time Intel MPI runs in a CLAIX-2018 batch job, it prints this warning:

MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.

Best,
Jonas

On 1/17/19 10:04 AM, Marcus Wagner wrote:
Dear all,
we have canceled the current maintenance and scheduled a new one for tomorrow at 9:00. All jobs that finish before then will run now.
By the way: this list was not intended as a mere announcement list on our side. Have there really been no problems? Is everything clear regarding SLURM and CLAIX18 for you?
If that is the case, we are very happy, but I can hardly believe it.
Best,
Marcus
Hi Paul,

can you say something about this?

Best,
Marcus

On 1/18/19 11:54 AM, Marcus Wagner wrote:
-------- Forwarded Message --------
Subject: IntelMPI/2019 problems with our code
Date: Fri, 18 Jan 2019 11:44:57 +0100
From: Jonas Becker <jonas.becker2@rwth-aachen.de>
To: Marcus Wagner <wagner@itc.rwth-aachen.de>
Hi,
our simulation aborts on CLAIX-2018 when using the intelmpi/2019 or intelmpi/2018 module. The new OpenMPI module works for us.
The master rank waits for data from the worker ranks in a busy loop:
while (!flag) {
    if (is_chkpt_time())
        M_write();
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &stat);
}
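For reference, here is a minimal, self-contained sketch of the same pattern, in case anyone wants to try it on CLAIX-2018: our is_chkpt_time()/M_write() are replaced by trivial stand-ins, and the worker is reduced to a once-per-second heartbeat.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for our checkpoint hooks, so the sketch is self-contained. */
static int is_chkpt_time(void) { return 0; }
static void M_write(void)      { /* checkpoint omitted */ }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Master: same busy loop as in our real code. */
        for (;;) {
            int flag = 0;
            MPI_Status stat;
            while (!flag) {
                if (is_chkpt_time())
                    M_write();
                MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                           &flag, &stat);
            }
            int payload;
            MPI_Recv(&payload, 1, MPI_INT, stat.MPI_SOURCE, stat.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("master: heartbeat %d from rank %d\n",
                   payload, stat.MPI_SOURCE);
            fflush(stdout);
        }
    } else {
        /* Worker: send a heartbeat once per second, forever. */
        for (int i = 0; ; i++) {
            MPI_Send(&i, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            sleep(1);
        }
    }
    /* Not reached; the job runs until it aborts or is killed. */
    MPI_Finalize();
    return 0;
}

Built with mpicc and started with two ranks, this should exercise exactly the loop above; whether it also shows the behavior described below is what we would like to find out.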
The problem occurs reproducibly about nine minutes after the batch job starts, even with only two MPI tasks (one master and one worker process). During the first nine minutes the simulation works flawlessly; all calls use the MPI_COMM_WORLD communicator and succeed up to that point. Then the simulation aborts:
Abort(635909) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe: Invalid communicator, error stack:
PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, flag=0x7ffe77c98c30, status=0x7ffe77c98c1c) failed
PMPI_Iprobe(90).: Invalid communicator
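One way to get more context than this hard abort would be the standard MPI error-handler mechanism: with MPI_ERRORS_RETURN, the failing call returns an error code that can be printed instead of killing the job. A sketch of the wait loop with explicit error checking (standard MPI API, nothing Intel-specific):

#include <mpi.h>
#include <stdio.h>

/* Wait loop with explicit error checking: MPI_ERRORS_RETURN makes a
   failing call return an error code instead of aborting the job. */
static void wait_for_message(MPI_Status *stat)
{
    int flag = 0, rc;
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    while (!flag) {
        rc = MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                        &flag, stat);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI_Iprobe failed: %s\n", msg);
            break;
        }
    }
}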
Does anyone here successfully use intelmpi/2018 or intelmpi/2019 on the CLAIX-2018 cluster?
Our MPI code may contain a few mistakes that trip up Intel MPI but not Open MPI. If simulations longer than ten minutes work for everyone else, we will look for the problem in our own code and fix it.
Another issue: every time Intel MPI runs in a CLAIX-2018 batch job, it prints this warning:
MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
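Our guess is that the batch environment (or one of the modules) still exports I_MPI_JOB_STARTUP_TIMEOUT, which this Intel MPI release no longer understands. Assuming the warning is printed during MPI_Init(), clearing the variable beforehand should silence it; unsetting it in the job script before mpiexec should work just as well. A sketch:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Assumption: the batch environment exports this variable and the
       warning is emitted during MPI_Init(); clear it beforehand. */
    unsetenv("I_MPI_JOB_STARTUP_TIMEOUT");
    MPI_Init(&argc, &argv);
    /* ... application code ... */
    MPI_Finalize();
    return 0;
}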
Best,
Jonas
On 1/17/19 10:04 AM, Marcus Wagner wrote:
Dear all,
we have canceled the current maintenance and scheduled a new one for tomorrow at 9:00. All jobs that finish before then will run now.
By the way: this list was not intended as a mere announcement list on our side. Have there really been no problems? Is everything clear regarding SLURM and CLAIX18 for you?
If that is the case, we are very happy, but I can hardly believe it.
Best,
Marcus
_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de