Hi Paul,

Can you comment on this?


Best
Marcus

On 1/18/19 11:54 AM, Marcus Wagner wrote:



-------- Forwarded Message --------
Subject: IntelMPI/2019 problems with our code
Date: Fri, 18 Jan 2019 11:44:57 +0100
From: Jonas Becker <jonas.becker2@rwth-aachen.de>
To: Marcus Wagner <wagner@itc.rwth-aachen.de>


Hi,

Our simulation aborts on CLAIX-2018 when using the intelmpi/2019 or
intelmpi/2018 module. The new OpenMPI module works for us.

The master rank waits for data from the worker ranks in a busy loop:

        while (!flag) {
            if (is_chkpt_time()) M_write();
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &stat);
        }

The problem occurs fairly reproducibly about 9 minutes after the start of
the batch job, even with only 2 MPI tasks (one master and one worker
process). During the first 9 minutes, the simulation works flawlessly.
All calls use the MPI_COMM_WORLD communicator and succeed up to this
point. Then the simulation aborts:

Abort(635909) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe:
Invalid communicator, error stack:
PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
MPI_COMM_WORLD, flag=0x7ffe77c98c30, status=0x7ffe77c98c1c) failed
PMPI_Iprobe(90).: Invalid communicator

Does anyone here successfully use intelmpi/2018 or intelmpi/2019 on the
CLAIX-2018 cluster?

Our MPI code may contain a few mistakes that throw off intelmpi but not
openmpi. If simulations longer than 10 minutes work for everyone else, we
will look for the bug in our own code and fix it there.


Separately, every time intelmpi is run in batch mode on CLAIX-2018, it
prints this warning:

MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT environment variable is not
supported.
MPI startup(): To check the list of supported variables, use the
impi_info utility or refer to
https://software.intel.com/en-us/mpi-library/documentation/get-started.

Best,

Jonas

On 1/17/19 10:04 AM, Marcus Wagner wrote:
Dear all,


We canceled the current maintenance and scheduled a new one for tomorrow at
9 o'clock. All jobs that finish before that time will therefore run now.


By the way,

this list was not intended as an announcement list from our side.
Were there really no problems? Is everything clear to you regarding SLURM
and CLAIX18?

If that is the case, we are really happy, but I can hardly believe it.



Best
Marcus




_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de