-------- Forwarded Message --------
Hi,
Our simulation aborts on CLAIX-2018 when using either the intelmpi/2019 or the intelmpi/2018 module. The new OpenMPI module works for us.
The master rank waits for data from the worker ranks in a busy loop:
int flag = 0;
MPI_Status stat;

while (!flag) {
    /* write a checkpoint from time to time while polling */
    if (is_chkpt_time()) M_write();
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
               &flag, &stat);
}
The problem occurs quite reproducibly ~9 minutes after the start of the batch job, even with only 2 MPI tasks (one master and one worker process). During the first 9 minutes the simulation works flawlessly; all calls use the MPI_COMM_WORLD communicator and succeed up to that point. Then the simulation aborts:
Abort(635909) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe:
Invalid communicator, error stack:
PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, flag=0x7ffe77c98c30, status=0x7ffe77c98c1c) failed
PMPI_Iprobe(90).: Invalid communicator
Does anyone here use intelmpi/2018 or intelmpi/2019 successfully on the CLAIX-2018 cluster? Our MPI code may contain a few mistakes that trip up intelmpi but not openmpi. If simulations longer than 10 minutes are possible for everyone else, we will look into finding and fixing the problem in our own code.
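
In case anyone wants to check whether this is specific to our code: below is a minimal, self-contained sketch of the same polling pattern (not our actual simulation; the 600-second sleep merely stands in for the worker's computation). Compiled with mpicc and run with 2 tasks under the intelmpi module, it should show whether MPI_Iprobe survives past the ~9-minute mark.

/* Minimal sketch, not our production code: rank 0 busy-polls with
 * MPI_Iprobe while rank 1 sends a single message after ~10 minutes. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, flag = 0, payload = 42;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* busy-wait on MPI_Iprobe, as in the simulation's master loop */
        while (!flag) {
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &stat);
        }
        MPI_Recv(&payload, 1, MPI_INT, stat.MPI_SOURCE, stat.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d after polling\n", payload);
    } else {
        sleep(600);  /* placeholder for ~10 minutes of worker computation */
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

If this sketch also aborts with the invalid-communicator error after roughly nine minutes, the problem is presumably not in our simulation code.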
Another issue: every time intelmpi is run in CLAIX-2018 batch mode, it prints this warning:
MPI startup(): I_MPI_JOB_STARTUP_TIMEOUT environment variable is not supported.
MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
Best,
Jonas
On 1/17/19 10:04 AM, Marcus Wagner wrote:
Dear all,
we canceled the current maintenance, but scheduled a new one for tomorrow at 9 o'clock. So all jobs that finish before that time will run now.
By the way, this list was not intended as an announcement list from our side. Weren't there any problems? Is everything clear regarding SLURM and CLAIX18 for you?
If that is the case, we are really happy, but I can barely believe it.
Best
Marcus