Hi,

On 01/28/2019 10:29 AM, Sebastian Achilles wrote:
I have looked into the network performance on CLAIX18. I have measured latency and bandwidth for intra- and inter-node communication, using the Intel IMB PingPong benchmark compiled with the modules intel/19.0 and intelmpi/2019.
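The measurement itself is conceptually simple. For illustration, a minimal ping-pong sketch in Python with mpi4py (my assumption for the example - IMB itself is a C code and does not use mpi4py) would be:

# pingpong.py - minimal ping-pong sketch (assumes mpi4py + numpy)
# run with: mpirun -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000

for nbytes in (1, 1024, 1048576, 4194304):
    buf = np.zeros(nbytes, dtype='b')
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    # IMB-style: t[usec] is half the round-trip time per message
    t = (MPI.Wtime() - t0) / (2 * reps)
    if rank == 0:
        print(f"{nbytes:8d} bytes  {t*1e6:8.2f} usec  {nbytes/t/1e6:10.2f} Mbytes/sec")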
Intel MPI 2019 has been described by Intel itself as 'experimental' and is known to show inconvenient behaviour in some cases. That is why we switched back to intelmpi/2018 after a few days. We recommend using v2018 and, if in doubt, comparing results to Open MPI and/or v2017. (v2018 is also known to have some issues by now.) However, short tests with IMB-MPI1 PingPong tell us that v2019 does *not* seem to have a known bandwidth/performance issue compared to older versions.
To get sufficient statistics I have submitted 64 jobs on 1 node using 2 tasks, and 64 jobs on 2 nodes using 1 task each. The scheduler has started the jobs on different sets of nodes. I have attached the results, showing the configuration and the average, minimum and maximum of the measurements.
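For the evaluation I aggregate the large-message line from each job's output. A sketch of such a post-processing step (the job_*.out naming and the 4194304-byte message size are assumptions for the example, depending on the submit script):

# collect_stats.py - average/min/max of IMB PingPong bandwidth over many runs
# (the job_*.out naming and the 4 MiB message size are assumptions)
import glob
import statistics

bandwidths = []
for path in glob.glob("job_*.out"):
    with open(path) as f:
        for line in f:
            cols = line.split()
            # IMB data rows have the layout: #bytes #repetitions t[usec] Mbytes/sec
            if len(cols) == 4 and cols[0] == "4194304":
                bandwidths.append(float(cols[3]))

print(f"n={len(bandwidths)}  avg={statistics.mean(bandwidths):.0f}  "
      f"min={min(bandwidths):.0f}  max={max(bandwidths):.0f}  Mbytes/sec")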
Let's first look at the inter-node communication: I have measured an average latency of 2.12 usec; the best case was 1 usec and the maximum 7.1 usec. The bandwidth is 6488 Mbytes/sec on average, with a maximum of 11995 Mbytes/sec and a minimum of 2483 Mbytes/sec.
You can expect a stable 11.5-12.0 GB/s bandwidth and a latency on the order of 1-2 usec on the Omni-Path network *when you run on nodes on the same leaf switch*, and much worse (quite varying, with an average of about 6 GB/s bandwidth) when running on nodes connected via the thin (upper) switches. Take a look at the attached picture; this is very characteristic of an OPA network with the topology used (CLAIX2016 uses the same network technology and topology). My assumption is that your multinode ping-pong tests ran either on non-exclusive nodes (???) or (sometimes) on nodes not on the same leaf switch.
The latency for intra-node communication looks okay; however, the bandwidth shows variation.
On average these results don't correspond to the values advertised by Intel. Either I have done something wrong, or I haven't understood the topology, or there is a problem with the machine.
Have you run such a benchmark as well? Can you observe something similar?
Just try out:

pk224850@login18-1:~[506]$ export I_MPI_FABRICS=shm:tmi
pk224850@login18-1:~[516]$ $MPIEXEC -m 1 -H login18-1,login18-2 IMB-MPI1 PingPong
...

You should get results on the order of almost 12 GB/s - see below. (Do not repeat this test too often, as it loads the front ends.) The change to I_MPI_FABRICS is needed ONLY IN INTERACTIVE TESTS: by the current default we fall back to DAPL in order to support the (old, Bull) MPI back ends still available now - this will be changed soon.

DO NOT CHANGE THIS ENVVAR BY DEFAULT.
DO NOT CHANGE THIS ENVVAR IN BATCH UNLESS ORDERED.

... as this variable controls your MPI transport, with overwhelming consequences for stability and performance.

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.12         0.00
            1         1000         1.12         0.89
            2         1000         1.11         1.79
            4         1000         1.11         3.60
            8         1000         1.11         7.20
           16         1000         1.39        11.53
           32         1000         1.39        23.03
           64         1000         1.39        46.01
          128         1000         1.38        93.02
          256         1000         1.38       185.38
          512         1000         1.44       356.16
         1024         1000         1.61       637.80
         2048         1000         1.86      1100.50
         4096         1000         2.38      1719.96
         8192         1000         3.48      2353.33
        16384         1000         7.43      2204.21
        32768         1000         9.83      3332.78
        65536          640        18.40      3562.04
       131072          320        23.82      5503.58
       262144          160        33.89      7735.72
       524288           80        53.71      9760.96
      1048576           40        96.31     10887.26
      2097152           20       184.08     11392.80
      4194304           10       352.39     11902.29
@Marcus: To get a better understanding of the machine, could you please share a bit more information on the network topology:
- How many levels does the tree have?
- On which level is the tree pruned?
- Could you send me the connectivity file / connection map, e.g. a list of cables connecting the nodes, edge and core switches? I would like to add the hop count information to my results. (I have a script for computing the hop count from a connection map; depending on the format, I just need to adjust the reading routine - see the sketch below.)
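For illustration, the core of such a hop-count computation is just a breadth-first search over the cable list. A sketch, under the assumption of a simple "endpointA endpointB" per-line format (the real reading routine would be adapted to whatever format you send):

# hopcount.py - hop count between two endpoints via breadth-first search
# usage: python hopcount.py connection_map.txt nodeA nodeB
# (assumed input format: one cable per line, "endpointA endpointB")
from collections import defaultdict, deque
import sys

links = defaultdict(set)
with open(sys.argv[1]) as f:
    for line in f:
        a, b = line.split()[:2]
        links[a].add(b)
        links[b].add(a)

def hops(src, dst):
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return -1  # not connected

print(hops(sys.argv[2], sys.argv[3]))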
Cheers, Sebastian
_______________________________________________
claix18-slurm-pilot mailing list -- claix18-slurm-pilot@lists.rwth-aachen.de
To unsubscribe send an email to claix18-slurm-pilot-leave@lists.rwth-aachen.de
--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, IT Center
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915