Hi Sebastian,

On 1/28/19 9:09 AM, Sebastian Achilles wrote:

Hi Marcus,

thank you very much for your answer!

Do the nrm nodes have a different configuration compared to the ncm nodes? I am still wondering why my job sometimes fails when I just submit the same job multiple times. Most of the jobs that failed ran on the nrm nodes (I am getting a `Bus error`).

No, the nrm nodes have the very same configuration. It seems as if they had not been installed completely and tested thoroughly before; that is why I took them out of service for the moment. Yet I'm a bit puzzled regarding the 'Bus error'. Do you have a few more details for me, e.g. the job ID, or where and when the job ran?

I am not specifying the `#SBATCH --mem` option, since I assume that I will get the whole memory of that node. Is this correct? This is how I am used to using SLURM on JURECA and JUWELS, and this is how I understood the documentation:
"NOTE: A memory size specification of zero is treated as a special case and grants the job access to all of the memory on each node. If the job is allocated multiple nodes in a heterogeneous cluster, the memory limit on each node will be that of the node in the allocation with the smallest memory size (same limit will apply to every node in the job's allocation)."

We try to prohibit "--mem" as an option, as we would like users to ask for memory per task. So, yes, please do not use --mem. We might consider allowing "--mem=0", but we are not sure yet and will have to discuss this internally.
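
For reference, a per-task request could look roughly like this (just a sketch; the shebang, task count and memory value are only illustrative, not CLAIX18 defaults):

#!/usr/bin/env zsh
#SBATCH --ntasks=48
#SBATCH --mem-per-cpu=3800M   # memory per allocated CPU, instead of a per-node --mem
#SBATCH --time=01:00:00
srun ./a.out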

Are nodes in SLURM on CLAIX18 scheduled exclusively? So when I request a certain number of nodes, is it ensured that I am the only user running on these nodes?

All JARA jobs are scheduled exclusively. We will schedule exclusively if you need more than one node (as is currently done on LSF), but this is not active yet.
We are discussing using exclusive=user as the default, so that only jobs of the same user can share a node. This can be overridden with #SBATCH --exclusive.
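
For completeness, requesting nodes exclusively from a job script is the standard Slurm option (a sketch; the per-node task count of 48 just assumes the 48-core CLAIX18 nodes):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --exclusive          # do not share the allocated nodes with other jobs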

Have you implemented any kind of default CPU binding or pinning? Or does the user have to specify this in their job scripts?

Not yet, as we did not want to disturb the benchmarks by NEC on CLAIX18. As the GPU nodes are still being benchmarked, and this is a cluster-wide option, we will not activate anything at the moment. So you will have to do this in your script.
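
Until then, a minimal sketch of doing the pinning yourself in the job script (treat the binding choice and the Intel MPI variables as examples, not as recommended settings for CLAIX18):

# bind each task to its cores when launching the step with srun
srun --cpu-bind=cores ./a.out

# or, when launching with Intel MPI directly, let Intel MPI do the pinning
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core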


Best
Marcus

Best,
Sebastian


On 24.01.19 14:23, Marcus Wagner wrote:
Hi Sebastian,

On 1/24/19 1:59 PM, Sebastian Achilles wrote:

Hi,

a few questions/points from my side:

- Claix 18 is using Intel OPA. What network topology is used? I guess it is a Fat tree. Is it blocking or non-blocking? Is it 1:2 blocking as on Claix 16?

To make it short: Fat Tree, right, blocking, yes.

- Have you configured topology-aware resource allocation within the SLURM scheduler? In other words, does the scheduler know the topology and try to minimize the hop count?

Since we are still in the acceptance phase, this is not the case, but it will be in the future.
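
(For context: in Slurm this is usually done with the tree topology plugin; a rough sketch of such a configuration, with purely hypothetical switch and node names:)

# slurm.conf
TopologyPlugin=topology/tree

# topology.conf -- describe which nodes hang off which OPA switch
SwitchName=leaf1 Nodes=ncm[0001-0048]
SwitchName=leaf2 Nodes=ncm[0049-0096]
SwitchName=spine Switches=leaf[1-2]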

- I assume TurboBoost is enabled by default? Is it possible (or will it be possible in the future) to include an option to switch TurboBoost off? E.g. on JURECA it is possible to disable TurboBoost with `#SBATCH --disable-turbomode` for measurements.
Alternatively, is it possible to set frequencies with likwid?

I'm not sure if TurboBoost is activated; normally we try to fix the frequency to the maximum. Perhaps Sascha and/or Paul can answer this part.
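
If you need a fixed frequency for measurements in the meantime, Slurm's --cpu-freq option may be worth a try (a sketch; whether it is honoured, and which values are allowed, depends on how frequency control is configured here, and likwid-setFrequencies would be an alternative if the likwid tools are installed):

#SBATCH --cpu-freq=2100000    # target frequency in kHz for srun-launched steps (value only illustrative)
srun ./a.out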

- I tried to run jobs with 256 nodes. I am getting an MPI error (cf. job 92414):
nrm008.hpc.itc.rwth-aachen.de.233340PSM2 no hfi units are active (err=23)
[245] MPI startup(): tmi fabric is not available and fallback fabric is not enabled

Any ideas where this is coming from? Should I manually adjust I_MPI_FABRICS? I don't want to set the fallback fabric, since this would be TCP and would significantly impact performance. Affected jobs are not canceled but run into the time limit.

These errors only occur on the nrm nodes, not on the ncm nodes. Could there be a problem with the nrm nodes? Currently I am using the partition c18m, which contains both node types. How can I select only ncm nodes?

The nrm nodes are the first batch of the new Tier3 system. It seems something is still odd with them. I took them out of the partition again.
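
Regarding the fabric question: with Intel MPI the fabric can be pinned explicitly instead of relying on a fallback (a sketch; whether shm:tmi or an OFI/PSM2 provider is the right choice depends on the Intel MPI version, and in this case the underlying cause appears to be the nrm nodes themselves rather than the fabric selection):

export I_MPI_FABRICS=shm:tmi   # shared memory intra-node, TMI/PSM2 inter-node
export I_MPI_FALLBACK=0        # fail immediately instead of silently dropping to TCP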


- Historic jobs cannot be viewed with scontrol or sstat; sacct, on the other hand, works. For example:
$ scontrol show job 90949
slurm_load_jobs error: Invalid job id specified

Not sure about that one; I always thought you could only get details of jobs that squeue still shows, which means completed jobs will not be shown. I could find nothing about that in the man page, though.

$ sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j  90945
    AveCPU   AvePages     AveRSS  AveVMSize        JobID
---------- ---------- ---------- ---------- ------------
sstat: error: couldn't get steps for job 90945

excerpt from the manpage:
DESCRIPTION
       Status information for running jobs invoked with Slurm.

So, with sstat, you can only observe running jobs.
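
For completed jobs, sacct can give roughly the same per-step usage numbers, e.g.:

$ sacct -j 90945 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,MaxVMSize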

- Some of my jobs don't appear in the queue and are not scheduled, even though sbatch returns `Submitted batch job 92365`:

[2019-01-24T12:16:54.089] _slurm_rpc_submit_batch_job: JobId=92365 InitPrio=103522 usec=8187
[2019-01-24T12:16:54.227] email msg to : Slurm Job_id=92365 Name=1D-NEGF_execute_1 Began, Queued time 00:00:00
[2019-01-24T12:16:54.227] sched: Allocate JobId=92365 NodeList=nrm023 #CPUs=24 Partition=c18m
[2019-01-24T12:16:54.365] prolog_running_decr: Configuration for JobId=92365 is complete
[2019-01-24T12:16:56.296] _job_complete: JobId=92365 WEXITSTATUS 127
[2019-01-24T12:16:56.296] email msg to : Slurm Job_id=92365 Name=1D-NEGF_execute_1 Failed, Run time 00:00:02, FAILED, ExitCode 127
[2019-01-24T12:16:56.296] _job_complete: JobId=92365 done

$> sacct -j 92365
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
92365        1D-NEGF_e+       c18m    default         24     FAILED    127:0
92365.batch       batch               default         24     FAILED    127:0
92365.extern     extern               default         24  COMPLETED      0:0
92365.0      1D-NEGF-M+               default         24     FAILED    127:0

It was scheduled immediately and failed (exit code 127 usually means the shell could not find a command). You should have received an email.

- You are using a hand-built module system. One issue with this approach is that dependencies are not resolved properly. For example, loading the python module does something unexpected. A short example:

$ module load intel; ldd main.x | grep mkl
intel/19.0 already loaded, doing nothing                                            [ WARNING ]
        libmkl_intel_lp64.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00002ac58b9ee000)
        libmkl_core.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_core.so (0x00002ac58c53c000)
        libmkl_intel_thread.so => /opt/intel/Compiler/19.0/1.144/rwthlnk/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00002ac5906c8000)

$ module load python; ldd main.x | grep mkl
Loading python 2.7.12                                                                    [ OK ]
The SciPy Stack available: http://www.scipy.org/stackspec.html
 Build with GCC compilers.
        libmkl_intel_lp64.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_lp64.so (0x00002abea70ad000)
        libmkl_core.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_core.so (0x00002abea7bcb000)
        libmkl_intel_thread.so => /usr/local_rwth/sw/python/2.7.12/x86_64/extra/lib/libmkl_intel_thread.so (0x00002abea96ba000)

Using mkl_get_version_string() shows that the MKL picked up via the python module is version 2017.0.0 instead of the expected 2019.0.1 that should be loaded.
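
Presumably the python module prepends its own library directory, so it shadows the MKL from the intel module; the search order can be checked with something like:

$ echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -nE 'mkl|python'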

A different approach to a hand-built module system would be using EasyBuild to create the module system. This would avoid such issues.
The blueprint paper can be found here:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7830454
JSC has published their EasyBuild configuration on GitHub:
https://github.com/easybuilders/JSC
The config files from HPC-UGent are also publicly available.

Hi Paul, I think this is your part.

Best,
Sebastian



Best,
Marcus




-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de