Hi all,
I am using a piece of software that internally uses TensorFlow for model
training and evaluation. During the training phase I would like to benefit
from GPGPU acceleration and therefore use a GPU build of TensorFlow. During
the inference phase the GPU would be heavily underutilized, and evaluating
the models on the CPU is perfectly acceptable. In that case the GPU build of
TensorFlow automatically detects that no GPU is present and runs all
computation on the CPU. It is, however, still linked against the CUDA
libraries and crashes if they are not present.
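For reference, the CPU fallback can also be forced explicitly by hiding the GPUs from the CUDA runtime via CUDA_VISIBLE_DEVICES. Note that this only affects device selection; it does not remove the load-time dependency on the CUDA shared libraries, which is exactly the problem described here:

```shell
# Hide all GPUs from CUDA-aware processes started from this shell. The CUDA
# runtime then reports zero devices, so tensorflow-gpu falls back to its CPU
# kernels -- but the CUDA shared libraries must still be resolvable at load time.
export CUDA_VISIBLE_DEVICES=""

# Child processes inherit the (empty) setting:
env | grep '^CUDA_VISIBLE_DEVICES='
```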
My problem is that on the non-GPU part of the Claix18 cluster no CUDA
libraries seem to be available.
Steps to reproduce:
$ ssh login18-1 ls /usr/local_rwth/sw/cuda/8.0.44                 # No such file or directory
$ ssh login18-g-1 ls /usr/local_rwth/sw/cuda/8.0.44               # Directory contents
$ ssh login ls /usr/local_rwth/sw/cuda/8.0.44                     # Directory contents
$ sbatch --wrap="ls /usr/local_rwth/sw/cuda/8.0.44"               # No such file or directory
$ sbatch --gres=gpu:1 --wrap="ls /usr/local_rwth/sw/cuda/8.0.44"  # Directory contents
(I took this path from the LD_LIBRARY_PATH environment variable after
loading the module cuda/80 on login18-g-1.)
I realize that I could compile my software twice, one version against
tensorflow-gpu and one against tensorflow(-cpu). But I think that having the
CUDA modules available on the CPU nodes would also be useful in other
settings, e.g. for compiling GPU software without having to block a GPU
slot.
So I would like to ask whether the CUDA modules were simply forgotten on the
CPU nodes, or whether this was a deliberate design decision. In the latter
case I would like to open a discussion about changing that decision.
Or am I missing some crucial point in the module system?
Thank you very much and best regards
Wilfried Michel