Hi all,

I am using a piece of software that internally uses TensorFlow for model training and evaluation. In the training phase I'd like to benefit from GPGPU acceleration and am therefore using the GPU version of TensorFlow. In the inference phase the GPU would be heavily underutilized, and evaluating the models on the CPU is perfectly acceptable. In that case the GPU version of TensorFlow automatically detects that no GPU is present and runs all computation on the CPU. It is, however, still linked against the CUDA libraries and crashes if they are not present.

My problem is that no CUDA libraries seem to be available on the non-GPU part of the CLAIX18 cluster. Steps to reproduce:

$ ssh login18-1 ls /usr/local_rwth/sw/cuda/8.0.44                 # No such file or directory
$ ssh login18-g-1 ls /usr/local_rwth/sw/cuda/8.0.44               # Directory contents
$ ssh login ls /usr/local_rwth/sw/cuda/8.0.44                     # Directory contents
$ sbatch --wrap="ls /usr/local_rwth/sw/cuda/8.0.44"               # No such file or directory
$ sbatch --gres=gpu:1 --wrap="ls /usr/local_rwth/sw/cuda/8.0.44"  # Directory contents

(I took this path from the LD_LIBRARY_PATH environment variable after loading the module cuda/80 on login18-g-1.)

I realize that I could compile my software twice, one version using tensorflow-gpu and one using tensorflow(-cpu). But I think that having the CUDA modules available on CPU nodes as well would also be useful in other settings, e.g. for compiling GPU software without having to block a GPU slot.

So I would like to ask whether the CUDA modules have simply been forgotten on the CPU nodes, or whether this was a deliberate design decision. In the latter case I'd like to open a discussion about changing that decision. Or am I missing some crucial point in the module system?

Thank you very much and best regards
Wilfried Michel
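P.S. For illustration, the CPU fallback I am relying on corresponds roughly to the following TF 1.x sketch (the matmul is just a stand-in for the real evaluation graph). Note that with tensorflow-gpu even "import tensorflow" already fails when the CUDA shared libraries cannot be loaded, which is exactly the crash described above:

import tensorflow as tf  # with tensorflow-gpu, this import already requires the CUDA libraries

# Check whether a usable GPU is present; on CPU-only nodes this returns False
# and we pin the graph to the CPU explicitly.
device = '/gpu:0' if tf.test.is_gpu_available() else '/cpu:0'

with tf.device(device):
    # Stand-in for the real evaluation graph.
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

with tf.Session() as sess:
    print(sess.run(b))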
Dear Mr. Michel,

I must admit we did not install CUDA on any host without an NVIDIA GPU, apart from some special systems. I see and agree that this is not always the best idea, so I have changed it: CUDA should now be available everywhere on CLAIX18.

Best
Marcus
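P.S. A CPU-only batch job along the following lines should now find the module as well. This is only a sketch: the job name and evaluate_model.py are placeholders for your own setup, and the module name is the one you mentioned from login18-g-1.

#!/usr/bin/env zsh
#SBATCH --job-name=cpu-eval    # note: no --gres=gpu:... needed anymore
module load cuda/80            # same module as on login18-g-1
python evaluate_model.py       # placeholder for the actual evaluation script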
--
Marcus Wagner, Dipl.-Inf.
IT Center
Department: Systems and Operations
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner@itc.rwth-aachen.de
www.itc.rwth-aachen.de