Hi all,
I am using a piece of software that internally uses TensorFlow for model
training and evaluation. During the training phase I would like to benefit
from GPGPU acceleration and therefore use a GPU build of TensorFlow. During
the inference phase the GPU would be heavily underutilized, and evaluating
the models on the CPU is perfectly acceptable. In that case the GPU build of
TensorFlow automatically detects that no GPU is present and runs all
computation on the CPU. It is, however, still linked against the CUDA
libraries and crashes if they are not present.
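For reference, the CPU fallback can also be forced explicitly by hiding the GPUs from the CUDA runtime via CUDA_VISIBLE_DEVICES. Note that this only affects device selection; it does not remove the load-time dependency on the CUDA shared libraries, which is exactly the problem described here:

```shell
# Hide all GPUs from CUDA-aware processes started from this shell. The CUDA
# runtime then reports zero devices, so tensorflow-gpu falls back to its CPU
# kernels -- but the CUDA shared libraries must still be resolvable at load time.
export CUDA_VISIBLE_DEVICES=""

# Child processes inherit the (empty) setting:
env | grep '^CUDA_VISIBLE_DEVICES='
```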
My problem is that on the non-GPU part of the Claix18 cluster no CUDA
libraries seem to be available.
Steps to reproduce:
$ ssh login18-1 ls /usr/local_rwth/sw/cuda/8.0.44                 # No such file or directory
$ ssh login18-g-1 ls /usr/local_rwth/sw/cuda/8.0.44               # Directory contents
$ ssh login ls /usr/local_rwth/sw/cuda/8.0.44                     # Directory contents
$ sbatch --wrap="ls /usr/local_rwth/sw/cuda/8.0.44"               # No such file or directory
$ sbatch --gres=gpu:1 --wrap="ls /usr/local_rwth/sw/cuda/8.0.44"  # Directory contents
(I took this path from the LD_LIBRARY_PATH environment variable after
loading the module cuda/80 on login18-g-1.)
I realize that I could compile my software twice, one version against
tensorflow-gpu and one against tensorflow(-cpu). But I think that having the
CUDA modules available on the CPU nodes would also be useful in other
settings, e.g. for compiling GPU software without having to block a GPU
slot.
So I would like to ask whether the CUDA modules were simply forgotten on the
CPU nodes, or whether this was a deliberate design decision. In the latter
case I would like to open a discussion about changing that decision.
Or am I missing some crucial point in the module system?
Thank you very much and best regards
Wilfried Michel