How to run Tensorflow-GPU in Podman?

Hi.
I have a Dell XPS 9550. It has a discrete NVIDIA GPU along with an Intel i7-6700HQ. I am running Fedora 32 and am interested in running Tensorflow with GPU support. I have installed the NVIDIA drivers, and I ran podman pull tensorflow/tensorflow:latest-gpu to pull the Tensorflow image onto my machine from Docker Hub. As per their documentation, this container only needs the NVIDIA drivers on the host to use the GPU; CUDA and the other runtime libraries are installed in the image itself. I tried to check whether this image is working with the GPU using
podman run tensorflow/tensorflow:latest-gpu python3 -c "import tensorflow as tf; tf.config.list_physical_devices('GPU')"
But this gives me

2020-07-29 10:44:58.220902: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-29 10:44:58.220958: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (-1)
2020-07-29 10:44:58.221012: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist

But the GPU certainly exists on the host. How can I get this to work with Podman, which I understand to be a drop-in replacement for Docker?
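The device node the error complains about can be checked on the host with a quick diagnostic (just a sketch; output depends on the driver setup):

```shell
# List the NVIDIA device nodes on the host. If /dev/nvidia0 is missing here
# too, the driver module is not loaded; if it is present, the problem is on
# the container side (the node is not being passed into the container).
ls -l /dev/nvidia*
```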

I can’t comment on the particular issue since I don’t use nvidia hardware any more.

This is not the case for all uses of containers. Docker, by default, must run with superuser privileges. Podman, by default, does not, and that works for lots of use cases. However, lots of applications do require superuser privileges, and to run those correctly you must run Podman with sudo too.

(In short, if something needs superuser privileges, Podman will not magically remove this requirement.)

Ok. Thank you for the clarification. So far, I have been able to make Podman work for most containers with some adjustments. I am hoping somebody has faced and solved this issue before me. Otherwise, I will try installing Docker.

If you are confident your GPU is set up correctly, the next step would simply be to try running podman as root too. That’s closer to Docker’s behaviour.
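That is, something along these lines (rerunning the original check, but rootful):

```shell
# Rootful podman is closer to Docker's default privilege model, so this
# rules out rootless-specific device-access problems.
sudo podman run --rm tensorflow/tensorflow:latest-gpu \
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```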

If you do want to try docker too, please take a look at this post:

Hi @amit112amit. Try this https://gist.github.com/bernardomig/315534407585d5912f5616c35c7fe374

Yes, I have run some benchmarks comparing my GPU’s performance to the integrated graphics, so the GPU is definitely set up correctly. Running podman as root did not help. I found that there is an open issue about podman and nvidia-container-runtime, so it seems I should try Docker as you have suggested.

Thank you for the gist @brogos. I tried to follow it, but I am unable to complete step 2 because its link is no longer working. I did find the official installation page for nvidia-container-runtime; unfortunately, when I try to install it, I get an error that Fedora 32 is an unsupported OS.

@amit112amit this person fixed the problem using nvidia-container-runtime from Red Hat: https://www.quora.com/How-can-TensorFlow-with-NVIDIA-GPU-support-be-installed-on-Fedora-32

Thank you @brogos and @FranciscoD for your inputs. I tried to switch from podman to Docker, but Docker broke the networking of my QEMU/KVM setup. This is a known issue with a workaround. It also requires switching back to cgroups v1 as part of the Docker installation, so I was not very satisfied with it.
Solution:
I was able to get podman to work using the links proposed by @brogos. The important steps are:

  1. Install the NVIDIA driver on the host. Currently it is at version 440.100 for Fedora 32.
  2. Install nvidia-container-toolkit following the instructions here. Note: you will get an error that Fedora 32 is an unsupported distribution, so just set distribution=rhel8.2.
  3. Edit /etc/nvidia-container-runtime/config.toml to set no-cgroups = true.
  4. Whatever container image you want to run must match the CUDA version supported by the NVIDIA driver installed on the host. For driver 440.100 that is CUDA 10.2. The nvidia/cuda:latest Docker image is at CUDA 11, so it will not work; I was making this mistake at first.
  5. To test the installation, you can run the following command, which uses nvidia-smi, provided on the host by the xorg-x11-drv-nvidia-cuda package from RPMFusion.
podman run -it --rm --security-opt=label=disable nvidia/cuda:10.2-base nvidia-smi
Sat Aug  1 15:43:00 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P8    N/A /  N/A |     36MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

If you don’t get this output, you will most likely see an error message; check that you have followed all the steps correctly. All of this is based on the discussion in this GitHub issue.
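The steps above can be condensed into a shell sketch. The repo URL follows the NVIDIA install instructions of the time and may have changed since, so treat it as an assumption:

```shell
# Step 2: add the nvidia-container-toolkit repo, pretending to be RHEL 8.2,
# since Fedora 32 is not recognised as a supported distribution
distribution=rhel8.2
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo \
    | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo dnf install -y nvidia-container-toolkit

# Step 3: let rootless podman use the runtime without cgroup management
sudo sed -i 's/^#\?no-cgroups = false/no-cgroups = true/' \
    /etc/nvidia-container-runtime/config.toml

# Step 5: verify with a CUDA image that matches the host driver
# (CUDA 10.2 for driver 440.100)
podman run -it --rm --security-opt=label=disable nvidia/cuda:10.2-base nvidia-smi
```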

Alternate solution:
I came across another solution, proposed by u/Abraxis_Dragon in this comment on r/Fedora. I find it the most hassle-free; it uses Singularity containers. I followed these steps:

  1. Install NVIDIA drivers from RPMFusion as explained here.
  2. Install Singularity, which is available in the official repo:
    sudo dnf install singularity
  3. Build the Tensorflow GPU container into a Singularity image:
    singularity build mytensorflow.sif docker://tensorflow/tensorflow:latest-gpu
  4. Run the container using the --nv flag to allow direct access to the NVIDIA GPU:
    singularity run --nv mytensorflow.sif
  5. We can check that the GPU is actually available inside the container by running the following command:
INFO:    Could not find any nv files on this host!

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!

Singularity> python3 -c "import tensorflow as tf; tf.config.list_physical_devices('GPU')"
2020-08-01 14:42:30.562914: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-01 14:42:32.529449: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-01 14:42:32.537876: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-01 14:42:32.538595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.0975GHz coreCount: 5 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 74.65GiB/s
2020-08-01 14:42:32.538636: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-01 14:42:32.584239: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-01 14:42:32.609743: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-08-01 14:42:32.618569: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-08-01 14:42:32.666673: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-08-01 14:42:32.676543: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-08-01 14:42:32.765228: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-08-01 14:42:32.765425: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-01 14:42:32.766202: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-01 14:42:32.766547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0

So Tensorflow is actually able to use the GPU without any configuration tweaks or workarounds! Hope this helps somebody!
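For convenience, the whole Singularity workflow from the steps above as one sketch (the image name mytensorflow.sif is just the example used in this thread):

```shell
# One-time setup: install Singularity and build an image from Docker Hub
sudo dnf install -y singularity
singularity build mytensorflow.sif docker://tensorflow/tensorflow:latest-gpu

# Start the container; --nv binds the host NVIDIA driver libraries in
singularity run --nv mytensorflow.sif

# Then, at the Singularity> prompt inside the container:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```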

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.