
 ______________________________
/ Setting up an NVIDIA GPU for \
\ deep learning on Debian 9.8  /
 ------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

I tested the following procedure on Debian GNU/Linux 9.8 (stretch) only. It may or may not work on other Debian or GNU/Linux versions. In case of problems, uninstalling the NVIDIA driver should be enough to get back to a previously working system.

It took me a while to successfully set up my laptop to use keras/tensorflow with gpu support. After several attempts I was finally able to do it thanks to several tutorials and Q&A posts found on the web.

This post is just a linear, self-contained summary of those resources. Here you can find the main original references:

Let's start!

First of all, enable the contrib and non-free repositories by editing /etc/apt/sources.list.

sudo nano /etc/apt/sources.list

Append contrib non-free to the existing sources. On my system they look like this

deb http://ftp.it.debian.org/debian/ stretch main non-free contrib
deb-src http://ftp.it.debian.org/debian/ stretch main non-free contrib

deb http://security.debian.org/debian-security stretch/updates main contrib non-free
deb-src http://security.debian.org/debian-security stretch/updates main contrib non-free

deb http://deb.debian.org/debian/ stretch-updates main contrib non-free
deb-src http://deb.debian.org/debian/ stretch-updates main contrib non-free
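Before running apt update, it can be handy to double check that no entry was missed. The following is only a minimal sketch: the function name missing_components is made up for this example, and it assumes the classic one-line "deb uri suite component..." format shown above.

```python
# Sketch: report which "deb"/"deb-src" entries lack contrib or non-free.
# Assumes the one-line sources.list format shown above.
REQUIRED = ("contrib", "non-free")


def missing_components(sources_text, required=REQUIRED):
    """Return (line, missing) pairs for entries lacking a required component."""
    problems = []
    for raw in sources_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] not in ("deb", "deb-src"):
            continue
        rest = fields[1:]
        # Drop an optional "[option=value ...]" block, if present.
        if rest and rest[0].startswith("["):
            while rest and not rest[0].endswith("]"):
                rest = rest[1:]
            rest = rest[1:]
        # Remaining layout: <uri> <suite> <component> [<component> ...]
        components = rest[2:]
        missing = [c for c in required if c not in components]
        if missing:
            problems.append((line, missing))
    return problems


# The first line is taken from the list above, the second is deliberately incomplete.
sample = """\
deb http://ftp.it.debian.org/debian/ stretch main non-free contrib
deb http://deb.debian.org/debian/ stretch-updates main
"""
for line, missing in missing_components(sample):
    print(line, "-> missing:", ", ".join(missing))
```

To check the real file, pass open("/etc/apt/sources.list").read() to missing_components instead of the sample text.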

Next, update the package indices by running

sudo apt update

At this point it should be possible to install the nvidia-detect package. It will tell you whether your gpu is recognized by your system or not.

sudo apt install nvidia-detect

Running nvidia-detect I obtain

Detected NVIDIA GPUs:
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8c] (rev a1)

Checking card:  NVIDIA Corporation Device 1c8c (rev a1)
Your card is supported by the default drivers.
It is recommended to install the
nvidia-driver
package.

This means that my gpu is recognized. Despite the recommendation, do not install the nvidia-driver package: the driver will be installed later by the CUDA installer.

As an alternative method you can list the available gpus by running

lspci -nn | egrep -i "3d|display|vga"

In my case it reports

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:591b] (rev 04)
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8c] (rev a1)

This means that two gpus were found: the integrated Intel one and the NVIDIA one.
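If you prefer to check this programmatically, the lspci lines can be parsed. This is just a sketch: the function names are invented for the example, the regular expression is written against the two lines shown above (other lspci formats may not match), and it assumes that 10de, the vendor part of [10de:1c8c], identifies NVIDIA devices.

```python
import re

# Matches lines such as:
# 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8c] (rev a1)
LINE_RE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<cls>.+?)\s+\[(?P<class_id>[0-9a-f]{4})\]:\s+"
    r"(?P<name>.+?)\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)

NVIDIA_VENDOR_ID = "10de"  # assumption: vendor ID seen in the NVIDIA line above


def parse_lspci(text):
    """Return one dict per line of `lspci -nn` output that matches LINE_RE."""
    gpus = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            gpus.append(match.groupdict())
    return gpus


def nvidia_gpus(text):
    """Keep only the devices whose PCI vendor ID is NVIDIA's."""
    return [gpu for gpu in parse_lspci(text) if gpu["vendor"] == NVIDIA_VENDOR_ID]


# The two lines reported on my system.
sample = """\
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:591b] (rev 04)
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8c] (rev a1)
"""
for gpu in nvidia_gpus(sample):
    print(gpu["slot"], gpu["name"], gpu["vendor"] + ":" + gpu["device"])
```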

So now we can move on with the actual driver installation.

At the time of writing, the current TensorFlow version supports the CUDA 9.0 toolkit. Go to https://developer.nvidia.com/cuda-90-download-archive and select the latest runfile installer version for Ubuntu. Download the base installer and all the available patch files. As suggested in https://unix.stackexchange.com/questions/218163/how-to-install-cuda-toolkit-7-8-9-on-debian-8-jessie-or-9-stretch, do not store the files in /tmp.

Locate the downloaded .run files and make them executable by running

chmod a+x cuda_9.0*.run

Next, execute the base installer by running

sudo ./cuda_9.0.176_384.81_linux.run

Read and accept the licence. Then answer the installer's questions as follows

You are attempting to install on an unsupported configuration. Do you wish to continue?
(y)es/(n)o [ default is no ]

Type y if this message appears and you are sure that you have an NVIDIA gpu that supports CUDA. You can check this by reading the specifications of your gpu model on the NVIDIA website. Here is mine: https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-1050-ti/specifications (check Supported Technologies for CUDA support).

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit:

Type y

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]:

Type y

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]:

Type n

Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit:

Type y

Enter Toolkit Location
[ default is /usr/local/cuda-9.0 ]:

Accept the default path (or change it if you prefer)

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit:

Type y

Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit:

Type y

Enter CUDA Samples Location
[ default is /home/acco ]:

Accept the default path (or change it if you prefer)

This should be part of the output

Installing the NVIDIA display driver...
A system reboot is required to continue installation.
Please reboot then run the installer again.
An attempt has been made to disable Nouveau.
If this message persists after reboot, please see the display driver log file
at /var/log/nvidia-installer.log for more information.

===========
= Summary =
===========

Driver:   Reboot required to continue
Toolkit:  Installation skipped
Samples:  Installation skipped

Apparently a reboot is needed in order to finalize the driver installation and proceed with the toolkit/samples installation.

sudo reboot
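After the reboot, one way to see whether the attempt to disable Nouveau actually worked is to look for the nouveau module in /proc/modules. The snippet below is only a sketch: the helper names are made up for this example and the sample lines are illustrative, not copied from a real system.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # the first column is the module name
    return names


def nouveau_loaded(proc_modules_text):
    """True if the nouveau kernel module appears among the loaded modules."""
    return "nouveau" in loaded_modules(proc_modules_text)


# Illustrative sample (hypothetical values, same column layout as /proc/modules).
sample = (
    "nouveau 1527808 1 - Live 0x0000000000000000\n"
    "video 40960 1 nouveau, Live 0x0000000000000000\n"
)
print("nouveau loaded:", nouveau_loaded(sample))
```

On the real system, pass open("/proc/modules").read() instead of the sample. If nouveau is still listed, the log file mentioned by the installer (/var/log/nvidia-installer.log) is the place to look.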

Now, for some reason the resolution may have been lowered. Don't worry too much, once the installation is complete everything should return to how it originally was!

Locate and run the base installer again. Answer the questions the same as before. Keep track of the samples directory. It should be something like /home/username/NVIDIA_CUDA-9.0_Samples/ if you used the default path.

Once the installation is complete, if the screen resolution was changed, reboot and hope for the best. Otherwise go on without rebooting.

Locate the samples and enter one of them.

cd ~/NVIDIA_CUDA-9.0_Samples/0_Simple/vectorAdd

Build the source with

make

To build the source you need a compiler toolchain on your system, e.g. the one provided by the build-essential package.

sudo apt install build-essential

Run the example

./vectorAdd

It should output something like

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
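The "196 blocks of 256 threads" figure in this output can be cross-checked by hand. The sketch below assumes the sample sizes its grid with the usual ceiling division (enough blocks to cover every element); the function name is made up for this example.

```python
def blocks_per_grid(num_elements, threads_per_block):
    """Smallest number of blocks whose threads cover every element."""
    # Integer ceiling division: round num_elements / threads_per_block up.
    return (num_elements + threads_per_block - 1) // threads_per_block


blocks = blocks_per_grid(50000, 256)
print(blocks)                 # 196
print(blocks * 256)           # 50176 threads launched
print(blocks * 256 - 50000)   # 176 threads with no element to process
```

So 195 blocks would cover only 49920 elements, and a 196th block is needed for the remaining 80.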

Apparently you have succeeded! ... but you have not finished yet! Go on by installing the patches you previously downloaded along with the base installer.

sudo ./cuda_9.0.176.1_linux.run
sudo ./cuda_9.0.176.2_linux.run
sudo ./cuda_9.0.176.3_linux.run
sudo ./cuda_9.0.176.4_linux.run

For all of them, accept the licence. Also accept the default installation path, provided you kept the default in the base installer too.

Now let's move on to installing the cuDNN library. Go to https://developer.nvidia.com/cudnn and press download cuDNN. Register if you don't have an account.

Once you are logged in, return to https://developer.nvidia.com/cudnn and press download cuDNN again. Accept the terms, expand cuDNN library for CUDA 9.0 and download the cuDNN Runtime Library for Ubuntu 16.04 (Deb).

Install the .deb file. You can use gdebi.

sudo apt install gdebi

then

sudo gdebi libcudnn7_7.5.0.56-1\ cuda9.0_amd64.deb

Be careful: there is a space in the file name (use tab to auto-complete!).

Update your LD_LIBRARY_PATH by adding the following text as a new line at the end of your ~/.bashrc file. You might need to update a different file if you are using another shell.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64/

Log out and log in again.
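Once you are logged in again, a quick sanity check is to verify that the CUDA directory is on LD_LIBRARY_PATH and that the dynamic loader can find the libraries. This is only a sketch: the helper names are invented, and the library file names libcudart.so.9.0 and libcudnn.so.7 are assumptions based on the CUDA 9.0 and cuDNN 7 versions installed above.

```python
import ctypes
import os

CUDA_LIB_DIR = "/usr/local/cuda-9.0/lib64"  # default toolkit location used above


def on_library_path(directory, env_value):
    """True if `directory` is one of the colon-separated entries of env_value."""
    wanted = os.path.normpath(directory)
    return any(
        os.path.normpath(entry) == wanted
        for entry in env_value.split(":")
        if entry
    )


def can_load(library_name):
    """True if the dynamic loader can find and load the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


current = os.environ.get("LD_LIBRARY_PATH", "")
print("CUDA lib dir on LD_LIBRARY_PATH:", on_library_path(CUDA_LIB_DIR, current))

# Assumed library file names for CUDA 9.0 and cuDNN 7.
for name in ("libcudart.so.9.0", "libcudnn.so.7"):
    print(name, "loadable:", can_load(name))
```

Run it with python3 from a freshly opened terminal, since LD_LIBRARY_PATH is read when the process starts.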

Done! Let's check that it works.

Install tensorflow-gpu. If you are using pip3, type

sudo pip3 install tensorflow-gpu

Open a new terminal session and run the python interpreter (e.g. python3). Execute

import tensorflow as tf

tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None)

It should output something like

2019-02-25 11:59:05.760120: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-25 11:59:06.091764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-25 11:59:06.092167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.89GiB
2019-02-25 11:59:06.092181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-25 11:59:06.878669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-25 11:59:06.878725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-25 11:59:06.878741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-25 11:59:06.879727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 3621 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
True

Tensorflow was able to find the gpu device! That is all! This worked for me! ... and I hope it will work for you too.