 ______________________________
/ Setting up an NVIDIA GPU for \
\ deep learning on Debian 9.8  /
 ------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
I tested the following procedure on Debian GNU/Linux 9.8 (stretch) only. It may or may not work on other Debian or GNU/Linux versions. In case of problems, uninstalling the nvidia driver should be enough to get back to a previously working system.
It took me a while to successfully set up my laptop to use keras/tensorflow with gpu support. After several attempts I was finally able to do it thanks to several tutorials and Q&A posts found on the web.
This post is just a linear, self-contained summary of those resources. Here you can find the main original references:
Let's start!
First of all, enable the contrib and non-free repositories by editing /etc/apt/sources.list.
sudo nano /etc/apt/sources.list
Append contrib non-free to the existing sources. On my system they look like this:
deb http://ftp.it.debian.org/debian/ stretch main non-free contrib
deb-src http://ftp.it.debian.org/debian/ stretch main non-free contrib
deb http://security.debian.org/debian-security stretch/updates main contrib non-free
deb-src http://security.debian.org/debian-security stretch/updates main contrib non-free
deb http://deb.debian.org/debian/ stretch-updates main contrib non-free
deb-src http://deb.debian.org/debian/ stretch-updates main contrib non-free
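To double-check the edit, here is a minimal Python sketch that reports which of the two components are still missing from a sources.list line. The helper name missing_components is hypothetical (it is not part of apt), and the parsing assumes the classic one-line "deb URI suite components..." format used above.

```python
REQUIRED = ("contrib", "non-free")


def missing_components(line, required=REQUIRED):
    """Return the required components absent from one sources.list line.

    Hypothetical helper: comments, blank lines and anything that is not a
    "deb"/"deb-src" entry yield an empty list.
    """
    fields = line.split("#", 1)[0].split()
    if not fields or fields[0] not in ("deb", "deb-src"):
        return []
    rest = fields[1:]
    # Skip an optional "[ option=value ... ]" block after the entry type.
    if rest and rest[0].startswith("["):
        while rest and not rest[0].endswith("]"):
            rest = rest[1:]
        rest = rest[1:]
    components = rest[2:]  # what follows the URI and the suite
    return [name for name in required if name not in components]


# Demo on one line from this post and one line without the extra components.
for entry in (
    "deb http://ftp.it.debian.org/debian/ stretch main non-free contrib",
    "deb http://deb.debian.org/debian/ stretch-updates main",
):
    print(entry, "->", missing_components(entry) or "ok")
```

To run it against the real file, loop over the lines of /etc/apt/sources.list instead of the two demo entries.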
Next, update the package indices by running
sudo apt update
At this point it should be possible to install the nvidia-detect package.
It will tell you whether your gpu is recognized by your system or not.
sudo apt install nvidia-detect
Running nvidia-detect, I obtain
Detected NVIDIA GPUs:
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8c] (rev a1)
Checking card: NVIDIA Corporation Device 1c8c (rev a1)
Your card is supported by the default drivers.
It is recommended to install the
nvidia-driver
package.
This means that my gpu is recognized. Do not install the nvidia-driver package, despite the recommendation: the driver is installed later by the CUDA installer.
As an alternative method you can list the available gpus by running
lspci -nn | egrep -i "3d|display|vga"
In my case it reports
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:591b] (rev 04)
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8c] (rev a1)
This means that two gpus were found: the integrated Intel one and the Nvidia one.
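If you prefer to read that listing programmatically, the following Python sketch parses "lspci -nn" lines and keeps the display-class devices (PCI class codes starting with 03). The function name parse_gpus and the small VENDORS table are illustrative assumptions. The two vendor IDs are the ones visible in the output above.

```python
import re

# One "lspci -nn" line: slot, class name [class code]: device name [vendor:device]
LINE_RE = re.compile(
    r"^(?P<slot>\S+) (?P<cls>.+?) \[(?P<cls_id>[0-9a-f]{4})\]: "
    r"(?P<name>.+) \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)

# PCI vendor IDs that appear in this post's output (illustrative table).
VENDORS = {"10de": "NVIDIA", "8086": "Intel"}


def parse_gpus(lspci_output):
    """Return one dict per display-class device (PCI class code 03xx)."""
    gpus = []
    for line in lspci_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("cls_id").startswith("03"):
            vendor_id = match.group("vendor")
            gpus.append({
                "slot": match.group("slot"),
                "class": match.group("cls"),
                "vendor": VENDORS.get(vendor_id, vendor_id),
                "device_id": match.group("device"),
            })
    return gpus


# The two lines reported on my laptop.
SAMPLE = """\
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:591b] (rev 04)
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8c] (rev a1)
"""

for gpu in parse_gpus(SAMPLE):
    print(gpu["slot"], gpu["vendor"], gpu["class"], gpu["device_id"])
```

On a real system you could feed it the text returned by running lspci -nn, for example through subprocess.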
So now we can move on with the actual driver installation.
At the time of writing, the current Tensorflow version supports the CUDA 9.0 toolkit. Go to https://developer.nvidia.com/cuda-90-download-archive and select the latest runfile installer version for Ubuntu. Download the base installer and all the available patch files. As suggested in https://unix.stackexchange.com/questions/218163/how-to-install-cuda-toolkit-7-8-9-on-debian-8-jessie-or-9-stretch, do not store the files in /tmp.
Locate the downloaded .run files and make them executable by running
chmod a+x cuda_9.0*.run
Next, execute the base installer by running
sudo ./cuda_9.0.176_384.81_linux.run
Read and accept the licence. Then answer the questions as follows:
You are attempting to install on an unsupported configuration. Do you wish to continue?
(y)es/(n)o [ default is no ]
Type y if this message appears and you are sure that your nvidia gpu supports CUDA. You can check this by reading the specification of your gpu model on the nvidia website. Here is mine: https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-1050-ti/specifications (check Supported Technologies for CUDA support).
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit:
Type y
Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]:
Type y
Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]:
Type n
Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit:
Type y
Enter Toolkit Location
[ default is /usr/local/cuda-9.0 ]:
Accept the default path (or change it if you prefer)
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit:
Type y
Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit:
Type y
Enter CUDA Samples Location
[ default is /home/acco ]:
Accept the default path (or change it if you prefer)
This should be part of the output
Installing the NVIDIA display driver...
A system reboot is required to continue installation.
Please reboot then run the installer again.
An attempt has been made to disable Nouveau.
If this message persists after reboot, please see the display driver log file
at /var/log/nvidia-installer.log for more information.
===========
= Summary =
===========
Driver: Reboot required to continue
Toolkit: Installation skipped
Samples: Installation skipped
Apparently a reboot is needed in order to finalize the driver installation and proceed with the toolkit/samples installation.
sudo reboot
After the reboot, the screen resolution may have been lowered for some reason. Don't worry too much: once the installation is complete, everything should return to how it originally was!
Locate and run the base installer again. Answer the questions the same way as before. Keep track of the samples directory. It should be something like /home/username/NVIDIA_CUDA-9.0_Samples/ if you used the default path.
Once the installation is complete, if the screen resolution was changed, reboot and hope for the best. Otherwise go on without rebooting.
Locate the samples and enter one of them.
cd ~/NVIDIA_CUDA-9.0_Samples/0_Simple/vectorAdd
Build the source with
make
In order to build the source successfully you need a compiler toolchain, e.g. the build-essential package, installed on your system.
sudo apt install build-essential
Run the example
./vectorAdd
It should output something like
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
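As a side note, the block count in that output follows from simple arithmetic: each block holds 256 threads, so covering 50000 elements takes a ceiling division. The Python sketch below reproduces the figure. The function name blocks_needed is hypothetical, and the rounding-up formula is the usual CUDA idiom rather than something taken from this post.

```python
def blocks_needed(num_elements, threads_per_block):
    """Smallest number of blocks whose threads cover every element."""
    return (num_elements + threads_per_block - 1) // threads_per_block


blocks = blocks_needed(50000, 256)
print(blocks)                 # 196, as in the vectorAdd output
print(blocks * 256 - 50000)   # 176 threads in the last block have no element
```

The surplus threads are why such kernels usually guard the work with an index check before touching memory.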
Apparently you have succeeded! ... but you have not finished yet! Go on by installing the patches you previously downloaded along with the base installer.
sudo ./cuda_9.0.176.1_linux.run
sudo ./cuda_9.0.176.2_linux.run
sudo ./cuda_9.0.176.3_linux.run
sudo ./cuda_9.0.176.4_linux.run
For each of them, accept the licence and keep the default installation path if you also kept it in the base installer.
Now let's move on to installing the cuDNN library. Go to https://developer.nvidia.com/cudnn and press download cuDNN. Register if you don't have an account.
Once you are logged in, return to https://developer.nvidia.com/cudnn and press download cuDNN again. Accept the terms, expand cuDNN library for CUDA 9.0 and download the cuDNN Runtime Library for Ubuntu 16.04 (Deb).
Install the .deb file. You can use gdebi.
sudo apt install gdebi
then
sudo gdebi libcudnn7_7.5.0.56-1\ cuda9.0_amd64.deb
Be careful: there is a space in the file name (use tab to auto-complete!).
Update your LD_LIBRARY_PATH by adding the following text as a new line at the end of your .bashrc file. You might need to update a different file if you are using another shell.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64/
Log out and log in again.
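Before moving on, you can verify the environment with a short Python sketch like the one below. It checks that the CUDA lib64 directory is listed in LD_LIBRARY_PATH and tries to load the runtime libraries through ctypes. The helper names path_contains and can_load are hypothetical, and the library file names libcudart.so.9.0 and libcudnn.so.7 are my assumption for CUDA 9.0 and cuDNN 7, so adjust them if your versions differ.

```python
import ctypes
import os

# Default toolkit location offered by the installer in this post.
CUDA_LIB_DIR = "/usr/local/cuda-9.0/lib64"


def path_contains(path_value, directory):
    """True if a colon-separated path variable lists the given directory."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in path_value.split(":") if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


def can_load(library_name):
    """True if the dynamic loader can find and open the shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


ld_path = os.environ.get("LD_LIBRARY_PATH", "")
print("lib64 on LD_LIBRARY_PATH:", path_contains(ld_path, CUDA_LIB_DIR))

# Assumed library names for CUDA 9.0 and cuDNN 7.
for name in ("libcudart.so.9.0", "libcudnn.so.7"):
    print(name, "loads:", can_load(name))
```

If either library fails to load, open a fresh login session first, since .bashrc changes only apply to new shells.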
Done! Let's check whether it works.
Install tensorflow-gpu. If you are using pip3, type
sudo pip3 install tensorflow-gpu
Open a new terminal session and run the python interpreter (e.g. python3).
Execute
import tensorflow as tf
tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None)
It should output something like
2019-02-25 11:59:05.760120: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-25 11:59:06.091764: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-25 11:59:06.092167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.89GiB
2019-02-25 11:59:06.092181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-25 11:59:06.878669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-25 11:59:06.878725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-25 11:59:06.878741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-25 11:59:06.879727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 3621 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
True
Tensorflow was able to find the gpu device! That is all! This worked for me! ... and I hope it will work for you too.