NVIDIA driver/CUDA installation causes CentOS 7 to hang on boot; unable to access the user interface.
I've been tasked with installing CUDA on some
new servers that all came with CentOS 7 installed. I followed the
instructions for installing CUDA, and everything went smoothly until I
restarted the computer. Upon restart a boot log checklist is displayed,
and the computer hangs there indefinitely. I can get to the command line
with ctrl+alt+f2, and the good news is that CUDA works properly: the
samples compile and run fine. But I can find no way to get the GUI
working without uninstalling the NVIDIA driver and switching back to the
nouveau driver that came with the system, which breaks CUDA.
You need to remove all traces of the nouveau driver before installing the NVIDIA driver.
Something like this:
Switch to runlevel 3.
as root:
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/disable-nouveau.conf
dracut --force
Then reboot into runlevel 3 and run the CUDA 7 runfile installer.
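Before running the installer it may be worth confirming that the blacklist actually took effect. The following is a minimal sketch, not part of the original instructions: the function names nouveau_loaded and nouveau_blacklisted are hypothetical helpers, and both read their input from stdin so they can be tried on saved output as well as on a live system.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any NVIDIA tool).

# Succeeds if the lsmod-style listing on stdin shows the nouveau module loaded.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Succeeds if the modprobe config text on stdin contains a
# "blacklist nouveau" line.
nouveau_blacklisted() {
    grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau([[:space:]]|$)'
}

# Typical use as root, after the reboot:
#   cat /etc/modprobe.d/*.conf | nouveau_blacklisted || echo "no blacklist entry found"
#   lsmod | nouveau_loaded && echo "nouveau is still loaded"
#
# On CentOS 7 (systemd), "runlevel 3" corresponds to multi-user.target:
#   systemctl isolate multi-user.target       # switch now
#   systemctl set-default multi-user.target   # boot into it next time
```

If nouveau still shows up in lsmod after the reboot, the initramfs was probably not regenerated for the running kernel, and the NVIDIA installer is likely to complain.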
Thanks for your help; however, the system still
hangs in the same spot.
I uninstalled the driver and CUDA, then added the blacklist using the
command you gave me. I also found more commands for removing nouveau via
Google, including yum remove xorg-x11-drv-nouveau, and ran dracut
--force. After all this and re-running the installers, I have the same
issue.
edit:
Looking at the boot log more closely: previously the boot would hang
after a different [ OK ] line each time, but now the following line,
which I hadn't seen before, is always displayed:
[* ] A start job is running for Wait for Plymouth Boot Screen to Quit
I don't know the history of the system(s), and
you haven't provided any logs or indicated where in the boot process it
is hanging. It's possible that there are other conflicting NVIDIA
components.
Run all of the following as root.
What is the result of running:
yum list nvidia-*
Which version of CUDA are you trying to install?
Are you using a runfile installer or a package manager method?
Do you have an NVIDIA GPU? Which one?
What is the result of running:
lspci -v |grep NV
What is the result of running:
dmesg |grep NVRM
and
dmesg |grep nouv
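To make it easier to post all of that output in one go, here is a small sketch that gathers the commands above into a single report. The helper names run_section and collect_report are made up for illustration; the commands themselves are the ones listed above, with the yum glob quoted so the shell does not expand it against files in the current directory.

```shell
#!/bin/sh
# Hypothetical helpers for collecting the requested diagnostics.

# Print a titled section, then the output of the given command.
# If the command is not installed, say so instead of failing.
run_section() {
    title=$1
    shift
    printf '===== %s =====\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || true
    else
        printf '(%s not found)\n' "$1"
    fi
}

# Collect everything asked for above into one report (run as root).
collect_report() {
    run_section 'nvidia packages' yum list 'nvidia-*'
    run_section 'GPU on the PCI bus' sh -c 'lspci -v | grep NV'
    run_section 'NVIDIA kernel messages' sh -c 'dmesg | grep NVRM'
    run_section 'nouveau kernel messages' sh -c 'dmesg | grep nouv'
}

# Typical use:
#   collect_report > nvidia-report.txt
```

A section that prints nothing under its title just means the command found no matching lines, which is itself useful information (for example, no nouveau messages in dmesg).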
In my latest attempt I used the ELRepo
repository to install the driver, so the result of
yum list nvidia-* is:
Installed Packages:
nvidia-x11-drv.x86_64 346.59-1.el7.elrepo
Available Packages
nvidia-detect.x86_64 346.59-1.el7.elrepo
nvidia-x11-drv-304xx.x86_64 304.125-1.el7.elrepo
nvidia-x11-drv-304xx-32bit.x86_64 304.125-1.el7.elrepo
nvidia-x11-drv-32bit.x86_64 346.59-1.el7.elrepo
nvidia-x11-drv-340xx.x86_64 340.76-1.el7.elrepo
nvidia-x11-drv-340xx-32bit.x86_64 340.76-1.el7.elrepo
The version of CUDA I'm installing is 7.0.
I've tried both the runfile installer and the package manager method.
The servers have an NVIDIA Tesla K20c.
lspci -v |grep NV:
83:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c]
dmesg |grep NVRM:
[ 2.058067] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 340.87 Thu Mar 19 23:39:02 PDT 2015
[ 1803.786443] NVRM: API mismatch: the client has the version 346.46, but
NVRM: this kernel module has the version 340.87. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version
[ 1803.786454] NVRM: nvidia_frontend_ioctl: minor 255, module->ioctl failed, error -22
dmesg |grep nouv:
[0.0000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.4.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16.rd.lvm.lv=centos/root crashkernel=auto vconsole.keymap-us rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau
[0.0000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.4.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16.rd.lvm.lv=centos/root crashkernel=auto vconsole.keymap-us rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau
So this is a problem:
[code][ 2.058067] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 340.87 Thu Mar 19 23:39:02 PDT 2015
[ 1803.786443] NVRM: API mismatch: the client has the version 346.46, but
NVRM: this kernel module has the version 340.87. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version[/code]
346.46 is coming from the CUDA 7 installer. I'm not sure where 340.87 is
coming from, probably a repo. 340.87 cannot be used with CUDA 7.
You *cannot* mix runfile and repo installation methods.
Going through the data you have presented, I find traces of the
following NVIDIA driver versions:
346.59, 346.46, 340.87
I suggest starting over with a clean install of CentOS 7: switch to
runlevel 3, remove nouveau, and use the CUDA 7 runfile installer (only).
Alternatively, you can study the Linux getting started guide, which
includes tips on how to clean up when switching from one install
method to the other:
[url]http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#handle-uninstallation[/url]
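As a rough way to spot this kind of mismatch without reading the log by eye, the sketch below pulls the driver version numbers out of NVRM kernel messages and reports whether more than one distinct version appears. The function names nvrm_versions and has_version_mismatch are hypothetical, and the pattern only assumes the message wording shown in the log above ("Kernel Module X" and "version X"); other driver releases may phrase these lines differently.

```shell
#!/bin/sh
# Hypothetical helpers; the patterns assume the NVRM wording quoted above.

# Print each distinct driver version mentioned in NVRM lines on stdin.
nvrm_versions() {
    grep 'NVRM' |
        sed -n -E 's/.*(Kernel Module|version) +([0-9]+\.[0-9]+).*/\2/p' |
        sort -u
}

# Succeeds if the NVRM lines on stdin mention more than one driver version.
has_version_mismatch() {
    nvrm_versions | awk 'END { exit !(NR > 1) }'
}

# Typical use as root:
#   dmesg | nvrm_versions
#   dmesg | has_version_mismatch && echo "mixed NVIDIA driver versions detected"
```

On a system with a single, consistent driver install, dmesg | nvrm_versions should print exactly one version. Comparing that against the installed packages (rpm -qa | grep -i nvidia) can help show which install method a stray version came from.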