NVIDIA driver/CUDA installation causes CentOS 7 to hang on boot; unable to access the user interface.
I've been tasked with installing CUDA on some new servers that all came with CentOS 7 installed. I followed the instructions for installing CUDA, and everything goes smoothly until I restart the computer. On restart, a boot log checklist is displayed and the computer hangs there indefinitely. I can get to a command line with Ctrl+Alt+F2, and the good news is that CUDA works properly: the samples compile and run fine. But I can find no way to get the GUI working without uninstalling the NVIDIA driver and switching back to the nouveau driver that came with the system, which breaks CUDA.

#1
Posted 06/11/2015 05:10 PM   
You need to remove all traces of the nouveau driver before installing the NVIDIA driver.

Something like this:

Switch to runlevel 3.

as root:

echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/disable-nouveau.conf
dracut --force

Then reboot into runlevel 3 and run the CUDA 7 runfile installer.
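
Before running the installer it may be worth confirming that the blacklist actually took effect. The following is only a minimal sketch, not part of the official procedure: the helper name nouveau_status is made up for illustration, and it simply looks for a nouveau entry in lsmod-style output.

```shell
#!/bin/sh
# Sketch (hypothetical helper): read lsmod-style output on stdin and
# report whether a module named exactly "nouveau" is listed.
nouveau_status() {
    # Only the first column (the module name) is checked, so modules that
    # merely list nouveau as a user (e.g. ttm) do not count.
    if awk '$1 == "nouveau" { found = 1 } END { exit !found }'; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
}

# Typical use after the reboot:
#   lsmod | nouveau_status
```

If it reports that nouveau is still loaded, the initramfs was probably not rebuilt or the blacklist file was not picked up, and the NVIDIA installer is likely to fail or conflict.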

#2
Posted 06/11/2015 05:16 PM   
Thanks for your help; however, the system still hangs in the same spot.

I uninstalled the driver and CUDA. I added the blacklist using the command you gave me. I found more commands for removing nouveau via Google, including yum remove xorg-x11-drv-nouveau, and ran dracut --force. After all this and rerunning the installers, I have the same issue.

edit:
Looking at the boot log more closely: before, the boot would always hang after a different [ OK ] printout. But now the following line, which I haven't seen before, is always displayed:
[* ] A start job is running for Wait for Plymouth Boot Screen to Quit

#3
Posted 06/11/2015 06:44 PM   
I don't know the history of the system(s), and you haven't provided any logs or indicated where in the boot process it is hanging. It's possible that there are other conflicting NVIDIA components.

Run all of the following as root.

What is the result of running

yum list nvidia-*

Which version of CUDA are you trying to install?
Are you using a runfile installer, or a package manager method?
Do you have an nvidia GPU? Which one?

What is the result of running:

lspci -v |grep NV

What is the result of running:

dmesg |grep NVRM

and

dmesg |grep nouv

#4
Posted 06/11/2015 06:54 PM   
For my latest attempt I used the ELRepo repository to install the driver, so the result of yum list nvidia-* is:
Installed Packages:
nvidia-x11-drv.x86_64 346.59-1.el7.elrepo
Available Packages
nvidia-detect.x86_64 346.59-1.el7.elrepo
nvidia-x11-drv-304xx.x86_64 304.125-1.el7.elrepo
nvidia-x11-drv-304xx-32bit.x86_64 304.125-1.el7.elrepo
nvidia-x11-drv-32bit.x86_64 346.59-1.el7.elrepo
nvidia-x11-drv-340xx.x86_64 340.76-1.el7.elrepo
nvidia-x11-drv-340xx-32bit.x86_64 340.76-1.el7.elrepo

The version of CUDA I'm installing is 7.0

I've tried to use both the runfile installer and the package manager method.

The servers have an NVIDIA Tesla K20c.

lspci -v |grep NV:
83:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c]

dmesg |grep NVRM:
[ 2.058067] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 340.87 Thu Mar 19 23:39:02 PDT 2015
[ 1803.786443] NVRM: API mismatch: the client has the version 346.46, but
NVRM: this kernel module has the version 340.87. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version
[ 1803.786454] NVRM: nvidia_frontend_ioctl: minor 255, module->ioctl failed, error -22

dmesg |grep nouv:
[0.0000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.4.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16.rd.lvm.lv=centos/root crashkernel=auto vconsole.keymap-us rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau
[0.0000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-229.4.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16.rd.lvm.lv=centos/root crashkernel=auto vconsole.keymap-us rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau

#5
Posted 06/11/2015 07:23 PM   
So this is a problem:

[ 2.058067] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 340.87 Thu Mar 19 23:39:02 PDT 2015
[ 1803.786443] NVRM: API mismatch: the client has the version 346.46, but
NVRM: this kernel module has the version 340.87. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version

346.46 is coming from the CUDA 7 installer. I'm not sure where 340.87 is coming from; probably a repo. 340.87 cannot be used with CUDA 7.

You *cannot* mix runfile and repo installation methods.

When I cull through the data you have presented, I find elements of the following nvidia drivers:

346.59, 346.46, 340.87
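
A quick way to spot this kind of mix is to pull the version numbers out of the NVRM lines in dmesg. This is just a rough sketch: the function names nvrm_versions and nvrm_check are invented for illustration, and the patterns assume the message wording shown in the log above.

```shell
#!/bin/sh
# Sketch (hypothetical helpers): extract driver versions from NVRM log
# lines on stdin and flag the case where they disagree.

# Print "kernel-module <ver>" and "client <ver>" lines, de-duplicated.
nvrm_versions() {
    sed -n \
        -e 's/.*Kernel Module *\([0-9][0-9.]*\).*/kernel-module \1/p' \
        -e 's/.*client has the version \([0-9][0-9.]*\).*/client \1/p' |
        sort -u
}

# Print the versions found, plus a warning if more than one distinct
# version number appears.
nvrm_check() {
    versions=$(nvrm_versions)
    count=$(printf '%s\n' "$versions" |
        awk 'NF && !seen[$2]++ { n++ } END { print n + 0 }')
    printf '%s\n' "$versions"
    if [ "$count" -gt 1 ]; then
        echo "MISMATCH: more than one driver version in use"
    fi
}

# Typical use (as root):
#   dmesg | grep NVRM | nvrm_check
```

On the log posted above this would list kernel module 340.87 and client 346.46 and print the mismatch warning, which is the situation that has to be cleaned up before CUDA 7 can work.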

I suggest starting over with a clean install of CentOS 7: switch to runlevel 3, remove nouveau, and use the CUDA 7 runfile installer (only).

Alternatively, you can study the Linux Getting Started Guide, which includes tips about how to clean up when switching from one install method to the other:

http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#handle-uninstallation
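
If you go the cleanup route, it can help to first list what each install method left behind. The sketch below only prints candidate commands and never runs them. The function name plan_cleanup is made up, and the two uninstaller paths in the usage comment are assumptions based on a default CUDA 7.0 runfile install, so check them against the guide above before running anything.

```shell
#!/bin/sh
# Sketch (hypothetical helper): print, but do not run, cleanup commands.
# stdin: installed package names, e.g. from `rpm -qa`.
# args:  candidate uninstaller paths left behind by runfile installs.
plan_cleanup() {
    # Runfile leftovers: only suggest uninstallers that actually exist.
    for tool in "$@"; do
        [ -x "$tool" ] && echo "run: $tool"
    done
    # Repo leftovers: packages whose names start with nvidia, cuda,
    # or kmod-nvidia.
    grep -i -E '^(nvidia|cuda|kmod-nvidia)' | while read -r pkg; do
        echo "run: yum remove $pkg"
    done
    return 0
}

# Typical use (as root); both paths are assumed defaults:
#   rpm -qa | plan_cleanup /usr/bin/nvidia-uninstall \
#       /usr/local/cuda-7.0/bin/uninstall_cuda_7.0.pl
```

If it prints both runfile uninstallers and yum packages, the two install methods have been mixed and both sets should be removed before reinstalling with a single method.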

#6
Posted 06/11/2015 07:34 PM   