NVIDIA GPU Node Driver Installation and Troubleshooting

Posted by 董江 on Monday, December 2, 2024

Prerequisites

  • CentOS Linux release 7.9.2009 (Core)
  • Kernel 5.4.x
  • Kubernetes >= 1.10
  • GCC >= 9.3.1
  • NVIDIA driver >= 384.81
  • Runtime support: nvidia-container-toolkit >= 1.7.0
  • CUDA version matching the NVIDIA driver
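
A quick pre-flight sketch of the checks these requirements imply (assuming gcc and kubectl are already on the PATH; kubectl flag availability varies by version):

$ uname -r                  # kernel version, expect 5.4.x after the upgrade below
$ gcc --version             # expect >= 9.3.1 after the GCC upgrade below
$ kubectl version --short   # expect server version >= 1.10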

GPU Node Update Steps

Step 1. Confirm the GPU cards are visible

$ lspci | grep -i nvidia
3b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

Use the PCI device ID to look up the exact model, which determines the driver to install. Lookup address: https://admin.pci-ids.ucw.cz/mods/PC/10de?action=help?help=pci
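
The vendor:device pair that the database keys on can also be read locally; 10de is NVIDIA's vendor ID:

$ lspci -nn | grep -i nvidia    # the trailing [10de:xxxx] pair identifies the exact chip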

Step 2. Upgrade the kernel

The default kernel in the CentOS 7.9 image is 3.10. This old kernel has known security issues and lacks required libraries, so upgrade the CentOS 7.9 kernel to kernel-lt-5.4.210-1.el7.

Kernel preparation

Three kernel packages are needed: kernel-lt, kernel-lt-devel, and kernel-lt-headers.

Download addresses:

  • http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS/kernel-lt-headers-5.4.210-1.el7.elrepo.x86_64.rpm
  • http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS/kernel-lt-devel-5.4.210-1.el7.elrepo.x86_64.rpm
  • http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS/kernel-lt-5.4.210-1.el7.elrepo.x86_64.rpm
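
A minimal download sketch fetching all three packages (URLs as listed above):

$ for pkg in kernel-lt kernel-lt-devel kernel-lt-headers; do
    wget "http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS/${pkg}-5.4.210-1.el7.elrepo.x86_64.rpm"
  done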

Replace the kernel

$ cat /proc/version
Linux version 3.10.0-1160.92.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Jun 20 11:48:01 UTC 2023

Both the kernel version and the GCC version shown here fail the requirements.

Install the new kernel:

$ rpm -ivh kernel-lt-5.4.210-1.el7.elrepo.x86_64.rpm
...
$ rpm -ivh kernel-lt-devel-5.4.210-1.el7.elrepo.x86_64.rpm
...
$ rpm -ivh kernel-lt-headers-5.4.210-1.el7.elrepo.x86_64.rpm
...

# Confirm the kernel boot entries
$ grep ^menuentry /etc/grub2.cfg | cut -f 2 -d \'
menuentry 'CentOS Linux (5.4.210-1.el7.elrepo.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.4.210-1.el7.elrepo.x86_64-advanced-1a7fd6ea-e743-4f01-a2a8-7c111d5f82e6' {
menuentry 'CentOS Linux (0-rescue-2b0df959bee44a0d81afd6de8565a3d0) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-0-rescue-2b0df959bee44a0d81afd6de8565a3d0-advanced-1a7fd6ea-e743-4f01-a2a8-7c111d5f82e6' {
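
# Make the new kernel the default boot entry before regenerating the
# config (index 0 assumes the 5.4.210 entry is listed first, as in the
# menuentry output above)
$ grub2-set-default 0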

# Regenerate the GRUB config and reboot
$ grub2-mkconfig -o /boot/grub2/grub.cfg && sleep 2 && reboot

# Confirm the kernel after the reboot
$ uname -a
Linux node-172-30-214-25 5.4.210-1.el7.elrepo.x86_64 #1 SMP Tue Aug 9 17:41:27 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -qa | grep kernel
kernel-lt-devel-5.4.210-1.el7.elrepo.x86_64
kernel-lt-headers-5.4.210-1.el7.elrepo.x86_64
abrt-addon-kerneloops-2.1.11-60.el7.centos.x86_64
kernel-lt-5.4.210-1.el7.elrepo.x86_64

Upgrade GCC to 9.3.1

Add the GCC yum repository

$ cat CentOS-SCLo-rh.repo 
[centos-sclo-rh]
name=CentOS-7 - SCLo rh
baseurl=http://mirrors.aliyun.com/centos/7/sclo/$basearch/rh/
gpgcheck=1
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo

[centos-sclo-rh-testing]
name=CentOS-7 - SCLo rh Testing
baseurl=http://mirrors.aliyun.com/centos/7/sclo/$basearch/rh/
gpgcheck=0
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo

[centos-sclo-rh-source]
name=CentOS-7 - SCLo rh Sources
baseurl=http://vault.centos.org/centos/7/sclo/Source/rh/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo

[centos-sclo-rh-debuginfo]
name=CentOS-7 - SCLo rh Debuginfo
baseurl=http://debuginfo.centos.org/centos/7/sclo/$basearch/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo

$ yum makecache
...
$ yum -y install centos-release-scl
$ yum -y install devtoolset-9-gcc devtoolset-9-gcc-c++ devtoolset-9-binutils
$ scl enable devtoolset-9 bash
$ echo "source /opt/rh/devtoolset-9/enable" >>/etc/profile
$ source /etc/profile

# check gcc version
$ gcc --version
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Disable the default GPU driver (nouveau)

# Check whether the nouveau module is loaded
$ lsmod | grep nouveau
nouveau              1949696  0
mxm_wmi                16384  1 nouveau
video                  49152  1 nouveau
i2c_algo_bit           16384  2 mgag200,nouveau
ttm                   106496  2 drm_vram_helper,nouveau
drm_kms_helper        184320  4 mgag200,nouveau
drm                   491520  6 drm_kms_helper,drm_vram_helper,mgag200,ttm,nouveau
wmi                    32768  5 wmi_bmof,dell_smbios,dell_wmi_descriptor,mxm_wmi,nouveau

$ cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
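
# Rebuild the initramfs so the blacklist takes effect at boot
# (nouveau is typically bundled in the stock CentOS initramfs)
$ sudo dracut --force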

$ reboot

# The output should now be empty
$ lsmod | grep nouveau

For other systems, see: https://docs.nvidia.com/ai-enterprise/deployment/vmware/latest/nouveau.html

Install the NVIDIA driver

  1. Open the NVIDIA driver lookup page, https://www.nvidia.cn/drivers/lookup/ , and find the matching driver

  2. Install the driver

$ chmod +x NVIDIA-Linux-x86_64-515.105.01.run && sh NVIDIA-Linux-x86_64-515.105.01.run --ui=none --no-questions
  3. Verify the driver is working
$ nvidia-smi
Tue Dec  3 18:51:32 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   39C    P8    16W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   56C    P0    40W /  70W |      2MiB / 15360MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  4. Download the CUDA toolkit: https://developer.nvidia.com/cuda-11-7-0-download-archive

  5. Install CUDA

sh cuda_11.7.0_515.43.04_linux.run
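
The installer places the toolkit under /usr/local/cuda-11.7 by default; a minimal environment setup afterwards, assuming that default prefix:

$ echo 'export PATH=/usr/local/cuda-11.7/bin:$PATH' >> /etc/profile
$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH' >> /etc/profile
$ source /etc/profile
$ nvcc --version    # should report release 11.7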

Install nvidia-container-toolkit

  1. Add the yum repository
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
  2. Install the NVIDIA Container Toolkit packages
$ sudo yum install -y nvidia-container-toolkit
  3. Configure containerd; this updates /etc/containerd/config.toml on the host
$ sudo nvidia-ctk runtime configure --runtime=containerd
  4. Confirm that nvidia-container-runtime is set as the default runtime
$ cat /etc/containerd/config.toml
...
[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"

  [plugins."io.containerd.grpc.v1.cri"]
    disable_apparmor = false
    max_concurrent_downloads = 20
    max_container_log_line_size = -1
    sandbox_image = "sea.hub:5000/pause:3.9"

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia" # the default runtime
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2" # runtime type

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2" # runtime type

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
...
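
Restart containerd so the new default runtime takes effect:

$ sudo systemctl restart containerd
$ sudo systemctl status containerd    # confirm the service came back up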

Deploy the NVIDIA k8s-device-plugin

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

$  kubectl get ds -A | grep nvidia
kube-system                   nvidia-device-plugin-daemonset            19        19        19      19           19          <none>                   31h

$ kubectl describe node node-172-30-214-25
Name:               node-172-30-214-25
...
Capacity:
  cpu:                72
  ephemeral-storage:  232734692Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131496188Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                72
  ephemeral-storage:  214488291793
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131393788Ki
  nvidia.com/gpu:     2
  pods:               110
...
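
A compact cluster-wide view of the same resource (dots in the extended resource name must be escaped in the custom-columns path):

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'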

Verification

Deploy a CUDA sample workload:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-vectoradd
spec:
  schedulerName: volcano
  containers:
    - name: cuda-vectoradd
      image: artifacts.iflytek.com/docker-private/cloudnative/cuda/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      command:
        - /bin/bash
        - '-c'
        - '--'
      args:
        - while true; do /cuda-samples/vectorAdd; done
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
$ kubectl apply -f xx.yaml
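
To confirm the sample is actually exercising the GPU, check the pod and its logs; the vectorAdd sample prints a "Test PASSED" line on each successful run:

$ kubectl get pod gpu-vectoradd        # expect STATUS Running
$ kubectl logs gpu-vectoradd | head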

Result

$  nvidia-smi 
Tue Dec  3 19:09:08 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   39C    P8    16W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   55C    P0    38W /  70W |     14MiB / 15360MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     89206      C   /cuda-samples/vectorAdd            10MiB |
+-----------------------------------------------------------------------------+

Additional information

GPU PCIe interconnect topology on the node

$ nvidia-smi topo --matrix   
	   GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	0-17,36-53	0
GPU1	SYS	 X 	18-35,54-71	1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
