NVIDIA GPU Node Driver Installation and Troubleshooting
Prerequisites
- CentOS Linux release 7.9.2009 (Core)
- Kernel 5.4.x
- Kubernetes >= 1.10
- GCC >= 9.3.1
- NVIDIA driver >= 384.81
- Runtime support: nvidia-container-toolkit >= 1.7.0
- CUDA version matching the installed NVIDIA driver
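A quick sketch to check a node's current state against these prerequisites (commands assume a stock CentOS 7 node):
$ cat /etc/redhat-release     # OS release
$ uname -r                    # running kernel
$ gcc --version | head -1     # GCC version
$ kubectl version --short     # Kubernetes version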
GPU Node Update Steps
Step 1. Confirm the GPU cards are detected
$ lspci | grep -i nvidia
3b:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
Look up the exact card model in the PCI ID database; this is needed to pick the right driver later: https://admin.pci-ids.ucw.cz/mods/PC/10de?action=help?help=pci
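If you need the numeric PCI vendor/device ID pair for that lookup, lspci can print it inline (illustrative output; the [10de:xxxx] pair is what the database indexes on):
$ lspci -nn | grep -i nvidia
3b:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)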
Step 2. Upgrade the Kernel
The CentOS 7.9 image ships with kernel 3.10 by default. This old kernel is less secure and lacks required libraries, so upgrade it to kernel-lt-5.4.210-1.el7.
Kernel preparation
Three kernel packages are required: the kernel itself, devel, and headers.
Download links:
http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS/kernel-lt-headers-5.4.210-1.el7.elrepo.x86_64.rpm
http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS/kernel-lt-devel-5.4.210-1.el7.elrepo.x86_64.rpm
http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS/kernel-lt-5.4.210-1.el7.elrepo.x86_64.rpm
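A short loop to fetch all three RPMs (a sketch; assumes the mirror above is reachable from the node):
$ base=http://mirrors.coreix.net/elrepo-archive-archive/kernel/el7/x86_64/RPMS
$ for pkg in kernel-lt kernel-lt-devel kernel-lt-headers; do wget "${base}/${pkg}-5.4.210-1.el7.elrepo.x86_64.rpm"; done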
Replace the kernel
$ cat /proc/version
Linux version 3.10.0-1160.92.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Jun 20 11:48:01 UTC 2023
Here both the kernel version and the GCC version fail to meet the requirements.
Install the new kernel:
$ rpm -ivh kernel-lt-5.4.210-1.el7.elrepo.x86_64.rpm
...
$ rpm -ivh kernel-lt-devel-5.4.210-1.el7.elrepo.x86_64.rpm
...
$ rpm -ivh kernel-lt-headers-5.4.210-1.el7.elrepo.x86_64.rpm
...
# Confirm the kernel boot entries
$ grep ^menuentry /etc/grub2.cfg | cut -f 2 -d \'
menuentry 'CentOS Linux (5.4.210-1.el7.elrepo.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.4.210-1.el7.elrepo.x86_64-advanced-1a7fd6ea-e743-4f01-a2a8-7c111d5f82e6' {
menuentry 'CentOS Linux (0-rescue-2b0df959bee44a0d81afd6de8565a3d0) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-0-rescue-2b0df959bee44a0d81afd6de8565a3d0-advanced-1a7fd6ea-e743-4f01-a2a8-7c111d5f82e6' {
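If the 5.4 entry is not already the default, it can be pinned explicitly before rebooting (a hedged extra step; index 0 assumes the new kernel is the first menuentry, as in the output above):
$ grub2-set-default 0
$ grub2-editenv list
saved_entry=0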
# Rebuild the GRUB config and reboot into the new kernel
$ grub2-mkconfig -o /boot/grub2/grub.cfg && sleep 2 && reboot
# Confirm the kernel after the reboot
$ uname -a
Linux node-172-30-214-25 5.4.210-1.el7.elrepo.x86_64 #1 SMP Tue Aug 9 17:41:27 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
$ rpm -qa | grep kernel
kernel-lt-devel-5.4.210-1.el7.elrepo.x86_64
kernel-lt-headers-5.4.210-1.el7.elrepo.x86_64
abrt-addon-kerneloops-2.1.11-60.el7.centos.x86_64
kernel-lt-5.4.210-1.el7.elrepo.x86_64
Upgrade GCC to 9.3.1
Add the GCC yum repository
$ cat /etc/yum.repos.d/CentOS-SCLo-rh.repo
[centos-sclo-rh]
name=CentOS-7 - SCLo rh
baseurl=http://mirrors.aliyun.com/centos/7/sclo/$basearch/rh/
gpgcheck=1
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
[centos-sclo-rh-testing]
name=CentOS-7 - SCLo rh Testing
baseurl=http://mirrors.aliyun.com/centos/7/sclo/$basearch/rh/
gpgcheck=0
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
[centos-sclo-rh-source]
name=CentOS-7 - SCLo rh Sources
baseurl=http://vault.centos.org/centos/7/sclo/Source/rh/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
[centos-sclo-rh-debuginfo]
name=CentOS-7 - SCLo rh Debuginfo
baseurl=http://debuginfo.centos.org/centos/7/sclo/$basearch/
gpgcheck=1
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-SCLo
$ yum makecache
...
$ yum -y install centos-release-scl
$ yum -y install devtoolset-9-gcc devtoolset-9-gcc-c++ devtoolset-9-binutils
$ scl enable devtoolset-9 bash
$ echo "source /opt/rh/devtoolset-9/enable" >>/etc/profile
$ source /etc/profile
# check gcc version
$ gcc --version
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Disable the Default GPU Driver (nouveau)
# Check whether the nouveau module is loaded
$ lsmod | grep nouveau
nouveau 1949696 0
mxm_wmi 16384 1 nouveau
video 49152 1 nouveau
i2c_algo_bit 16384 2 mgag200,nouveau
ttm 106496 2 drm_vram_helper,nouveau
drm_kms_helper 184320 4 mgag200,nouveau
drm 491520 6 drm_kms_helper,drm_vram_helper,mgag200,ttm,nouveau
wmi 32768 5 wmi_bmof,dell_smbios,dell_wmi_descriptor,mxm_wmi,nouveau
$ cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
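Note: nouveau may also be packed into the initramfs, in which case the blacklist alone will not keep it from loading at boot. Rebuilding the initramfs first is a commonly needed extra step (not part of the original procedure):
$ sudo dracut --force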
$ reboot
# The output should now be empty
$ lsmod | grep nouveau
For other distributions, see: https://docs.nvidia.com/ai-enterprise/deployment/vmware/latest/nouveau.html
Install the NVIDIA Driver
- Open the NVIDIA driver lookup page and find the exact matching driver: https://www.nvidia.cn/drivers/lookup/
- Install the driver
$ chmod +x NVIDIA-Linux-x86_64-515.105.01.run && sh NVIDIA-Linux-x86_64-515.105.01.run --ui=none --no-questions
- Verify the driver works
$ nvidia-smi
Tue Dec  3 18:51:32 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   39C    P8    16W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   56C    P0    40W /  70W |      2MiB / 15360MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- Download the CUDA toolkit: https://developer.nvidia.com/cuda-11-7-0-download-archive
- Install CUDA
$ sh cuda_11.7.0_515.43.04_linux.run
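Since the driver is already installed from the .run file above, a toolkit-only CUDA install is usually what you want; the runfile supports silent flags for this (a sketch using the runfile installer's standard options):
$ sh cuda_11.7.0_515.43.04_linux.run --silent --toolkit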
Install nvidia-container-toolkit
- Add the yum repository
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
- Install the NVIDIA Container Toolkit packages
$ sudo yum install -y nvidia-container-toolkit
- Configure containerd by modifying /etc/containerd/config.toml on the host
$ sudo nvidia-ctk runtime configure --runtime=containerd
- Set nvidia-container-runtime as the default runtime
$ cat /etc/containerd/config.toml
...
[plugins]
  [plugins."io.containerd.gc.v1.scheduler"]
    pause_threshold = 0.02
    deletion_threshold = 0
    mutation_threshold = 100
    schedule_delay = "0s"
    startup_delay = "100ms"
  [plugins."io.containerd.grpc.v1.cri"]
    disable_apparmor = false
    max_concurrent_downloads = 20
    max_container_log_line_size = -1
    sandbox_image = "sea.hub:5000/pause:3.9"
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"  # default runtime
      snapshotter = "overlayfs"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"  # runtime type
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"  # runtime type
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
...
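Restart containerd so the new default runtime takes effect:
$ sudo systemctl restart containerd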
Deploy the NVIDIA k8s-device-plugin
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
$ kubectl get ds -A | grep nvidia
kube-system nvidia-device-plugin-daemonset 19 19 19 19 19 <none> 31h
$ kubectl describe node node-172-30-214-25
Name: node-172-30-214-25
...
Capacity:
  cpu:                72
  ephemeral-storage:  232734692Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131496188Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                72
  ephemeral-storage:  214488291793
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             131393788Ki
  nvidia.com/gpu:     2
  pods:               110
...
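To check GPU capacity across every node at a glance, a one-liner sketch (the dots in nvidia.com/gpu must be escaped for custom-columns):
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:'.status.allocatable.nvidia\.com/gpu'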
Verification
Deploy a CUDA sample Pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-vectoradd
spec:
  schedulerName: volcano
  containers:
    - name: cuda-vectoradd
      image: artifacts.iflytek.com/docker-private/cloudnative/cuda/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      command:
        - /bin/bash
        - '-c'
        - '--'
      args:
        - while true; do /cuda-samples/vectorAdd; done
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
$ kubectl apply -f xx.yaml
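To confirm the sample actually runs on the GPU, check the Pod logs; the vectorAdd sample prints "Test PASSED" on each successful run (output illustrative):
$ kubectl logs gpu-vectoradd | grep -m1 'Test PASSED'
Test PASSED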
Result
$ nvidia-smi
Tue Dec  3 19:09:08 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   39C    P8    16W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   55C    P0    38W /  70W |     14MiB / 15360MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     89206      C   /cuda-samples/vectorAdd            10MiB |
+-----------------------------------------------------------------------------+
Additional Information
GPU PCIe interconnect topology on the node
$ nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-17,36-53      0
GPU1    SYS      X      18-35,54-71     1
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks