Symptom
While running a PyTorch training job on an A800 node, the following call fails:
import torch
torch.cuda.is_available()
It raises:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
However, this call works:
import torch
torch.cuda.device_count() # returns 8, as expected
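When triaging this kind of failure, it can help to pull the numeric CUDA error code out of the exception text and name it. The following is a minimal sketch, not part of the original investigation; the helper names `cuda_error_code` and `describe` are hypothetical, and the lookup table covers only two CUDA runtime codes (802 is `cudaErrorSystemNotReady`, 100 is `cudaErrorNoDevice`).

```python
import re

# Small, deliberately incomplete table of CUDA runtime error codes.
KNOWN_CUDA_ERRORS = {
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
}


def cuda_error_code(message):
    """Return the integer that follows 'Error ' in the message, or None."""
    match = re.search(r"Error (\d+):", message)
    return int(match.group(1)) if match else None


def describe(message):
    """Render the error code together with its symbolic name, if known."""
    code = cuda_error_code(message)
    if code is None:
        return "no CUDA error code found"
    return f"{code} ({KNOWN_CUDA_ERRORS.get(code, 'unknown')})"


# The tail of the message reported in this post.
SAMPLE_MESSAGE = "Error 802: system not yet initialized"
print(describe(SAMPLE_MESSAGE))
```

On NVSwitch-based systems, `cudaErrorSystemNotReady` commonly points at the fabric manager service rather than at PyTorch itself, which is the direction the investigation below takes.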
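Investigation
2.1 First suspicion: an interface deprecated in this torch version
I checked the torch version. cudaGetDeviceCount() is an interface provided by the underlying CUDA runtime, not by torch, so the problem most likely lies in CUDA or the driver stack itself.
However, the nvidia-smi output looks completely normal.
2.2 Check the NVIDIA base libraries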
$ dpkg -l | grep nvidia
ii libnvidia-cfg1-575-server:amd64 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-575-server 575.57.08-0ubuntu0.24.04.2 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-575-server:amd64 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.17.8-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.17.8-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-575-server:amd64 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-egl-wayland1:amd64 1:1.1.13-1build1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-encode-575-server:amd64 575.57.08-0ubuntu0.24.04.2 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-575-server:amd64 575.57.08-0ubuntu0.24.04.2 amd64 Extra libraries for the NVIDIA Server Driver
ii libnvidia-fbc1-575-server:amd64 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-575-server:amd64 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii nvidia-compute-utils-575-server 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA compute utilities
ii nvidia-container-toolkit 1.17.8-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.17.8-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-dkms-575-server-open 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA DKMS package (open kernel module)
ii nvidia-driver-575-server-open 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA driver (open kernel) metapackage
ii nvidia-fabricmanager-575 575.57.08-0ubuntu0.24.04.1 amd64 Fabric Manager for NVSwitch based systems.
ii nvidia-firmware-575-server-575.57.08 575.57.08-0ubuntu0.24.04.2 amd64 Firmware files used by the kernel module
ii nvidia-headless-no-dkms-575-server-open 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA headless metapackage - no DKMS (open kernel module)
ii nvidia-kernel-common-575-server 575.57.08-0ubuntu0.24.04.2 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-575-server-open 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA kernel source package
ii nvidia-utils-575-server 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA Server Driver support binaries
ii xserver-xorg-video-nvidia-575-server 575.57.08-0ubuntu0.24.04.2 amd64 NVIDIA binary Xorg driver
$ systemctl | grep nvidia
sys-bus-pci-drivers-nvidia.device loaded active plugged /sys/bus/pci/drivers/nvidia
● nvidia-fabricmanager.service loaded failed failed NVIDIA fabric manager service
nvidia-persistenced.service loaded active running NVIDIA Persistence Daemon
Here nvidia-fabricmanager is in a failed state.
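The same check can be scripted so that a failed unit is not overlooked in a long listing. This is a rough sketch assuming the column layout shown above (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION); `failed_units` is a hypothetical helper, and the sample text is copied from the output in this post.

```python
def failed_units(systemctl_output):
    """Return unit names whose ACTIVE column reads 'failed'."""
    failed = []
    for line in systemctl_output.splitlines():
        # systemd prefixes problem units with a marker glyph; drop it first.
        fields = line.replace("●", " ").replace("×", " ").split()
        # Expected columns: UNIT LOAD ACTIVE SUB DESCRIPTION...
        if len(fields) >= 4 and fields[2] == "failed":
            failed.append(fields[0])
    return failed


SAMPLE_SYSTEMCTL = """\
sys-bus-pci-drivers-nvidia.device loaded active plugged /sys/bus/pci/drivers/nvidia
● nvidia-fabricmanager.service loaded failed failed NVIDIA fabric manager service
nvidia-persistenced.service loaded active running NVIDIA Persistence Daemon
"""

print(failed_units(SAMPLE_SYSTEMCTL))
```

In practice the input would come from running `systemctl` and capturing its output, for example with `subprocess.run`.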
2.3 Check nvidia-fabricmanager and the IB driver
$ systemctl status nvidia-fabricmanager.service -l
× nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2025-12-02 10:35:32 CST; 1min 3s ago
Process: 105349 ExecStart=/usr/bin/nvidia-fabricmanager-start.sh --mode start (code=exited, status=1/FAILURE)
CPU: 39.371s
Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:42:00.0 enumIndex:1 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:67:00.0 enumIndex:3 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nvidia-fabricmanager-start.sh[105361]: global fabric manager initialization failed; `More than one NVSwitches have t>
Dec 02 10:35:32 node-10-104-122-59 nvidia-fabricmanager-start.sh[105349]: "/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg" f>
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: Failed to start NVIDIA fabric manager service.
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: nvidia-fabricmanager.service: Consumed 39.371s CPU time.
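The service fails to start, and NVLink initialization fails for several GPUs. Every line reports NodeId:0, which looked suspicious at first, although on a single-node system that value by itself may well be expected.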
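To see at a glance which GPUs are affected, the failure lines can be grouped by PCI bus id. The sketch below assumes the log wording shown above; `failed_gpus` is a hypothetical helper, and the sample lines are shortened copies of the log entries in this post (the text after `enumIndex` is truncated in the original and is left out).

```python
import re

# Matches the fabric manager failure lines quoted in this post.
FAILED_RE = re.compile(
    r"NVLink initialization failed for NodeId:(\d+) "
    r"GPU PCI bus id:(\S+) enumIndex:(\d+)"
)


def failed_gpus(log_text):
    """Map each PCI bus id to the number of failure lines that mention it."""
    counts = {}
    for _node_id, bus_id, _enum_index in FAILED_RE.findall(log_text):
        counts[bus_id] = counts.get(bus_id, 0) + 1
    return counts


SAMPLE_LOG = """\
NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:42:00.0 enumIndex:1
NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2
NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2
NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:67:00.0 enumIndex:3
"""

for bus_id, count in sorted(failed_gpus(SAMPLE_LOG).items()):
    print(bus_id, count)
```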
2.4 Check the kernel modules
$ lsmod | grep nvidia
nvidia_uvm 1744896 4
nvidia_drm 110592 0
nvidia_modeset 1540096 1 nvidia_drm
nvidia 90492928 267 nvidia_uvm,nvidia_modeset
The nvidia_nvlink module is missing.
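Comparing the loaded modules against an expected set makes this kind of gap explicit. A small sketch follows, assuming the standard `lsmod` column layout (module name first); the helper names and the `required` list are illustrative choices, not something prescribed by the driver.

```python
def loaded_modules(lsmod_output):
    """Return the set of module names found in the first column of lsmod output."""
    modules = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        # Skip blank lines and the 'Module Size Used by' header.
        if fields and fields[0] != "Module":
            modules.add(fields[0])
    return modules


def missing_modules(lsmod_output, required):
    """Return the required module names that are not loaded, in input order."""
    loaded = loaded_modules(lsmod_output)
    return [name for name in required if name not in loaded]


SAMPLE_LSMOD = """\
nvidia_uvm 1744896 4
nvidia_drm 110592 0
nvidia_modeset 1540096 1 nvidia_drm
nvidia 90492928 267 nvidia_uvm,nvidia_modeset
"""

print(missing_modules(SAMPLE_LSMOD, ["nvidia", "nvidia_uvm", "nvidia_nvlink"]))
```

Matching on the first column avoids false positives from the "Used by" column, which a plain `grep` would also match.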
Solution
3.1 Final verification checklist to confirm that nvidia_nvlink is working:
- The module is loaded
$ lsmod | grep nvidia_nvlink
# Must print something like: nvidia_nvlink 163840 0
- The module file exists (and matches the running kernel)
$ ls /lib/modules/$(uname -r)/kernel/drivers/video/nvidia/nvidia_nvlink.ko
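# Must print the file path with no "No such file or directory" error (DKMS installs may place the module elsewhere, for example under updates/dkms/)
- NVLink link status is normal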
$ nvidia-smi nvlink --status
# Most links should show 25 GB/s (the standard A800 rate), without a large number of `inactive` entries
- CUDA sees all GPUs
import torch
print(torch.cuda.device_count()) # must return 8
- The multi-GPU NVLink communication test passes
import torch
torch.cuda.set_device(0)
x = torch.randn(1024, 1024).cuda()
y = x.cuda(1) # copy to GPU 1 (goes over NVLink when peer access is available)
print("NVLink communication succeeded!")
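The link-status item in the checklist above can be turned into a count of active and inactive links. This sketch assumes that `nvidia-smi nvlink --status` prints one `Link N: ...` line per link and marks dead links with the word `inactive`; the exact format varies between driver versions, so treat the regular expression as an assumption. `summarize_links` is a hypothetical helper, and the sample text is illustrative rather than captured from the node in this post.

```python
import re

# One line per link, e.g. "Link 0: 25 GB/s" or "Link 3: <inactive>".
LINK_RE = re.compile(r"^\s*Link (\d+):\s*(.+?)\s*$")


def summarize_links(status_text):
    """Return (active_count, inactive_count) over all 'Link N:' lines."""
    active = 0
    inactive = 0
    for line in status_text.splitlines():
        match = LINK_RE.match(line)
        if not match:
            continue  # GPU header lines and anything else are ignored
        if "inactive" in match.group(2).lower():
            inactive += 1
        else:
            active += 1
    return active, inactive


# Illustrative sample only; not real output from the affected node.
SAMPLE_STATUS = """\
GPU 0: NVIDIA A800-SXM4-80GB
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: <inactive>
"""

active, inactive = summarize_links(SAMPLE_STATUS)
print(f"active={active} inactive={inactive}")
```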
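3.2 Last-resort fallback if nvidia_nvlink is still missing (for the A800)
If the earlier apt installation still did not produce the module, install directly from the official NVIDIA driver package (making sure NVLink support is enabled):
# 1. Completely remove all leftover driver packages and files
sudo apt-get purge -y 'nvidia-*' 'cuda-*'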
sudo rmmod nvidia 2>/dev/null
sudo rm -rf /lib/modules/$(uname -r)/kernel/drivers/video/nvidia/
sudo depmod -a
sudo reboot
# 2. Download the driver used for the A800 (570.195.03)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/570.195.03/NVIDIA-Linux-x86_64-570.195.03.run
chmod +x NVIDIA-Linux-x86_64-570.195.03.run
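# 3. Install unattended (DKMS + NVLink, skipping unrelated components).
# Run the installer with --advanced-options first to confirm it accepts every flag below; --enable-nvlink in particular may not exist in all installer versions.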
sudo sh NVIDIA-Linux-x86_64-570.195.03.run \
--dkms \
--enable-nvlink \
--no-x-check \
--no-nouveau-check \
--no-opengl-files \
--silent
# 4. Verify that the module loads
sudo modprobe nvidia_nvlink
lsmod | grep nvidia_nvlink