AI: NVIDIA A800 NVLink failure


Posted by 董江 on Tuesday, December 2, 2025

Symptoms

While running a PyTorch training job on one A800 node, the following call:

import torch
torch.cuda.is_available()

fails with:

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized

However, calling

import torch
torch.cuda.device_count()  # returns 8, as expected
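
The numeric code at the end of the message is the most useful part. To my knowledge, code 802 corresponds to `cudaErrorSystemNotReady` in the CUDA runtime, which indicates that the system is not ready for CUDA work, for example because a required driver daemon is not running. A minimal sketch for extracting the code from the message so it can be logged or matched in a health check; `parse_cuda_error` is a hypothetical helper name, not a PyTorch API:

```python
import re

# Matches the trailing "Error <code>: <description>" part of the message.
_CUDA_ERROR_RE = re.compile(r"Error (\d+):\s*(.+?)\s*$")


def parse_cuda_error(message):
    """Return (code, description) from a CUDA error message, or None."""
    match = _CUDA_ERROR_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


sample = (
    "Unexpected error from cudaGetDeviceCount(). Did you run some cuda "
    "functions before calling NumCudaDevices() that might have already set "
    "an error? Error 802: system not yet initialized"
)
print(parse_cuda_error(sample))
```

In a real job the message would come from `str(exc)` inside an `except RuntimeError as exc:` block around the first CUDA call.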

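Investigation

2.1 First suspicion: was the interface deprecated in this torch version?

The torch version checked out: cudaGetDeviceCount() is an interface provided by the underlying CUDA runtime rather than by torch, so the problem most likely lies with CUDA itself.

Yet the nvidia-smi output looked entirely normal.

2.2 Check the NVIDIA base packages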

$ dpkg -l | grep nvidia
ii  libnvidia-cfg1-575-server:amd64         575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-575-server             575.57.08-0ubuntu0.24.04.2                    all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-575-server:amd64      575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA libcompute package
ii  libnvidia-container-tools               1.17.8-1                                      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64              1.17.8-1                                      amd64        NVIDIA container runtime library
ii  libnvidia-decode-575-server:amd64       575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64            1:1.1.13-1build1                              amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-575-server:amd64       575.57.08-0ubuntu0.24.04.2                    amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-575-server:amd64        575.57.08-0ubuntu0.24.04.2                    amd64        Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-575-server:amd64         575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-575-server:amd64           575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  nvidia-compute-utils-575-server         575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA compute utilities
ii  nvidia-container-toolkit                1.17.8-1                                      amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base           1.17.8-1                                      amd64        NVIDIA Container Toolkit Base
ii  nvidia-dkms-575-server-open             575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA DKMS package (open kernel module)
ii  nvidia-driver-575-server-open           575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA driver (open kernel) metapackage
ii  nvidia-fabricmanager-575                575.57.08-0ubuntu0.24.04.1                    amd64        Fabric Manager for NVSwitch based systems.
ii  nvidia-firmware-575-server-575.57.08    575.57.08-0ubuntu0.24.04.2                    amd64        Firmware files used by the kernel module
ii  nvidia-headless-no-dkms-575-server-open 575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA headless metapackage - no DKMS (open kernel module)
ii  nvidia-kernel-common-575-server         575.57.08-0ubuntu0.24.04.2                    amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-575-server-open    575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA kernel source package
ii  nvidia-utils-575-server                 575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA Server Driver support binaries
ii  xserver-xorg-video-nvidia-575-server    575.57.08-0ubuntu0.24.04.2                    amd64        NVIDIA binary Xorg driver

$ systemctl | grep nvidia
  sys-bus-pci-drivers-nvidia.device                                                                    loaded active plugged   /sys/bus/pci/drivers/nvidia
● nvidia-fabricmanager.service                                                                         loaded failed failed    NVIDIA fabric manager service
  nvidia-persistenced.service                                                                          loaded active running   NVIDIA Persistence Daemon

The nvidia-fabricmanager service is in a failed state.
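
The same check can be scripted, for example as part of a node health probe. A minimal sketch, assuming the `systemctl` column layout shown above (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION, with an optional leading marker such as `●` on failed units); `failed_units` is a hypothetical helper name:

```python
def failed_units(systemctl_output, keyword="nvidia"):
    """Return the names of units whose ACTIVE column is 'failed'."""
    failed = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        # Failed units are prefixed with a status marker; drop it if present.
        if fields and fields[0] in ("●", "×", "*"):
            fields = fields[1:]
        if len(fields) < 4:
            continue
        unit, active = fields[0], fields[2]
        if keyword in unit and active == "failed":
            failed.append(unit)
    return failed


sample = """\
  sys-bus-pci-drivers-nvidia.device  loaded active plugged  /sys/bus/pci/drivers/nvidia
● nvidia-fabricmanager.service       loaded failed failed   NVIDIA fabric manager service
  nvidia-persistenced.service        loaded active running  NVIDIA Persistence Daemon
"""
print(failed_units(sample))
```

On a live node the input could come from `subprocess.run(["systemctl", "--no-pager"], capture_output=True, text=True).stdout`.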

2.3 Check nvidia-fabricmanager

$ systemctl status nvidia-fabricmanager.service -l
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2025-12-02 10:35:32 CST; 1min 3s ago
    Process: 105349 ExecStart=/usr/bin/nvidia-fabricmanager-start.sh --mode start (code=exited, status=1/FAILURE)
        CPU: 39.371s

Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:42:00.0 enumIndex:1 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:67:00.0 enumIndex:3 NVLi>
Dec 02 10:35:32 node-10-104-122-59 nvidia-fabricmanager-start.sh[105361]: global fabric manager initialization failed; `More than one NVSwitches have t>
Dec 02 10:35:32 node-10-104-122-59 nvidia-fabricmanager-start.sh[105349]: "/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg" f>
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: Failed to start NVIDIA fabric manager service.
Dec 02 10:35:32 node-10-104-122-59 systemd[1]: nvidia-fabricmanager.service: Consumed 39.371s CPU time.

The service fails to start, and every NVLink initialization failure is reported with NodeId:0, which looked abnormal.
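
To see which GPUs are affected, the failure lines can be reduced to a set of PCI bus ids. A minimal sketch, assuming the log line format shown above; `failed_nvlink_gpus` is a hypothetical helper name, and the sample lines are shortened copies of the log with the truncated tail of each line left out:

```python
import re

_FAIL_RE = re.compile(
    r"NVLink initialization failed for NodeId:(\d+) "
    r"GPU PCI bus id:(\S+) enumIndex:(\d+)"
)


def failed_nvlink_gpus(log_text):
    """Return sorted unique (node_id, pci_bus_id, enum_index) tuples."""
    found = set()
    for match in _FAIL_RE.finditer(log_text):
        node_id, bus_id, enum_index = match.groups()
        found.add((int(node_id), bus_id, int(enum_index)))
    return sorted(found)


sample = """\
nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:42:00.0 enumIndex:1
nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2
nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:61:00.0 enumIndex:2
nv-fabricmanager[105361]: NVLink initialization failed for NodeId:0 GPU PCI bus id:00000000:67:00.0 enumIndex:3
"""
for node_id, bus_id, enum_index in failed_nvlink_gpus(sample):
    print(node_id, bus_id, enum_index)
```

The full, untruncated log text can be obtained with `journalctl -u nvidia-fabricmanager.service --no-pager`.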

2.4 Check the kernel modules

$ lsmod | grep nvidia
nvidia_uvm           1744896  4
nvidia_drm            110592  0
nvidia_modeset       1540096  1 nvidia_drm
nvidia              90492928  267 nvidia_uvm,nvidia_modeset

The nvidia_nvlink module is missing.
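
A minimal sketch of the same check in script form, assuming the `lsmod` layout shown above (module name, size, use count, then an optional list of dependents); `loaded_modules` is a hypothetical helper name:

```python
def loaded_modules(lsmod_output, prefix="nvidia"):
    """Return the set of loaded module names that start with `prefix`."""
    names = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(prefix):
            names.add(fields[0])
    return names


sample = """\
nvidia_uvm           1744896  4
nvidia_drm            110592  0
nvidia_modeset       1540096  1 nvidia_drm
nvidia              90492928  267 nvidia_uvm,nvidia_modeset
"""
modules = loaded_modules(sample)
print(sorted(modules))
print("nvidia_nvlink" in modules)
```

Parsing the first column only avoids false positives from the dependents column, where a module name can also appear.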

Solution

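After the fix, all of the following checks should pass:

  1. The module is loaded
$ lsmod | grep nvidia_nvlink
# Must print something like: nvidia_nvlink         163840  0
  2. The module file exists (and matches the running kernel)
$ ls /lib/modules/$(uname -r)/kernel/drivers/video/nvidia/nvidia_nvlink.ko
# Must print the file path (no "No such file or directory" error)
  3. NVLink link status is healthy
$ nvidia-smi nvlink --status
# Most links report 25 GB/s (the standard A800 rate), with no large number of `inactive` links
  4. CUDA sees all GPUs
import torch
print(torch.cuda.device_count())  # Must return 8
  5. The multi-GPU NVLink transfer test passes
import torch
torch.cuda.set_device(0)
x = torch.randn(1024, 1024).cuda()
y = x.cuda(1)  # Copy to GPU 1 over NVLink
print("NVLink transfer succeeded!")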
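
The link status check is the easiest one to misread by eye on an 8-GPU node, so it can help to count links per GPU. A minimal sketch, assuming that `nvidia-smi nvlink --status` prints one `GPU <n>: ...` header per GPU followed by indented `Link <n>: <rate>` lines, with `<inactive>` in place of the rate for a down link. The sample text is made up for illustration and was not captured from the node in this post; `nvlink_summary` is a hypothetical helper name:

```python
import re

_GPU_RE = re.compile(r"^GPU (\d+):")
_LINK_RE = re.compile(r"^\s*Link (\d+):\s*(.+?)\s*$")


def nvlink_summary(status_output):
    """Map GPU index -> {"active": n, "inactive": m} link counts."""
    summary = {}
    current = None
    for line in status_output.splitlines():
        gpu = _GPU_RE.match(line)
        if gpu:
            current = int(gpu.group(1))
            summary[current] = {"active": 0, "inactive": 0}
            continue
        link = _LINK_RE.match(line)
        if link and current is not None:
            is_inactive = "inactive" in link.group(2).lower()
            summary[current]["inactive" if is_inactive else "active"] += 1
    return summary


# Illustrative sample only (assumed output format, shortened to two links).
sample = """\
GPU 0: NVIDIA A800-SXM4-80GB (UUID: GPU-example-0)
\t Link 0: 25 GB/s
\t Link 1: 25 GB/s
GPU 1: NVIDIA A800-SXM4-80GB (UUID: GPU-example-1)
\t Link 0: 25 GB/s
\t Link 1: <inactive>
"""
print(nvlink_summary(sample))
```

A node would pass this check when every GPU reports zero inactive links, or only the small number the operator considers acceptable.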

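If the earlier apt installation still does not produce the module, install the official NVIDIA driver package directly (making sure NVLink support is enabled):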

# 1. Remove all leftover driver packages (patterns quoted so the shell does not expand them)
sudo apt-get purge -y 'nvidia-*' 'cuda-*'
sudo rmmod nvidia 2>/dev/null
sudo rm -rf /lib/modules/$(uname -r)/kernel/drivers/video/nvidia/
sudo depmod -a
sudo reboot

# 2. Download the driver for the A800 (570.195.03, to be installed with NVLink enabled)
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/570.195.03/NVIDIA-Linux-x86_64-570.195.03.run
chmod +x NVIDIA-Linux-x86_64-570.195.03.run

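# 3. Force the install (DKMS + NVLink enabled, unrelated components skipped)
# Confirm the flags with `sh NVIDIA-Linux-x86_64-570.195.03.run --advanced-options` first:
# the installer aborts on options it does not recognize, and --enable-nvlink may not be available in every release.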
sudo sh NVIDIA-Linux-x86_64-570.195.03.run \
  --dkms \
  --enable-nvlink \
  --no-x-check \
  --no-nouveau-check \
  --no-opengl-files \
  --silent

# 4. Verify that the module loads
sudo modprobe nvidia_nvlink
lsmod | grep nvidia_nvlink
