TIPS: Troubleshooting a large number of Pods stuck in the UnexpectedAdmissionError state

A large number of Pods in the cluster are in the UnexpectedAdmissionError state

Posted by 董江 on Wednesday, October 15, 2025

Symptom

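When listing the Pods in the cluster, a large number of Pod instances turn out to be in the UnexpectedAdmissionError state. For example, running kubectl get pod -A returns the following: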

$ kubectl get pod -A
...
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               20m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               24m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               39m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               45m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               13m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               30m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               22m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               12m
...
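
To gauge how widespread the problem is, the listing can be tallied by status. The following is a minimal sketch, not part of the original investigation: the helper name countByStatus is hypothetical, and it assumes each data row uses the whitespace-separated columns NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE seen in the output above.

```go
package main

import (
	"bufio"
	"strings"
)

// countByStatus tallies the STATUS column of `kubectl get pod -A` text output.
// Assumption: data rows look like "NAMESPACE NAME READY STATUS RESTARTS AGE";
// the header row and short lines such as "..." are skipped.
func countByStatus(output string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 6 || fields[0] == "NAMESPACE" {
			continue // not a data row
		}
		counts[fields[3]]++ // fourth column is STATUS
	}
	return counts
}
```

Feeding the captured kubectl output to countByStatus and reading the "UnexpectedAdmissionError" entry gives the number of rejected Pods.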

Error event:

$ kubectl describe pod xxx-avocado-0  -n ai-ns
...
Status:           Failed
Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to get device link information: error calling nvml.Init: Unknown Error, which is unexpected
...
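
The Message field nests several error layers, and the innermost one is the most useful. As a rough sketch (the helpers errorChain and rootCause are hypothetical, and the code assumes the layers are joined with ": " and end with the ", which is unexpected" suffix exactly as in the message above), the chain can be split like this:

```go
package main

import "strings"

// errorChain splits a rejection message into its nested layers.
// Assumption: layers are joined with ": ", as in the Message shown above.
func errorChain(msg string) []string {
	parts := strings.Split(msg, ": ")
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}

// rootCause returns the last two layers (failing call plus its error text),
// with the trailing ", which is unexpected" suffix removed.
func rootCause(msg string) string {
	layers := errorChain(msg)
	n := len(layers)
	if n < 2 {
		return strings.TrimSpace(msg)
	}
	last := strings.TrimSuffix(layers[n-1], ", which is unexpected")
	return layers[n-2] + ": " + last
}
```

Applied to the message above, rootCause isolates the nvml.Init failure from the surrounding kubelet and gRPC wrapping.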

Root cause

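After the volcano scheduler places a Pod on a node, the kubelet rejects the Pod if the node cannot satisfy the resources it requests (CPU, memory, or heterogeneous resources such as GPUs), and the Pod enters the terminal Failed state. In this case the rejection message shows that GPU allocation failed: the device plugin's GetPreferredAllocation RPC returned an error because nvml.Init failed.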

a. The Device Plugin fails when it calls init on the GPU library libnvidia-ml.so

// ref: https://github.com/NVIDIA/go-gpuallocator/blob/main/gpuallocator/device.go
// build uses the configured options to build a DeviceList.
func (o *deviceListBuilder) build() (DeviceList, error) {
	if err := o.nvmllib.Init(); err != nvml.SUCCESS {
		return nil, fmt.Errorf("error calling nvml.Init: %v", err)
	}
	defer func() {
		_ = o.nvmllib.Shutdown()
	}()
...

b. The libnvidia-ml.so init call itself fails


package nvml

import "C"

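// Default to the v1 symbol; switched to v2 below when it is available.
var nvmlInit = nvmlInit_v1

// Abridged excerpt: in the library this lookup runs inside a method on
// *library, after libnvidia-ml.so has been loaded.
err := l.dl.Lookup("nvmlInit_v2")
if err == nil {
	nvmlInit = nvmlInit_v2
}
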
// nvml.Init()
func (l *library) Init() Return {
	if err := l.load(); err != nil {
		return ERROR_LIBRARY_NOT_FOUND
	}
	return nvmlInit()  // this calls the nvmlInit_v2 function in libnvidia-ml.so
}
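
To see how a failing init call turns into the text in the Pod's Message, here is a small self-contained simulation. It is a sketch under stated assumptions: Return is a simplified stand-in for the NVML return code type (not the real binding), only two codes are modeled, and buildDeviceList is a hypothetical name that mirrors the check in the build() excerpt from section a.

```go
package main

import "fmt"

// Return is a simplified stand-in for the NVML return code type.
type Return int

const (
	SUCCESS       Return = 0
	ERROR_UNKNOWN Return = 999 // NVML_ERROR_UNKNOWN in nvml.h
)

// String gives readable text for the two codes modeled in this sketch.
func (r Return) String() string {
	switch r {
	case SUCCESS:
		return "Success"
	case ERROR_UNKNOWN:
		return "Unknown Error"
	default:
		return fmt.Sprintf("Return(%d)", int(r))
	}
}

// buildDeviceList mirrors the build() excerpt: any non-SUCCESS code
// returned by the init function is wrapped into a Go error.
func buildDeviceList(initFn func() Return) error {
	if ret := initFn(); ret != SUCCESS {
		return fmt.Errorf("error calling nvml.Init: %v", ret)
	}
	return nil
}
```

Passing an init function that returns ERROR_UNKNOWN yields the error "error calling nvml.Init: Unknown Error", which is the innermost layer of the message seen in kubectl describe.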

Solution
