TIPS: Troubleshooting Large Numbers of Pods Stuck in UnexpectedAdmissionError

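Symptoms

When listing the Pods in the cluster, a large number of them turn out to be in the UnexpectedAdmissionError state. For example, kubectl get pod -A returns output like the following: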

$ kubectl get pod -A
...
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               20m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               24m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               39m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               45m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               13m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               30m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               22m
ai-ns                      xxx-avocado-0                0/1     UnexpectedAdmissionError   0               12m
...

The error reported for one of these Pods:

$ kubectl describe pod xxx-avocado-0  -n ai-ns
...
Status:           Failed
Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to get device link information: error calling nvml.Init: Unknown Error, which is unexpected
...
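
The Message field above is a chain of wrapped errors, and the useful part is at the very end. As a rough sketch of how one might triage such messages in bulk, the following Go snippet checks whether a rejection was caused by nvml.Init and extracts the innermost cause. The helper names isNVMLInitFailure and innermostCause are made up for this illustration, and the parsing simply assumes the ": " separated format seen in the output above.

```go
package main

import (
	"fmt"
	"strings"
)

// isNVMLInitFailure reports whether a Pod rejection message points at a
// failed nvml.Init call. (Hypothetical helper, based only on the message
// format shown in the kubectl describe output.)
func isNVMLInitFailure(message string) bool {
	return strings.Contains(message, "error calling nvml.Init")
}

// innermostCause returns the last segment of a ": " separated error chain,
// after dropping the ", which is unexpected" suffix that ends the message.
func innermostCause(message string) string {
	msg := strings.TrimSuffix(strings.TrimSpace(message), ", which is unexpected")
	parts := strings.Split(msg, ": ")
	return strings.TrimSpace(parts[len(parts)-1])
}

func main() {
	msg := "Pod was rejected: Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: " +
		"rpc error: code = Unknown desc = error getting list of preferred allocation devices: " +
		"unable to get device link information: error calling nvml.Init: Unknown Error, which is unexpected"

	fmt.Println(isNVMLInitFailure(msg))
	fmt.Println(innermostCause(msg))
}
```

Run against the message above, this reports an nvml.Init failure whose innermost cause is "Unknown Error", which is the string the rest of this post traces back through the device plugin.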

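Root Cause

After the volcano scheduler places a Pod on a node, the kubelet on that node still has to admit it. If the node cannot provide the resources the Pod requests (CPU, memory, or heterogeneous resources such as GPUs), the kubelet rejects the Pod, which then enters the terminal Failed state. In this case the rejection comes from GPU allocation: the device plugin's GetPreferredAllocation call fails because nvml.Init returns an error, as the Message above shows.

a. The device plugin fails to initialize the GPU library libnvidia-ml.so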

// ref: https://github.com/NVIDIA/go-gpuallocator/blob/main/gpuallocator/device.go
// build uses the configured options to build a DeviceList.
func (o *deviceListBuilder) build() (DeviceList, error) {
	if err := o.nvmllib.Init(); err != nvml.SUCCESS {
		return nil, fmt.Errorf("error calling nvml.Init: %v", err)
	}
	defer func() {
		_ = o.nvmllib.Shutdown()
	}()
...

b. Initialization of libnvidia-ml.so itself fails


package nvml

import "C"

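// Default to the v1 symbol, then switch to v2 if the loaded library exports it.
// (Abridged excerpt: in the library, the lookup below runs inside a function
// after libnvidia-ml.so has been loaded.)
var nvmlInit = nvmlInit_v1
err := l.dl.Lookup("nvmlInit_v2")
if err == nil {
	nvmlInit = nvmlInit_v2
}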


// nvml.Init()
func (l *library) Init() Return {
	if err := l.load(); err != nil {
		return ERROR_LIBRARY_NOT_FOUND
	}
	return nvmlInit()  // resolves to nvmlInit_v2 in libnvidia-ml.so when that symbol is available
}
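
To make the propagation concrete, here is a minimal, self-contained sketch of how a non-success return code from Init ends up as the text seen in the Pod's Message. It does not use the real NVIDIA packages: Return, initFunc, buildDeviceList and getDeviceLinks are stand-ins written for this post, the numeric codes 0 and 999 follow NVML's NVML_SUCCESS and NVML_ERROR_UNKNOWN, and the wrapper strings are copied from the Message above.

```go
package main

import "fmt"

// Return is a stand-in for the NVML return code type (not the real package).
type Return int

const (
	SUCCESS       Return = 0   // NVML_SUCCESS
	ERROR_UNKNOWN Return = 999 // NVML_ERROR_UNKNOWN
)

// Error renders the return code as text, so that %v prints "Unknown Error".
func (r Return) Error() string {
	switch r {
	case SUCCESS:
		return "Success"
	case ERROR_UNKNOWN:
		return "Unknown Error"
	default:
		return fmt.Sprintf("nvml return code %d", int(r))
	}
}

// initFunc stands in for the library's Init method.
type initFunc func() Return

// buildDeviceList mirrors the check in deviceListBuilder.build shown earlier.
func buildDeviceList(init initFunc) error {
	if ret := init(); ret != SUCCESS {
		return fmt.Errorf("error calling nvml.Init: %v", ret)
	}
	return nil
}

// getDeviceLinks adds the next layer of context, as seen in the Pod Message.
func getDeviceLinks(init initFunc) error {
	if err := buildDeviceList(init); err != nil {
		return fmt.Errorf("unable to get device link information: %w", err)
	}
	return nil
}

func main() {
	// Simulate a driver whose nvmlInit_v2 returns an unknown error.
	broken := func() Return { return ERROR_UNKNOWN }
	fmt.Println(getDeviceLinks(broken))
}
```

The point of the sketch is that nothing in the device plugin itself is broken: every layer only adds context to the code returned by the driver library, so the fix has to happen at the driver level on the node.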

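Solution

Step 1: Clean up the Pods in the abnormal state. Kubernetes garbage-collects Failed/Succeeded Pods automatically (the Pod garbage collector in kube-controller-manager removes terminated Pods once their count exceeds --terminated-pod-gc-threshold). To clear them sooner, delete them manually, for example with kubectl delete pod -n ai-ns --field-selector=status.phase=Failed.
Step 2: Reinstall the NVIDIA driver on the affected node.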
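
If the cleanup in step 1 needs to be more selective than "all Failed Pods", the selection logic can be sketched as below. This is only an illustration on plain structs: podInfo and rejectedPods are hypothetical names, the Phase and Reason values are taken from the kubectl describe output earlier in this post, and the second Pod in the sample data is invented for the demo. A real tool would read these fields from the Kubernetes API instead.

```go
package main

import "fmt"

// podInfo is a hypothetical, minimal view of a Pod's identity and status.
type podInfo struct {
	Namespace string
	Name      string
	Phase     string // e.g. "Running", "Failed", "Succeeded"
	Reason    string // e.g. "UnexpectedAdmissionError"
}

// rejectedPods keeps only Pods that failed kubelet admission with
// UnexpectedAdmissionError and leaves every other Pod untouched.
func rejectedPods(pods []podInfo) []podInfo {
	var out []podInfo
	for _, p := range pods {
		if p.Phase == "Failed" && p.Reason == "UnexpectedAdmissionError" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []podInfo{
		{Namespace: "ai-ns", Name: "xxx-avocado-0", Phase: "Failed", Reason: "UnexpectedAdmissionError"},
		{Namespace: "ai-ns", Name: "healthy-pod", Phase: "Running"}, // invented for the demo
	}
	// Print the delete commands instead of running them, so they can be reviewed first.
	for _, p := range rejectedPods(pods) {
		fmt.Printf("kubectl delete pod %s -n %s\n", p.Name, p.Namespace)
	}
}
```

Printing the commands rather than executing them is a deliberate choice here: on a cluster with many rejected Pods it is worth reviewing the list before deleting anything.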
