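TIPS: Troubleshooting a Large Number of Pods Stuck in UnexpectedAdmissionError in a Cluster
Symptom
When listing the Pods in the cluster, a large number of them turn out to be in the UnexpectedAdmissionError state. For example, running kubectl get pod -A returns: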
$ kubectl get pod -A
...
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 20m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 24m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 39m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 45m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 13m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 30m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 22m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 12m
...
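To gauge how widespread the problem is, the listing above can be tallied per namespace. The following is a minimal sketch, assuming the default column order of kubectl get pod -A (NAMESPACE NAME READY STATUS RESTARTS AGE). The helper name countByStatus and the example-ns/example-pod row are hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// countByStatus tallies, per namespace, the Pods whose STATUS column equals status.
// Assumption: the input uses the default `kubectl get pod -A` column order,
// so the namespace is field 0 and the status is field 3.
func countByStatus(output, status string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		// Skip the header, "..." placeholders, and rows too short to be a Pod entry.
		if len(fields) < 6 || fields[3] != status {
			continue
		}
		counts[fields[0]]++
	}
	return counts
}

func main() {
	// The last row is a made-up healthy Pod, included to show that it is ignored.
	sample := `NAMESPACE NAME READY STATUS RESTARTS AGE
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 20m
ai-ns xxx-avocado-0 0/1 UnexpectedAdmissionError 0 24m
example-ns example-pod 1/1 Running 0 3d`
	for ns, n := range countByStatus(sample, "UnexpectedAdmissionError") {
		fmt.Printf("%s: %d\n", ns, n) // prints "ai-ns: 2"
	}
}
```

In practice the sample string would be replaced by the captured output of kubectl get pod -A.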
Error event:
$ kubectl describe pod xxx-avocado-0 -n ai-ns
...
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod was rejected: Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to get device link information: error calling nvml.Init: Unknown Error, which is unexpected
...
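The Message field is a chain of wrapped errors, and the innermost layer is usually the one that matters. The sketch below splits the gRPC description into its layers so that the root cause (here, the nvml.Init failure) stands out. The helper name errorChain is hypothetical, and the parsing simply assumes the "desc = " marker and the trailing ", which is unexpected" suffix seen in the message above.

```go
package main

import (
	"fmt"
	"strings"
)

// errorChain extracts the gRPC description from a rejection message and
// splits it into its wrapped layers, outermost first.
// Assumption: the message contains "desc = " and layers are joined by ": ".
func errorChain(msg string) []string {
	const marker = "desc = "
	i := strings.Index(msg, marker)
	if i < 0 {
		return nil
	}
	desc := msg[i+len(marker):]
	// Drop the trailing suffix that follows the device plugin's error text.
	desc = strings.TrimSuffix(desc, ", which is unexpected")
	return strings.Split(desc, ": ")
}

func main() {
	msg := "Pod was rejected: Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: " +
		"rpc error: code = Unknown desc = error getting list of preferred allocation devices: " +
		"unable to get device link information: error calling nvml.Init: Unknown Error, which is unexpected"
	// Print each layer indented by its depth in the chain.
	for depth, layer := range errorChain(msg) {
		fmt.Printf("%s%s\n", strings.Repeat("  ", depth), layer)
	}
}
```

For the message above, the last two layers are "error calling nvml.Init" and "Unknown Error", which points at NVML initialization rather than at scheduling or quota.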
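Root cause
After the volcano scheduler places a Pod on a node, the kubelet on that node runs its own admission check. If the node cannot actually provide the resources the Pod requests (CPU, memory, or heterogeneous resources such as GPUs), the kubelet rejects the Pod, and the Pod enters the terminal Failed state. Here it is the GPU allocation that fails:
a. The device plugin fails to initialize NVML (libnvidia-ml.so)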
// ref: https://github.com/NVIDIA/go-gpuallocator/blob/main/gpuallocator/device.go
// build uses the configured options to build a DeviceList.
func (o *deviceListBuilder) build() (DeviceList, error) {
	if err := o.nvmllib.Init(); err != nvml.SUCCESS {
		return nil, fmt.Errorf("error calling nvml.Init: %v", err)
	}
	defer func() {
		_ = o.nvmllib.Shutdown()
	}()
	...
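b. libnvidia-ml.so itself fails to initialize
package nvml
import "C"
// Abridged excerpt: nvmlInit defaults to the v1 entry point.
var nvmlInit = nvmlInit_v1
// Elsewhere in the package, when symbols are resolved, v2 is preferred if present:
	err := l.dl.Lookup("nvmlInit_v2")
	if err == nil {
		nvmlInit = nvmlInit_v2
	}
// nvml.Init()
func (l *library) Init() Return {
	if err := l.load(); err != nil {
		return ERROR_LIBRARY_NOT_FOUND
	}
	return nvmlInit() // calls the nvmlInit_v2 function in libnvidia-ml.so
}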
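The two excerpts above imply a useful distinction: if the shared library could not be loaded, Init would return ERROR_LIBRARY_NOT_FOUND, so an "Unknown Error" means the library loaded and its own init call failed. The following self-contained sketch reproduces that control flow with stand-in types. Nothing here is the real NVML binding: the Return type, its string values, and the load/nvmlInit hooks are all hypothetical and exist only to make the branching runnable.

```go
package main

import (
	"errors"
	"fmt"
)

// Return is a stand-in for the NVML return code type (hypothetical values).
type Return int

const (
	SUCCESS Return = iota
	ERROR_LIBRARY_NOT_FOUND
	ERROR_UNKNOWN
)

// Error renders the code as text; the strings are illustrative stand-ins.
func (r Return) Error() string {
	switch r {
	case SUCCESS:
		return "Success"
	case ERROR_LIBRARY_NOT_FOUND:
		return "library not found"
	default:
		return "Unknown Error"
	}
}

// library mimics the two steps of Init: load the shared object, then call
// its init entry point. Both fields are hypothetical hooks for this sketch.
type library struct {
	load     func() error
	nvmlInit func() Return
}

func (l *library) Init() Return {
	if err := l.load(); err != nil {
		return ERROR_LIBRARY_NOT_FOUND
	}
	return l.nvmlInit()
}

// build mirrors how the caller wraps a non-SUCCESS return code.
func build(l *library) error {
	if ret := l.Init(); ret != SUCCESS {
		return fmt.Errorf("error calling nvml.Init: %v", ret)
	}
	return nil
}

func main() {
	// Case 1: the shared library cannot be loaded at all.
	missing := &library{load: func() error { return errors.New("load failed") }}
	// Case 2: the library loads, but its init call fails.
	broken := &library{
		load:     func() error { return nil },
		nvmlInit: func() Return { return ERROR_UNKNOWN },
	}
	fmt.Println(build(missing)) // error calling nvml.Init: library not found
	fmt.Println(build(broken))  // error calling nvml.Init: Unknown Error
}
```

Only the second case matches the message in the Pod event, which is consistent with a problem in the installed driver stack rather than a missing libnvidia-ml.so.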
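Solution
- Step 1: Clean up the Pods in the abnormal state. Kubernetes garbage-collects Failed/Succeeded Pods automatically, but only once the number of terminated Pods exceeds the kube-controller-manager --terminated-pod-gc-threshold (12500 by default), so deleting them manually is usually faster.
- Step 2: Reinstall the NVIDIA driver on the affected node.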
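The manual cleanup in step 1 can be sketched as follows. This is a minimal example, assuming kubectl is on the PATH and that every Failed Pod in the namespace may be deleted (the field selector matches all Failed Pods, not only those with reason UnexpectedAdmissionError). The helper name deleteFailedPodsCmd is hypothetical. The command is only built and printed, not executed.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// deleteFailedPodsCmd builds (but does not run) a kubectl command that
// deletes every Pod in phase Failed within the given namespace.
func deleteFailedPodsCmd(namespace string) *exec.Cmd {
	return exec.Command("kubectl", "delete", "pod",
		"--namespace", namespace,
		"--field-selector", "status.phase=Failed")
}

func main() {
	cmd := deleteFailedPodsCmd("ai-ns")
	// Print the command for review; call cmd.Run() to actually execute it.
	fmt.Println(strings.Join(cmd.Args, " "))
}
```

Printing first and running second leaves room to check the namespace before anything is deleted.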