Uneven NUMA Distribution of Kubernetes CPUSet Pods Across Nodes
Background
While the CPU Manager was handling CPU affinity, we noticed that some machines still had idle NUMA capacity, yet some Pods consistently failed to find a suitable Node during binding.
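To see how exclusive CPUs were actually distributed on each node, one option is to read the CPU Manager checkpoint on the node. The snippet below is only a minimal sketch, assuming the static-policy checkpoint at /var/lib/kubelet/cpu_manager_state with policyName/defaultCpuSet/entries fields; the exact layout can differ between kubelet versions.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// cpuManagerState mirrors only the checkpoint fields that matter here;
// this is an assumption about the on-disk format, not kubelet source code.
type cpuManagerState struct {
	PolicyName    string                       `json:"policyName"`
	DefaultCPUSet string                       `json:"defaultCpuSet"`
	Entries       map[string]map[string]string `json:"entries"` // pod UID -> container -> cpuset
}

func main() {
	// Run on a node to see which exclusive CPUs the static policy has handed out.
	data, err := os.ReadFile("/var/lib/kubelet/cpu_manager_state")
	if err != nil {
		panic(err)
	}
	var st cpuManagerState
	if err := json.Unmarshal(data, &st); err != nil {
		panic(err)
	}
	fmt.Println("policy:", st.PolicyName)
	fmt.Println("shared pool:", st.DefaultCPUSet)
	for podUID, containers := range st.Entries {
		for name, cpus := range containers {
			fmt.Printf("pod %s container %s -> cpus %s\n", podUID, name, cpus)
		}
	}
}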
Problem Investigation
Two things stood out:
1) Some nodes joined the cluster later
[root@master-1 manifests]# kubectl get node -A
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane 38d v1.28.8
master-2 Ready control-plane 38d v1.28.8
master-3 Ready control-plane 38d v1.28.8
node-1 Ready <none> 2d1h v1.28.8
node-2 Ready <none> 2d1h v1.28.8
node-3 Ready <none> 2d1h v1.28.8
node-4 Ready <none> 2d1h v1.28.8
node-5 Ready <none> 32h v1.28.8
node-6 Ready <none> 32h v1.28.8
2) Some nodes report an abnormal warning: InvalidDiskCapacity: invalid capacity 0 on image filesystem
[root@master-1 manifests]# kubectl describe node node-2
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 44m kubelet Starting kubelet.
Warning InvalidDiskCapacity 44m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 44m kubelet Node node-2 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 44m kubelet Node node-2 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 44m kubelet Node node-2 status is now: NodeHasSufficientPID
Normal NodeNotReady 44m kubelet Node node-2 status is now: NodeNotReady
Normal NodeAllocatableEnforced 44m kubelet Updated Node Allocatable limit across pods
Normal NodeReady 44m kubelet Node node-2 status is now: NodeReady
Analysis
- Pods are scheduled by the native Kubernetes scheduler, and scheduling can become unbalanced for several reasons:
- node failures
- new nodes joining the cluster
- under-utilized node resources
- affinity rules, node pressure, and so on
All of these can lead to Pods being placed unevenly during scheduling, for example leaving some nodes overloaded.
When binding a Pod, the scheduler scores nodes on these factors, and nodes that joined the cluster later end up with lower scores. As a result we cannot get what we actually want: when allocating CPUs based on NUMA topology, an allocation policy chosen by the user, e.g. bin-packing first, or allocating from the most idle NUMA node.
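To make the difference between those two policies concrete, here is a small self-contained sketch; the NUMANode type and pickNUMANode helper are hypothetical, not kubelet or scheduler code. Bin-packing picks the fullest NUMA node that can still satisfy the request, while spreading picks the most idle one.

package main

import "fmt"

// NUMANode is a hypothetical view of one NUMA node's CPU usage.
type NUMANode struct {
	ID       int
	FreeCPUs int
}

// pickNUMANode chooses a NUMA node for a request of `want` CPUs.
// binPacking=true  -> prefer the node with the FEWEST free CPUs that still fits (packs tightly).
// binPacking=false -> prefer the node with the MOST free CPUs (spreads to the most idle node).
func pickNUMANode(nodes []NUMANode, want int, binPacking bool) (NUMANode, bool) {
	var best NUMANode
	found := false
	for _, n := range nodes {
		if n.FreeCPUs < want {
			continue // this NUMA node cannot satisfy the request
		}
		if !found ||
			(binPacking && n.FreeCPUs < best.FreeCPUs) ||
			(!binPacking && n.FreeCPUs > best.FreeCPUs) {
			best, found = n, true
		}
	}
	return best, found
}

func main() {
	nodes := []NUMANode{{ID: 0, FreeCPUs: 6}, {ID: 1, FreeCPUs: 20}}
	packed, _ := pickNUMANode(nodes, 4, true)
	spread, _ := pickNUMANode(nodes, 4, false)
	fmt.Println("bin-packing picks NUMA", packed.ID) // NUMA 0
	fmt.Println("most-idle picks NUMA", spread.ID)   // NUMA 1
}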
The invalid capacity 0 on image filesystem error also affects image GC, because the filesystem statistics it relies on are inaccurate:
...
// Get disk usage on disk holding images.
fsStats, err := im.statsProvider.ImageFsStats(ctx)
if err != nil {
	return err
}
var capacity, available int64
if fsStats.CapacityBytes != nil {
	capacity = int64(*fsStats.CapacityBytes) // the stats come back empty, so capacity stays 0
}
if fsStats.AvailableBytes != nil {
	available = int64(*fsStats.AvailableBytes)
}
if available > capacity {
	klog.InfoS("Availability is larger than capacity", "available", available, "capacity", capacity)
	available = capacity // once capacity is 0, available gets clamped to 0 as well
}
// Check valid capacity.
if capacity == 0 {
	err := goerrors.New("invalid capacity 0 on image filesystem")
	im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error()) // the InvalidDiskCapacity warning event is reported here
	return err
}
...
It also means that HighThresholdPercent and LowThresholdPercent never take effect:
...
// Check valid capacity.
if capacity == 0 {
	err := goerrors.New("invalid capacity 0 on image filesystem")
	im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
	return err // with an empty capacity we return here, before the threshold check below
}
// If over the max threshold, free enough to place us at the lower threshold.
usagePercent := 100 - int(available*100/capacity)
if usagePercent >= im.policy.HighThresholdPercent { // so neither HighThresholdPercent nor LowThresholdPercent ever takes effect
	amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
	klog.InfoS("Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold", "usage", usagePercent, "highThreshold", im.policy.HighThresholdPercent, "amountToFree", amountToFree, "lowThreshold", im.policy.LowThresholdPercent)
	freed, err := im.freeSpace(ctx, amountToFree, time.Now())
	if err != nil {
		return err
	}
	if freed < amountToFree {
		err := fmt.Errorf("Failed to garbage collect required amount of images. Attempted to free %d bytes, but only found %d bytes eligible to free.", amountToFree, freed)
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
		return err
	}
}
...
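For comparison, when the stats are healthy the threshold logic behaves as follows. This is a small standalone sketch with made-up numbers that mirrors the calculation above; 85%/80% are the kubelet defaults for HighThresholdPercent/LowThresholdPercent.

package main

import "fmt"

func main() {
	// Made-up example: 100 GiB image filesystem with 10 GiB free.
	var capacity, available int64 = 100 << 30, 10 << 30
	high, low := 85, 80 // kubelet defaults for HighThresholdPercent / LowThresholdPercent

	usagePercent := 100 - int(available*100/capacity) // 90%
	if usagePercent >= high {
		// Free enough images to bring usage back down to the low threshold:
		// 20 GiB target free space minus 10 GiB currently free = 10 GiB to reclaim.
		amountToFree := capacity*int64(100-low)/100 - available
		fmt.Printf("usage=%d%%, need to free %d GiB of images\n", usagePercent, amountToFree>>30)
	}
	// With capacity == 0 (the bug above) this code path is never reached,
	// so threshold-based image GC silently stops working.
}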
Solution
- Later, introduce a custom scheduler extension that supports custom policies such as bin-packing (binding workloads to physical cores by their characteristics), gang-scheduling (compact placement), and SpreadByPCPUs (allocating from the most idle NUMA node).
- The community does not have a good fix for the InvalidDiskCapacity issue yet, so we will track the upstream issues below and add monitoring and alerting for InvalidDiskCapacity events on nodes (see the sketch after this list):
- https://github.com/kubernetes/kubernetes/issues/113066
- https://github.com/kubernetes/kubernetes/issues/106420
- https://github.com/google/cadvisor/issues/3234
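As a starting point for that alerting, here is a minimal client-go sketch that lists Warning events with reason InvalidDiskCapacity across all namespaces. It assumes in-cluster configuration, and filtering purely on the reason/type field selectors is an assumption that fits the default kubelet event; adjust as needed.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster; use clientcmd for an out-of-cluster kubeconfig.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// List Warning events whose reason is InvalidDiskCapacity, in all namespaces.
	events, err := clientset.CoreV1().Events(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=InvalidDiskCapacity,type=Warning",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		// e.InvolvedObject.Name is the node that reported the event.
		fmt.Printf("%s\t%s\t%s\n", e.InvolvedObject.Name, e.Reason, e.Message)
	}
}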