TIPS: Kubernetes CPUSet Pods Are Unevenly Distributed Across NUMA Nodes on Each Node

Posted by 董江 on Wednesday, June 5, 2024

Background

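While the CPU Manager was handling CPU affinity, we found that some machines had idle NUMA nodes, while on other machines Pods consistently failed to find a suitable Node during binding.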

Investigation

Two things stood out:

1) Some nodes joined the cluster later than the others

[root@master-1 manifests]# kubectl get node -A
NAME       STATUS   ROLES           AGE    VERSION
master-1   Ready    control-plane   38d    v1.28.8
master-2   Ready    control-plane   38d    v1.28.8
master-3   Ready    control-plane   38d    v1.28.8
node-1     Ready    <none>          2d1h   v1.28.8
node-2     Ready    <none>          2d1h   v1.28.8
node-3     Ready    <none>          2d1h   v1.28.8
node-4     Ready    <none>          2d1h   v1.28.8
node-5     Ready    <none>          32h    v1.28.8
node-6     Ready    <none>          32h    v1.28.8

2) Some nodes report an abnormal warning: InvalidDiskCapacity: invalid capacity 0 on image filesystem

[root@master-1 manifests]# kubectl describe node node-2
......
Events:
  Type     Reason                   Age   From     Message
  ----     ------                   ----  ----     -------
  Normal   Starting                 44m   kubelet  Starting kubelet.
  Warning  InvalidDiskCapacity      44m   kubelet  invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  44m   kubelet  Node node-2 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    44m   kubelet  Node node-2 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     44m   kubelet  Node node-2 status is now: NodeHasSufficientPID
  Normal   NodeNotReady             44m   kubelet  Node node-2 status is now: NodeNotReady
  Normal   NodeAllocatableEnforced  44m   kubelet  Updated Node Allocatable limit across pods
  Normal   NodeReady                44m   kubelet  Node node-2 status is now: NodeReady

Analysis

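  1. Pods are scheduled by the native Kubernetes scheduler, and several factors can leave the result unbalanced:
  • Node failures
  • New nodes being added to the cluster
  • Underutilized node resources
  • Affinity, node pressure, and similar constraints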

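Any of these can lead to uneven Pod placement, for example by overloading some nodes. Nodes are scored on these factors at binding time, and because some nodes joined later, points were deducted when the scores were computed.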

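As a result, the scheduler cannot deliver what users want: a choice of allocation strategy when CPUs are assigned based on NUMA topology, such as preferring bin-packing or allocating from the most idle NUMA node.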
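The two strategies named above can be contrasted with a minimal sketch. The `numaNode` type and both `pick...` helpers are hypothetical names made up for illustration; they are not part of the kubelet or the scheduler, and they only model the selection rule, not real CPU pinning.

```go
package main

import "fmt"

// numaNode is a hypothetical, simplified view of one NUMA node.
type numaNode struct {
	ID       int
	FreeCPUs int
}

// pickBinPacking returns the ID of the NUMA node with the FEWEST free CPUs
// that still fits the request (tightest fit), or -1 if no node fits.
func pickBinPacking(nodes []numaNode, request int) int {
	best, bestFree := -1, 0
	for _, n := range nodes {
		if n.FreeCPUs < request {
			continue // this NUMA node cannot hold the request
		}
		if best == -1 || n.FreeCPUs < bestFree {
			best, bestFree = n.ID, n.FreeCPUs
		}
	}
	return best
}

// pickMostIdle returns the ID of the NUMA node with the MOST free CPUs
// that fits the request, or -1 if no node fits.
func pickMostIdle(nodes []numaNode, request int) int {
	best, bestFree := -1, 0
	for _, n := range nodes {
		if n.FreeCPUs < request {
			continue
		}
		if best == -1 || n.FreeCPUs > bestFree {
			best, bestFree = n.ID, n.FreeCPUs
		}
	}
	return best
}

func main() {
	// Example values only: three NUMA nodes with 4, 16 and 8 free CPUs.
	nodes := []numaNode{{ID: 0, FreeCPUs: 4}, {ID: 1, FreeCPUs: 16}, {ID: 2, FreeCPUs: 8}}
	fmt.Println("bin-packing picks NUMA node", pickBinPacking(nodes, 6)) // 2
	fmt.Println("most-idle picks NUMA node", pickMostIdle(nodes, 6))     // 1
}
```

For a 6-CPU request, bin-packing chooses the tightest fit (8 free CPUs) and keeps the large NUMA node intact for bigger Pods, while the most-idle rule spreads load onto the emptiest NUMA node.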

  2. The "invalid capacity 0 on image filesystem" condition affects image GC, and the reported data is inaccurate

Code: https://github.com/kubernetes/kubernetes/blob/release-1.28/pkg/kubelet/images/image_gc_manager.go#L293-L317

...
	// Get disk usage on disk holding images.
	fsStats, err := im.statsProvider.ImageFsStats(ctx)
	if err != nil {
		return err
	}

	var capacity, available int64
	if fsStats.CapacityBytes != nil {
		capacity = int64(*fsStats.CapacityBytes)  // the value obtained here is empty (0)
	}
	if fsStats.AvailableBytes != nil {
		available = int64(*fsStats.AvailableBytes)
	}

	if available > capacity {
		klog.InfoS("Availability is larger than capacity", "available", available, "capacity", capacity)
		available = capacity  // with an empty capacity, available space is also set to empty (0)
	}

	// Check valid capacity.
	if capacity == 0 {
		err := goerrors.New("invalid capacity 0 on image filesystem")
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())  // reports the InvalidDiskCapacity event
		return err
	}
...

This also means that HighThresholdPercent and LowThresholdPercent never take effect:

...
	// Check valid capacity.
	if capacity == 0 {
		err := goerrors.New("invalid capacity 0 on image filesystem")
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
		return err  // with an empty capacity, the function returns right here
	}

	// If over the max threshold, free enough to place us at the lower threshold.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent { // neither HighThresholdPercent nor LowThresholdPercent takes effect
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		klog.InfoS("Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold", "usage", usagePercent, "highThreshold", im.policy.HighThresholdPercent, "amountToFree", amountToFree, "lowThreshold", im.policy.LowThresholdPercent)
		freed, err := im.freeSpace(ctx, amountToFree, time.Now())
		if err != nil {
			return err
		}

		if freed < amountToFree {
			err := fmt.Errorf("Failed to garbage collect required amount of images. Attempted to free %d bytes, but only found %d bytes eligible to free.", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}
...
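To make the arithmetic in the excerpt concrete, here is a small standalone sketch that reproduces the same checks outside the kubelet. The function name `bytesToFree` and the numbers in `main` are assumptions for illustration; only the formulas are taken from the excerpt.

```go
package main

import (
	"errors"
	"fmt"
)

// bytesToFree mirrors the checks in the excerpt: clamp available to capacity,
// reject a zero capacity, then compare usage against the high threshold.
// It returns how many bytes GC would try to free (0 if below the threshold).
func bytesToFree(capacity, available int64, highPct, lowPct int) (int64, error) {
	if available > capacity {
		available = capacity
	}
	if capacity == 0 {
		// Same early return as the kubelet: the thresholds below are never reached.
		return 0, errors.New("invalid capacity 0 on image filesystem")
	}
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent < highPct {
		return 0, nil
	}
	return capacity*int64(100-lowPct)/100 - available, nil
}

func main() {
	// Example thresholds: high=85, low=80.
	// Capacity 1000, available 100: usage is 90%, so GC frees down to 80% usage.
	n, err := bytesToFree(1000, 100, 85, 80)
	fmt.Println(n, err) // 100 <nil>

	// Capacity reported as 0: GC aborts before any threshold is evaluated.
	n, err = bytesToFree(0, 500, 85, 80)
	fmt.Println(n, err) // 0 invalid capacity 0 on image filesystem
}
```

The second call shows the failure mode described here: once the stats provider reports a capacity of 0, the function returns before the threshold comparison, no matter how full the disk really is.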

Solution

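  1. Going forward, introduce a custom scheduling extension that supports custom strategies such as bin-packing (feature-based binding to physical cores), gang-scheduling (compact deployment), and SpreadByPCPUs (allocating from the most idle NUMA node)
  2. The community has no good fix for this yet, so keep tracking the upstream work, and add monitoring and alerting for InvalidDiskCapacity events on nodes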
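As one possible starting point for the alerting in item 2, the sketch below filters an event list for InvalidDiskCapacity warnings on nodes. It assumes the input is JSON in the shape produced by `kubectl get events -A -o json`; the function name `nodesWithInvalidDiskCapacity` and the sample payload are made up for illustration, and wiring the result into a real alerting system is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// eventList decodes only the fields this check needs from an Event list.
type eventList struct {
	Items []struct {
		Reason         string `json:"reason"`
		InvolvedObject struct {
			Kind string `json:"kind"`
			Name string `json:"name"`
		} `json:"involvedObject"`
	} `json:"items"`
}

// nodesWithInvalidDiskCapacity returns the sorted, de-duplicated names of
// nodes that have at least one InvalidDiskCapacity event in the list.
func nodesWithInvalidDiskCapacity(data []byte) ([]string, error) {
	var list eventList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	var nodes []string
	for _, ev := range list.Items {
		if ev.Reason != "InvalidDiskCapacity" || ev.InvolvedObject.Kind != "Node" {
			continue
		}
		if name := ev.InvolvedObject.Name; !seen[name] {
			seen[name] = true
			nodes = append(nodes, name)
		}
	}
	sort.Strings(nodes)
	return nodes, nil
}

func main() {
	// Hand-written sample payload, not real cluster output.
	sample := []byte(`{"items":[
	  {"reason":"InvalidDiskCapacity","involvedObject":{"kind":"Node","name":"node-2"}},
	  {"reason":"NodeReady","involvedObject":{"kind":"Node","name":"node-3"}},
	  {"reason":"InvalidDiskCapacity","involvedObject":{"kind":"Node","name":"node-2"}}
	]}`)
	nodes, err := nodesWithInvalidDiskCapacity(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(nodes) // [node-2]
}
```

Because events expire after a while, a check like this would need to run periodically, or the same filter could be expressed in whatever event exporter the cluster already uses.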

Kubeservice Blog
