Uneven NUMA Distribution of Kubernetes CPUSet Pods Across Nodes
Background
While the CPU Manager was handling CPU affinity, we noticed that some machines still had idle NUMA capacity, yet some Pods consistently failed to find a suitable Node during binding.
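To see how exclusive CPUs were actually distributed on each node, one option is to read the CPU Manager checkpoint on the node. The snippet below is only a minimal sketch, assuming the static-policy checkpoint at /var/lib/kubelet/cpu_manager_state with policyName/defaultCpuSet/entries fields; the exact layout can differ between kubelet versions.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// cpuManagerState mirrors only the checkpoint fields that matter here;
// this is an assumption about the on-disk format, not kubelet source code.
type cpuManagerState struct {
	PolicyName    string                       `json:"policyName"`
	DefaultCPUSet string                       `json:"defaultCpuSet"`
	Entries       map[string]map[string]string `json:"entries"` // pod UID -> container -> cpuset
}

func main() {
	// Run on a node to see which exclusive CPUs the static policy has handed out.
	data, err := os.ReadFile("/var/lib/kubelet/cpu_manager_state")
	if err != nil {
		panic(err)
	}
	var st cpuManagerState
	if err := json.Unmarshal(data, &st); err != nil {
		panic(err)
	}
	fmt.Println("policy:", st.PolicyName)
	fmt.Println("shared pool:", st.DefaultCPUSet)
	for podUID, containers := range st.Entries {
		for name, cpus := range containers {
			fmt.Printf("pod %s container %s -> cpus %s\n", podUID, name, cpus)
		}
	}
}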
Problem Investigation
Two things stood out:
1) Some nodes joined the cluster later
[root@master-1 manifests]# kubectl get node -A
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane 38d v1.28.8
master-2 Ready control-plane 38d v1.28.8
master-3 Ready control-plane 38d v1.28.8
node-1 Ready <none> 2d1h v1.28.8
node-2 Ready <none> 2d1h v1.28.8
node-3 Ready <none> 2d1h v1.28.8
node-4 Ready <none> 2d1h v1.28.8
node-5 Ready <none> 32h v1.28.8
node-6 Ready <none> 32h v1.28.8
2) Some nodes report an abnormal warning: InvalidDiskCapacity: invalid capacity 0 on image filesystem
[root@master-1 manifests]# kubectl describe node node-2
......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 44m kubelet Starting kubelet.
Warning InvalidDiskCapacity 44m kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 44m kubelet Node node-2 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 44m kubelet Node node-2 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 44m kubelet Node node-2 status is now: NodeHasSufficientPID
Normal NodeNotReady 44m kubelet Node node-2 status is now: NodeNotReady
Normal NodeAllocatableEnforced 44m kubelet Updated Node Allocatable limit across pods
Normal NodeReady 44m kubelet Node node-2 status is now: NodeReady
Analysis
- Pods are scheduled by the native Kubernetes scheduler, and scheduling can become unbalanced for several reasons:
- node failures
- new nodes joining the cluster
- under-utilized node resources
- affinity rules, node pressure, and so on
All of these can lead to Pods being placed unevenly during scheduling, for example leaving some nodes overloaded.
When binding a Pod, the scheduler scores nodes on these factors, and nodes that joined the cluster later end up with lower scores. As a result we cannot get what we actually want: when allocating CPUs based on NUMA topology, an allocation policy chosen by the user, e.g. bin-packing first, or allocating from the most idle NUMA node.
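To make the difference between those two policies concrete, here is a small self-contained sketch; the NUMANode type and pickNUMANode helper are hypothetical, not kubelet or scheduler code. Bin-packing picks the fullest NUMA node that can still satisfy the request, while spreading picks the most idle one.

package main

import "fmt"

// NUMANode is a hypothetical view of one NUMA node's CPU usage.
type NUMANode struct {
	ID       int
	FreeCPUs int
}

// pickNUMANode chooses a NUMA node for a request of `want` CPUs.
// binPacking=true  -> prefer the node with the FEWEST free CPUs that still fits (packs tightly).
// binPacking=false -> prefer the node with the MOST free CPUs (spreads to the most idle node).
func pickNUMANode(nodes []NUMANode, want int, binPacking bool) (NUMANode, bool) {
	var best NUMANode
	found := false
	for _, n := range nodes {
		if n.FreeCPUs < want {
			continue // this NUMA node cannot satisfy the request
		}
		if !found ||
			(binPacking && n.FreeCPUs < best.FreeCPUs) ||
			(!binPacking && n.FreeCPUs > best.FreeCPUs) {
			best, found = n, true
		}
	}
	return best, found
}

func main() {
	nodes := []NUMANode{{ID: 0, FreeCPUs: 6}, {ID: 1, FreeCPUs: 20}}
	packed, _ := pickNUMANode(nodes, 4, true)
	spread, _ := pickNUMANode(nodes, 4, false)
	fmt.Println("bin-packing picks NUMA", packed.ID) // NUMA 0
	fmt.Println("most-idle picks NUMA", spread.ID)   // NUMA 1
}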
The invalid capacity 0 on image filesystem error also affects image GC, because the filesystem statistics it relies on are inaccurate:
...
// Get disk usage on disk holding images.
fsStats, err := im.statsProvider.ImageFsStats(ctx)
if err != nil {
	return err
}
var capacity, available int64
if fsStats.CapacityBytes != nil {
	capacity = int64(*fsStats.CapacityBytes) // the stats come back empty, so capacity stays 0
}
if fsStats.AvailableBytes != nil {
	available = int64(*fsStats.AvailableBytes)
}
if available > capacity {
	klog.InfoS("Availability is larger than capacity", "available", available, "capacity", capacity)
	available = capacity // once capacity is 0, available gets clamped to 0 as well
}
// Check valid capacity.
if capacity == 0 {
	err := goerrors.New("invalid capacity 0 on image filesystem")
	im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error()) // the InvalidDiskCapacity warning event is reported here
	return err
}
...
It also means that HighThresholdPercent and LowThresholdPercent never take effect:
...
// Check valid capacity.
if capacity == 0 {
	err := goerrors.New("invalid capacity 0 on image filesystem")
	im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())
	return err // with an empty capacity we return here, before the threshold check below
}
// If over the max threshold, free enough to place us at the lower threshold.
usagePercent := 100 - int(available*100/capacity)
if usagePercent >= im.policy.HighThresholdPercent { // so neither HighThresholdPercent nor LowThresholdPercent ever takes effect
	amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
	klog.InfoS("Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold", "usage", usagePercent, "highThreshold", im.policy.HighThresholdPercent, "amountToFree", amountToFree, "lowThreshold", im.policy.LowThresholdPercent)
	freed, err := im.freeSpace(ctx, amountToFree, time.Now())
	if err != nil {
		return err
	}
	if freed < amountToFree {
		err := fmt.Errorf("Failed to garbage collect required amount of images. Attempted to free %d bytes, but only found %d bytes eligible to free.", amountToFree, freed)
		im.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())
		return err
	}
}
...
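For comparison, when the stats are healthy the threshold logic behaves as follows. This is a small standalone sketch with made-up numbers that mirrors the calculation above; 85%/80% are the kubelet defaults for HighThresholdPercent/LowThresholdPercent.

package main

import "fmt"

func main() {
	// Made-up example: 100 GiB image filesystem with 10 GiB free.
	var capacity, available int64 = 100 << 30, 10 << 30
	high, low := 85, 80 // kubelet defaults for HighThresholdPercent / LowThresholdPercent

	usagePercent := 100 - int(available*100/capacity) // 90%
	if usagePercent >= high {
		// Free enough images to bring usage back down to the low threshold:
		// 20 GiB target free space minus 10 GiB currently free = 10 GiB to reclaim.
		amountToFree := capacity*int64(100-low)/100 - available
		fmt.Printf("usage=%d%%, need to free %d GiB of images\n", usagePercent, amountToFree>>30)
	}
	// With capacity == 0 (the bug above) this code path is never reached,
	// so threshold-based image GC silently stops working.
}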
Solution
- Later, introduce a custom scheduler extension that supports custom policies such as bin-packing (binding workloads to physical cores by their characteristics), gang-scheduling (compact placement), and SpreadByPCPUs (allocating from the most idle NUMA node).
- The community does not have a good fix for the InvalidDiskCapacity issue yet, so we will track the upstream issues below and add monitoring and alerting for InvalidDiskCapacity events on nodes (see the sketch after this list):
- https://github.com/kubernetes/kubernetes/issues/113066
- https://github.com/kubernetes/kubernetes/issues/106420
- https://github.com/google/cadvisor/issues/3234
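As a starting point for that alerting, here is a minimal client-go sketch that lists Warning events with reason InvalidDiskCapacity across all namespaces. It assumes in-cluster configuration, and filtering purely on the reason/type field selectors is an assumption that fits the default kubelet event; adjust as needed.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster; use clientcmd for an out-of-cluster kubeconfig.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// List Warning events whose reason is InvalidDiskCapacity, in all namespaces.
	events, err := clientset.CoreV1().Events(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=InvalidDiskCapacity,type=Warning",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		// e.InvolvedObject.Name is the node that reported the event.
		fmt.Printf("%s\t%s\t%s\n", e.InvolvedObject.Name, e.Reason, e.Message)
	}
}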