TIPS: Troubleshooting Kubernetes NUMA CPU Affinity Failures

Posted by 董江 on Thursday, May 16, 2024

Background

After configuring the NUMA Topology Manager and the CPU Manager (Kubernetes Pod/Container NUMA affinity management), CPU affinity for the workload did not take effect. The workload spec is shown below, followed by a sketch of the kubelet configuration assumed here.

spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    resources:
      limits:
        cpu: 2
        memory: 1Gi
      requests:
        cpu: 2
        memory: 1Gi
  initContainers:
  - name: init-myservice
    image: busybox:1.28
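
For context, NUMA-aware CPU pinning requires the kubelet's static CPU manager together with the Topology Manager. The node's actual kubelet configuration is not shown in this post, so the fragment below is only an illustrative sketch (the reservedSystemCPUs value in particular is an assumption):

# /var/lib/kubelet/config.yaml (fragment) -- illustrative sketch, not the actual node config
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # exclusive CPUs for Guaranteed pods with integer CPU requests
topologyManagerPolicy: single-numa-node   # refuse admission when resources cannot fit on one NUMA node
reservedSystemCPUs: "0-1"                 # assumed: CPUs kept out of the exclusive pool for system daemons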

Analysis

Kubelet log analysis

Problem 1: errors in the kubelet log

● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since 四 2024-05-09 17:13:41 CST; 6 days ago
     Docs: http://kubernetes.io/docs/
  Process: 55204 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
 Main PID: 55226 (kubelet)
    Tasks: 104
   Memory: 222.3M
   CGroup: /system.slice/kubelet.service
           └─55226 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --pod-infra...

5月 16 12:39:33 node-10-107-65-32 kubelet[55226]: rpc error: code = Unknown desc = failed to update resources: failed to update resources: runc did not terminate successfully: exit status 1: unable to freeze
5月 16 12:39:33 node-10-107-65-32 kubelet[55226]: : unknown
5月 16 12:39:33 node-10-107-65-32 kubelet[55226]: > containerID="f9c8dabff74b210ce64987abed0d6e3a4104b2f0cfb7904d42011cb23521918f"
5月 16 12:39:33 node-10-107-65-32 kubelet[55226]: E0516 12:39:33.998533   55226 cpu_manager.go:476] "ReconcileState: failed to update container" err=<
5月 16 12:39:33 node-10-107-65-32 kubelet[55226]: rpc error: code = Unknown desc = failed to update resources: failed to update resources: runc did not terminate successfully: exit status 1: unable to freeze
5月 16 12:39:33 node-10-107-65-32 kubelet[55226]: : unknown
5月 16 12:39:33 node-10-107-65-32 kubelet[55226]: > pod="autocar-baic-chaos-hd/esboost-7b9d6f55fd-wwjcz" containerName="app" containerID="f9c8dabff74b210ce64987abed0d6e3a4104b2f0cfb7904d42011cb23521918f" cpuSet="0-1,18-49,66-95"
5月 16 12:39:39 node-10-107-65-32 kubelet[55226]: I0516 12:39:39.075187   55226 state_mem.go:80] "Updated desired CPUSet" podUID="6e48e804-ce18-4658-a11c-c01f90ed8723" containerName="app" cpuSet="0-1,18-49,66-95"
5月 16 13:34:35 node-10-107-65-32 kubelet[55226]: I0516 13:34:35.600214   55226 scope.go:117] "RemoveContainer" containerID="19802f2886c28e4a36460f4e1a0874e907b11871be5bcb49637e71fab6ad87da"
5月 16 13:34:35 node-10-107-65-32 kubelet[55226]: I0516 13:34:35.600535   55226 scope.go:117] "RemoveContainer" containerID="2b3f55dc806bb42768bf86b139892369c5404d11e6aeccc453cd5ebc224973d4"

One error stands out: cpu_manager failed to update the container's cpuset state.

Looking at the K8s source, the relevant cpu_manager reconcile code:

			// Once we make it here we know we have a running container.
			// Idempotently add it to the containerMap incase it is missing.
			// This can happen after a kubelet restart, for example.
			m.containerMap.Add(string(pod.UID), container.Name, containerID)
			m.Unlock()

			cset := m.state.GetCPUSetOrDefault(string(pod.UID), container.Name)
			if cset.IsEmpty() {
				// NOTE: This should not happen outside of tests.
				klog.V(4).InfoS("ReconcileState: skipping container; assigned cpuset is empty", "pod", klog.KObj(pod), "containerName", container.Name)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
				continue
			}

			lcset := m.lastUpdateState.GetCPUSetOrDefault(string(pod.UID), container.Name)
			if !cset.Equals(lcset) {
				klog.V(4).InfoS("ReconcileState: updating container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset)
				err = m.updateContainerCPUSet(ctx, containerID, cset)
				if err != nil {
					klog.ErrorS(err, "ReconcileState: failed to update container", "pod", klog.KObj(pod), "containerName", container.Name, "containerID", containerID, "cpuSet", cset)
					failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
					continue
				}
				m.lastUpdateState.SetCPUSet(string(pod.UID), container.Name, cset)
			}
			success = append(success, reconciledContainer{pod.Name, container.Name, containerID})

The background here: after kubelet starts, it reconciles the state file under /var/lib/kubelet/cpu_manager_state and writes the latest CPUSet assignment for each container. In this case the update fails because the cpu_manager_state file was not deleted before kubelet was restarted, so merging the stale state information fails.
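
For illustration only, the state file is a small JSON checkpoint. The values below are hypothetical, but the overall shape (policy name, shared default cpuset, per-container entries keyed by pod UID, and a checksum) reflects what the static CPU manager persists:

{
  "policyName": "static",
  "defaultCpuSet": "0-1,18-49,66-95",
  "entries": {
    "6e48e804-ce18-4658-a11c-c01f90ed8723": {
      "app": "2-17,50-65"
    }
  },
  "checksum": 1404198687
}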

Solution (a shell sketch of the full procedure follows the list):

  1. Drain the node.
  2. Stop kubelet.
  3. Remove the old CPU manager state file. Its path defaults to /var/lib/kubelet/cpu_manager_state. This clears the state maintained by the CPUManager so that the cpu-sets set up by the new policy will not conflict with it.
  4. Edit the kubelet configuration to change the CPU manager policy to the desired value.
  5. Start kubelet: systemctl restart kubelet
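
A minimal sketch of the procedure above, assuming the node name from the logs and kubectl access from a control machine:

# 1. Drain the node (evicts workload pods; DaemonSet pods remain)
kubectl drain node-10-107-65-32 --ignore-daemonsets --delete-emptydir-data

# Steps 2-5 run on the node itself
systemctl stop kubelet                    # 2. stop kubelet
rm -f /var/lib/kubelet/cpu_manager_state  # 3. remove the stale CPUManager state
vi /var/lib/kubelet/config.yaml           # 4. set cpuManagerPolicy to the desired value
systemctl restart kubelet                 # 5. start kubelet again

# Make the node schedulable again once it looks healthy
kubectl uncordon node-10-107-65-32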

Problem 2: kubelet restart fails

[root@node-10-107-65-32 kubelet]# systemctl stop kubelet && systemctl status kubelet && systemctl start kubelet
Error: Too many open files

The same symptom also shows up in the workload logs:

failed to create fsnotify watcher: too many open files

This happens because the number of file descriptors a single process is allowed to open is limited, so the relevant system parameters need to be raised.
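
Before changing anything, the current limits and the biggest inotify consumers can be confirmed with standard procfs/sysctl queries; a diagnostic sketch:

# Effective fd limit of the running kubelet process
grep 'open files' /proc/$(pidof kubelet)/limits

# Current inotify limits
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches fs.inotify.max_queued_events

# Top consumers of inotify instances (each fd symlinked to anon_inode:inotify is one instance)
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 | sort | uniq -c | sort -rn | head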

Solutions:

  1. Raise the per-process limit on open file handles on the node (adjust Linux nofile & nproc):
cat >> /etc/security/limits.conf <<EOF
# per-user max open file descriptors
* soft nofile 655360
* hard nofile 655360
# per-user max processes
* soft nproc  15412410
* hard nproc  15412410
EOF
  2. Update the kernel parameters governing system-wide open-fd and per-user inotify limits:
$ cat >> /etc/sysctl.conf <<EOF
fs.file-max=2097152 # system-wide limit on the number of open file handles
fs.inotify.max_user_instances=8192 # max inotify instances per real user ID (default: 128)
fs.inotify.max_queued_events=16384 # max events queued per inotify instance (default: 16384)
EOF

$ sysctl -p

Problem 3: CPUSet not applied for a Deployment with an initContainer

Looking at the cpumanager/policy_static.go source:

func (p *staticPolicy) updateCPUsToReuse(pod *v1.Pod, container *v1.Container, cset cpuset.CPUSet) {
	// If pod entries to m.cpusToReuse other than the current pod exist, delete them.
	for podUID := range p.cpusToReuse {
		if podUID != string(pod.UID) {
			delete(p.cpusToReuse, podUID)
		}
	}
	// If no cpuset exists for cpusToReuse by this pod yet, create one.
	if _, ok := p.cpusToReuse[string(pod.UID)]; !ok {
		p.cpusToReuse[string(pod.UID)] = cpuset.New()
	}
	
	// Check if the container is an init container.
	// If so, add its cpuset to the cpuset of reusable CPUs for any new allocations.

	for _, initContainer := range pod.Spec.InitContainers {
		if container.Name == initContainer.Name {
			// This check decides whether the initContainer's CPUSet may be reused by later containers.
			if types.IsRestartableInitContainer(&initContainer) {
				// If the container is a restartable init container, we should not
				// reuse its cpuset, as a restartable init container can run with
				// regular containers.
				break
			}
			p.cpusToReuse[string(pod.UID)] = p.cpusToReuse[string(pod.UID)].Union(cset)
			return
		}
	}

	// Otherwise it is an app container.
	// Remove its cpuset from the cpuset of reusable CPUs for any new allocations.
	p.cpusToReuse[string(pod.UID)] = p.cpusToReuse[string(pod.UID)].Difference(cset)
}

This reveals the cause: when the initContainer's resources are inconsistent with the main container's resources, the flow breaks out early, and neither /var/lib/kubelet/cpu_manager_state nor the container's CPUSet setting gets updated. (More precisely, the static policy only grants exclusive CPUs to pods in the Guaranteed QoS class, and the QoS class is computed over initContainers as well, so an initContainer without matching resources keeps the whole pod out of that class.) The initContainer's resources therefore need to be set identically to the main container's resources.

Solution

Update the workload's Deployment YAML:

spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    resources:  # these settings...
      limits:
        cpu: 2
        memory: 1Gi
      requests:
        cpu: 2
        memory: 1Gi
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    resources: # ...must be identical to these settings
      limits:
        cpu: 2
        memory: 1Gi
      requests:
        cpu: 2
        memory: 1Gi

The resources for containers and initContainers must be set identically. A quick verification is sketched below.
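
Once the pod is redeployed, the pinning can be verified on the node. A minimal check, assuming the pod was scheduled to this node and using the container name from the spec above (the pod name is a placeholder):

# The static CPU manager's checkpoint should now contain an entry for the container
python3 -m json.tool /var/lib/kubelet/cpu_manager_state

# The effective cpuset inside the container should be the exclusively assigned CPUs,
# not the shared default pool (path shown is cgroup v1; on cgroup v2 use
# /sys/fs/cgroup/cpuset.cpus.effective)
kubectl exec <myapp-pod> -c myapp-container -- cat /sys/fs/cgroup/cpuset/cpuset.cpus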
