TIPS: Troubleshooting Pods Stuck at Ready=False After a Kubernetes Node Network Failure

Posted by 董江 on Wednesday, September 28, 2022

Symptom

After a network failure, the Kubernetes Node recovered, but a Pod on that Node is left in an inconsistent state: its phase is Running, yet its Ready condition is False:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-09-27T02:17:29Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:25Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://336c6bba9a25bc713e63c4386e42cfa2165ad530c4478be09ec0174475c5489c
    image: haproxy-cmss:v1.7.12
    imageID: docker://sha256:7a5c25e9b4740f54520f172c586f542a3a12fc44a6aa9473783f965bf2ba403b
    lastState: {}
    name: xxxxx
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-11-18T12:11:24Z"
  hostIP: 10.175.149.5
  phase: Running
  podIP: 10.222.197.153
  podIPs:
  - ip: 10.222.197.153
  qosClass: Burstable
  startTime: "2021-11-18T12:11:20Z"
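
A quick way to find Pods in this state across the cluster is to compare phase and the Ready condition with client-go. The following is a minimal sketch, assuming a recent client-go and a local kubeconfig; it is illustrative and not part of the original investigation:

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Lists Pods whose phase is Running but whose Ready condition is False,
// i.e. exactly the symptom shown in the status above.
func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodRunning {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionFalse {
				fmt.Printf("%s/%s: Running but Ready=False since %s\n",
					pod.Namespace, pod.Name, cond.LastTransitionTime)
			}
		}
	}
}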

Investigation

kubelet log analysis

Several observations from the kubelet logs:

  1. The kubelet's node lease PUT requests failed intermittently (10:16:17–10:25:27), and the Node was marked NodeNotReady (see the lease check sketch after this list).
  2. After the Node went NodeNotReady, several workload Pods were evicted and restarted (the Pod above was not among them).
  3. The Pod above was not restarted, but its lastTransitionTime was updated.
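
The node lease mentioned in item 1 is a coordination.k8s.io/v1 Lease object in the kube-node-lease namespace that the kubelet renews roughly every 10 seconds by default; when renewals stop, the node lifecycle controller eventually marks the Node NotReady. A small diagnostic sketch for checking lease staleness (the function name and clientset wiring are illustrative, assuming a recent client-go):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkLease reports how long ago the kubelet last renewed the Node's lease.
func checkLease(clientset kubernetes.Interface, nodeName string) error {
	lease, err := clientset.CoordinationV1().Leases("kube-node-lease").
		Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if lease.Spec.RenewTime == nil {
		return fmt.Errorf("lease for node %s has never been renewed", nodeName)
	}
	age := time.Since(lease.Spec.RenewTime.Time)
	fmt.Printf("lease for node %s last renewed %s ago\n", nodeName, age)
	return nil
}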

Checking the Deployment / ReplicaSet / Pod status inconsistency

$ kubectl get deployment xxxxx -n xxxx  -o yaml

Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   True 
  PodScheduled      True 

status:
  conditions:
  - lastTransitionTime: "2021-07-30T08:38:16Z"
    lastUpdateTime: "2021-07-30T08:38:20Z"
    message: ReplicaSet "haproxy-79f86bb4f4" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-09-27T02:17:29Z"
    lastUpdateTime: "2022-09-27T02:17:29Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1

$ kubectl get rs  -n xxxxx        
NAME                  DESIRED   CURRENT   READY   AGE
xxxxxxxx-58fffcb49d   1         1         1       2d20h

$ kubectl get pod  -n xxxxx        
NAME                        READY   STATUS    RESTARTS   AGE
blackbox-58fffcb49d-6nlg6   1/1     Running   0          2d20h

The Deployment and the Pod report inconsistent status.

Analysis

Why the Pod's Ready condition in etcd is set to 'False'

kube-controller-manager runs a node lifecycle controller that periodically checks Node health. If the kubelet stops reporting through its node lease for a full period, the node lifecycle controller performs its node health monitor pass:

func (nc *Controller) monitorNodeHealth() error {

	// List all Nodes from the node lister
	nodes, err := nc.nodeLister.List(labels.Everything())
	if err != nil {
		return err
	}
	added, deleted, newZoneRepresentatives := nc.classifyNodes(nodes)

    // Add a pod evictor for each newly observed zone
	for i := range newZoneRepresentatives {
		nc.addPodEvictorForNewZone(newZoneRepresentatives[i])
	}

    // Handle added Nodes
	for i := range added {
    //...
	}

    // Handle deleted Nodes
	for i := range deleted {
    // ...
	}

	zoneToNodeConditions := map[string][]*v1.NodeCondition{}
	// Inspect every Node
	for i := range nodes {
		var gracePeriod time.Duration
		var observedReadyCondition v1.NodeCondition
		var currentReadyCondition *v1.NodeCondition
		node := nodes[i].DeepCopy()
		// Try to update the Node's health, re-fetching the latest Node from the apiserver on failure
		if err := wait.PollImmediate(retrySleepTime, retrySleepTime*scheduler.NodeHealthUpdateRetry, func() (bool, error) {
			gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
			if err == nil {
				return true, nil
			}
			name := node.Name
			node, err = nc.kubeClient.CoreV1().Nodes().Get(name, metav1.GetOptions{})
			if err != nil {
				klog.Errorf("Failed while getting a Node to retry updating node health. Probably Node %s was deleted.", name)
				return false, err
			}
			return false, nil
		}); err != nil {
			klog.Errorf("Update health of Node '%v' from Controller error: %v. "+
				"Skipping - no pods will be evicted.", node.Name, err)
			continue
		}

		// Only Nodes that are not excluded from disruption checks count toward zone state
		if !isNodeExcludedFromDisruptionChecks(node) {
			zoneToNodeConditions[utilnode.GetZoneKey(node)] = append(zoneToNodeConditions[utilnode.GetZoneKey(node)], currentReadyCondition)
		}

		if currentReadyCondition != nil {
			// ... 
			// If an error occurs during the Ready -> NotReady transition, mark the Node for retry so MarkPodsNotReady is enforced on the next iteration
			switch {
			case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
				// Report node event only once when status changed.
				nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
				fallthrough
			case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
				if err = nodeutil.MarkPodsNotReady(nc.kubeClient, pods, node.Name); err != nil {
					utilruntime.HandleError(fmt.Errorf("unable to mark all pods NotReady on node %v: %v; queuing for retry", node.Name, err))
					nc.nodesToRetry.Store(node.Name, struct{}{})
					continue
				}
			}
		}
	}
    // ...
	return nil
}

So when a Node transitions from Ready to NotReady, the node lifecycle controller (either immediately or on a later pass via nodesToRetry) sets the Ready condition of every Pod on that Node to False.
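
MarkPodsNotReady essentially patches the Ready condition of every Pod assigned to the affected Node to False in the apiserver; the Pods themselves keep running. A simplified sketch of that effect (not the verbatim upstream function; the update call assumes a recent client-go):

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markPodsNotReady flips the Ready condition of each Pod assigned to nodeName
// to False and persists it via UpdateStatus -- this is the write that ends up
// in etcd as Ready=False, while the containers keep running on the node.
func markPodsNotReady(clientset kubernetes.Interface, pods []*corev1.Pod, nodeName string) error {
	for _, pod := range pods {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		for i, cond := range pod.Status.Conditions {
			if cond.Type != corev1.PodReady {
				continue
			}
			pod.Status.Conditions[i].Status = corev1.ConditionFalse
			pod.Status.Conditions[i].LastTransitionTime = metav1.Now()
			if _, err := clientset.CoreV1().Pods(pod.Namespace).
				UpdateStatus(context.TODO(), pod, metav1.UpdateOptions{}); err != nil {
				return err
			}
			break
		}
	}
	return nil
}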

Why the Pod was not evicted after being marked not ready

The Pod was not evicted because it carries tolerations for the relevant taints, so it did not have to be evicted and restarted:

 tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoSchedule
    key: qfusion/zone
    operator: Equal
    value: az02
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready    # tolerates the not-ready taint
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable   # tolerates the unreachable taint
    operator: Exists
    tolerationSeconds: 300

Why the Pod's status was not refreshed from the kubelet's local state after the Node recovered

The kubelet's status_manager maintains a podStatuses cache for all Pods:

type manager struct {
	podManager kubepod.Manager
	// Map from pod UID to sync status of the corresponding pod.
	podStatuses      map[types.UID]versionedPodStatus 
	podStatusesLock  sync.RWMutex
	podStatusChannel chan podStatusSyncRequest
}

This structure is updated in only two places:

  1. The 10-second ticker and the kubelet's pod status channel drive syncing:
func (m *manager) Start() {

	syncTicker := time.Tick(10 * time.Second)
	// syncPod and syncBatch share the same go routine to avoid sync races.
	go wait.Forever(func() {
		select {
		case syncRequest := <-m.podStatusChannel:
			klog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
				syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
			m.syncPod(syncRequest.podUID, syncRequest.status)
		case <-syncTicker:
			m.syncBatch()
		}
	}, 0)
}
  2. In that sync path, updateStatusInternal writes the latest status into the cache:
func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
	// Read the previous status from the local cache
	var oldStatus v1.PodStatus
	// Problem area: oldStatus should not be taken from this cache; it should be
	// derived locally, e.g. via podutil.GetPodCondition(status, conditionType).
	cachedStatus, isCached := m.podStatuses[pod.UID]
	if isCached {
		oldStatus = cachedStatus.status
	} else if mirrorPod, ok := m.podManager.GetMirrorPodByPod(pod); ok {
		oldStatus = mirrorPod.Status
	} else {
		oldStatus = pod.Status
	}
	// ...
	// Write the latest status back into the cache
	newStatus := versionedPodStatus{
		status:       status,
		version:      cachedStatus.version + 1,
		podName:      pod.Name,
		podNamespace: pod.Namespace,
	}
	m.podStatuses[pod.UID] = newStatus
	// ...
}

Every sync reads the old status from this cache and writes the new status back into the same cache. The locally computed status (Ready=True) therefore always matches the cached one, so the kubelet concludes nothing has changed, never pushes an update, and the Ready=False condition written to the apiserver by the node lifecycle controller is never corrected.
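
The in-code comment above points at an alternative: derive the old Ready condition from the Pod object the kubelet received from the apiserver (pod.Status) rather than from the status manager's own cache, which is roughly the idea behind the rejected PRs below. A hedged sketch of that idea only, not the actual diff from those PRs:

package status

import (
	v1 "k8s.io/api/core/v1"
	podutil "k8s.io/kubernetes/pkg/api/v1/pod"
)

// oldReadyCondition reads the Ready condition from pod.Status, which roughly
// tracks what the apiserver currently stores (including a Ready=False written
// by the node lifecycle controller), instead of the status manager's cache.
func oldReadyCondition(pod *v1.Pod) *v1.PodCondition {
	_, cond := podutil.GetPodCondition(&pod.Status, v1.PodReady)
	return cond
}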

Fixes along these lines have been proposed in the community, but they were rejected by the committers:

https://github.com/kubernetes/kubernetes/pull/92379/files
https://github.com/kubernetes/kubernetes/pull/89155/files

The committers' explanation

Fetching the latest Pod data on every sync would cost performance and CPU; the recommended mitigation is https://github.com/kubernetes/kubernetes/pull/84951

Mitigation logic: at kubelet startup, the local podStatuses cache holds only the status of static Pods (kube-proxy, kube-apiserver, kube-controller-manager, and kube-scheduler); only after the first full sync has collected the status of all Pods is everything written into podStatuses. This alleviates the problem.

Summary

The conditions that trigger this case are quite narrow: the Node must first go NotReady; the Pod must not be evicted and restarted; and after the Node becomes Ready again, the status update must land exactly within one of the status_manager's 10-second periodic check windows.

Short-term workarounds

  1. Run multiple replicas, so that a Ready Pod is always present in the Service endpoints (this makes the problem much harder to hit).
  2. Add alerting for mismatches between Deployment and ReplicaSet status (see the detection sketch after this list).
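
For item 2, the mismatch can be detected by comparing a Deployment's desired replicas against its available replicas. In practice this would usually be a kube-state-metrics based alert, but a minimal client-go sketch (the function name and wiring are illustrative) looks like this:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportUnavailableDeployments flags Deployments whose available replicas have
// fallen below the desired count -- the "MinimumReplicasUnavailable" situation
// seen in the Deployment status above.
func reportUnavailableDeployments(clientset kubernetes.Interface, namespace string) error {
	deployments, err := clientset.AppsV1().Deployments(namespace).
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, d := range deployments.Items {
		desired := int32(1)
		if d.Spec.Replicas != nil {
			desired = *d.Spec.Replicas
		}
		if d.Status.AvailableReplicas < desired {
			fmt.Printf("deployment %s/%s: %d/%d replicas available\n",
				d.Namespace, d.Name, d.Status.AvailableReplicas, desired)
		}
	}
	return nil
}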

Longer term: upgrade Kubernetes to 1.18+.

Related upstream issue

Related issue: https://github.com/kubernetes/kubernetes/issues/85389

The mitigation patch is included in 1.15.8 / 1.16.7 / 1.17.4 and later patch releases, as well as in 1.18+.
