Troubleshooting: Pod Ready condition stuck at False after a Kubernetes Node network failure
Symptom
After a network failure, the Kubernetes Node recovered. The Pods on that Node are still in the Running phase, but their Ready condition is False:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-09-27T02:17:29Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:25Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://336c6bba9a25bc713e63c4386e42cfa2165ad530c4478be09ec0174475c5489c
    image: haproxy-cmss:v1.7.12
    imageID: docker://sha256:7a5c25e9b4740f54520f172c586f542a3a12fc44a6aa9473783f965bf2ba403b
    lastState: {}
    name: xxxxx
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-11-18T12:11:24Z"
  hostIP: 10.175.149.5
  phase: Running
  podIP: 10.222.197.153
  podIPs:
  - ip: 10.222.197.153
  qosClass: Burstable
  startTime: "2021-11-18T12:11:20Z"
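For reference, the same inconsistency (Ready=False while ContainersReady=True) can be detected programmatically across a namespace. Below is a minimal client-go sketch; the kubeconfig path and namespace are assumptions, and the call signatures assume client-go 0.18 or later:

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// condition returns the status of the given pod condition type, or ConditionUnknown if absent.
func condition(pod *corev1.Pod, t corev1.PodConditionType) corev1.ConditionStatus {
    for _, c := range pod.Status.Conditions {
        if c.Type == t {
            return c.Status
        }
    }
    return corev1.ConditionUnknown
}

func main() {
    // Assumed kubeconfig path and namespace.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    pods, err := cs.CoreV1().Pods("xxxx").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for i := range pods.Items {
        p := &pods.Items[i]
        // Flag pods whose containers are ready but whose Ready condition is stuck at False.
        if condition(p, corev1.ContainersReady) == corev1.ConditionTrue &&
            condition(p, corev1.PodReady) == corev1.ConditionFalse {
            fmt.Printf("%s/%s: Ready=False but ContainersReady=True\n", p.Namespace, p.Name)
        }
    }
}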
Analysis process
kubelet log analysis
A few observations from the kubelet logs:
- The kubelet's node lease PUT requests failed intermittently (10:16:17-10:25:27), and the Node was set to NodeNotReady (a lease-check sketch follows this list);
- After NodeNotReady, several workload Pods were evicted and restarted (not including the Pod above);
- This Pod was not restarted, but its lastTransitionTime was updated.
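To check whether the kubelet is still renewing its node lease, one can read the Lease object in the kube-node-lease namespace (assuming node leases are enabled, as they are by default on these versions). A minimal client-go sketch; the kubeconfig path and node name are assumptions:

package main

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumed kubeconfig path and node name.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    nodeName := "node-10-175-149-5"
    // The kubelet renews a Lease named after the node in the kube-node-lease namespace.
    lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(context.TODO(), nodeName, metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    if lease.Spec.RenewTime != nil {
        age := time.Since(lease.Spec.RenewTime.Time).Round(time.Second)
        fmt.Printf("node %s lease last renewed %s ago\n", nodeName, age)
    }
}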
Checking Deployment, ReplicaSet and Pod status for inconsistencies
Pod conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   True
  PodScheduled      True

$ kubectl get deployment xxxxx -n xxxx -o yaml
status:
  conditions:
  - lastTransitionTime: "2021-07-30T08:38:16Z"
    lastUpdateTime: "2021-07-30T08:38:20Z"
    message: ReplicaSet "haproxy-79f86bb4f4" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-09-27T02:17:29Z"
    lastUpdateTime: "2022-09-27T02:17:29Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
$ kubectl get rs -n xxxxx
NAME                  DESIRED   CURRENT   READY   AGE
xxxxxxxx-58fffcb49d   1         1         1       2d20h
$ kubectl get pod -n xxxxx
NAME                        READY   STATUS    RESTARTS   AGE
blackbox-58fffcb49d-6nlg6   1/1     Running   0          2d20h
The Deployment and the Pod do not agree: the Deployment reports MinimumReplicasUnavailable, while the ReplicaSet and the Pod report all replicas ready.
Analysis
Why is the pod Ready condition in etcd set to 'False'?
The node lifecycle controller in kube-controller-manager periodically checks Node status. If the kubelet stops reporting through its node lease within the lease period, the node lifecycle controller runs its node health monitor logic:
func (nc *Controller) monitorNodeHealth() error {
    // List all nodes from the node lister cache
    nodes, err := nc.nodeLister.List(labels.Everything())
    if err != nil {
        return err
    }
    added, deleted, newZoneRepresentatives := nc.classifyNodes(nodes)
    // Set up a pod evictor for every newly seen zone
    for i := range newZoneRepresentatives {
        nc.addPodEvictorForNewZone(newZoneRepresentatives[i])
    }
    // Handle added nodes
    for i := range added {
        //...
    }
    // Handle deleted nodes
    for i := range deleted {
        // ...
    }
    zoneToNodeConditions := map[string][]*v1.NodeCondition{}
    // Inspect every node
    for i := range nodes {
        var gracePeriod time.Duration
        var observedReadyCondition v1.NodeCondition
        var currentReadyCondition *v1.NodeCondition
        node := nodes[i].DeepCopy()
        // Retry updating node health, re-fetching the latest Node from the apiserver on failure
        if err := wait.PollImmediate(retrySleepTime, retrySleepTime*scheduler.NodeHealthUpdateRetry, func() (bool, error) {
            gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
            if err == nil {
                return true, nil
            }
            name := node.Name
            node, err = nc.kubeClient.CoreV1().Nodes().Get(name, metav1.GetOptions{})
            if err != nil {
                klog.Errorf("Failed while getting a Node to retry updating node health. Probably Node %s was deleted.", name)
                return false, err
            }
            return false, nil
        }); err != nil {
            klog.Errorf("Update health of Node '%v' from Controller error: %v. "+
                "Skipping - no pods will be evicted.", node.Name, err)
            continue
        }
        // Only include nodes that are not excluded from disruption checks
        if !isNodeExcludedFromDisruptionChecks(node) {
            zoneToNodeConditions[utilnode.GetZoneKey(node)] = append(zoneToNodeConditions[utilnode.GetZoneKey(node)], currentReadyCondition)
        }
        if currentReadyCondition != nil {
            // ...
            // If an error occurred during the node status transition (Ready -> NotReady), mark the node
            // for retry so that MarkPodsNotReady is enforced on the next iteration
            switch {
            case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
                // Report node event only once when status changed.
                nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
                fallthrough
            case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
                if err = nodeutil.MarkPodsNotReady(nc.kubeClient, pods, node.Name); err != nil {
                    utilruntime.HandleError(fmt.Errorf("unable to mark all pods NotReady on node %v: %v; queuing for retry", node.Name, err))
                    nc.nodesToRetry.Store(node.Name, struct{}{})
                    continue
                }
            }
        }
    }
    // ...
    return nil
}
When the Node goes from Ready to NotReady, the controller (directly, or via nodesToRetry on the next iteration) calls MarkPodsNotReady, which sets the Ready condition of every pod on that node to False.
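For reference, a simplified sketch of what MarkPodsNotReady effectively does; this is an approximation rather than the exact upstream implementation, and the UpdateStatus signature follows a recent client-go:

package sketch

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// markPodsNotReady flips the PodReady condition to False in the apiserver for every
// pod bound to the given node. This status write is what later shows up as the stale
// Ready=False condition in etcd, even though the containers never restarted.
func markPodsNotReady(kubeClient kubernetes.Interface, pods []*v1.Pod, nodeName string) error {
    for i := range pods {
        pod := pods[i].DeepCopy()
        if pod.Spec.NodeName != nodeName {
            continue
        }
        for j := range pod.Status.Conditions {
            cond := &pod.Status.Conditions[j]
            if cond.Type != v1.PodReady {
                continue
            }
            cond.Status = v1.ConditionFalse
            cond.Reason = "NodeNotReady"
            cond.LastTransitionTime = metav1.Now()
            if _, err := kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(context.TODO(), pod, metav1.UpdateOptions{}); err != nil {
                return err
            }
            break
        }
    }
    return nil
}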
Why wasn't the pod evicted after it was marked not ready?
The pod was not evicted because it tolerates the relevant taints (tolerationSeconds: 300 defers eviction for 300 seconds after the taint is applied), so it did not need to be evicted and restarted:
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
- effect: NoSchedule
  key: qfusion/zone
  operator: Equal
  value: az02
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready # tolerates the not-ready taint
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable # tolerates the unreachable taint
  operator: Exists
  tolerationSeconds: 300
Why wasn't the pod status restored from the kubelet's local state after the Node recovered?
The kubelet's status_manager maintains a podStatuses map:
type manager struct {
    podManager kubepod.Manager
    // Map from pod UID to sync status of the corresponding pod.
    podStatuses      map[types.UID]versionedPodStatus
    podStatusesLock  sync.RWMutex
    podStatusChannel chan podStatusSyncRequest
}
This structure is updated in only two places: a 10-second ticker and the kubelet's pod status change channel:
func (m *manager) Start() {
    syncTicker := time.Tick(10 * time.Second)
    // syncPod and syncBatch share the same go routine to avoid sync races.
    go wait.Forever(func() {
        select {
        case syncRequest := <-m.podStatusChannel:
            klog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
                syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
            m.syncPod(syncRequest.podUID, syncRequest.status)
        case <-syncTicker:
            m.syncBatch()
        }
    }, 0)
}
- In the (10s) status update path, updateStatusInternal writes the latest status into this structure:
func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // Read the previous status from the local cache
    var oldStatus v1.PodStatus
    cachedStatus, isCached := m.podStatuses[pod.UID] // This is the problem: the old status should not come from the cache, but from `podutil.GetPodCondition(status, conditionType)`
    if isCached {
        oldStatus = cachedStatus.status
    } else if mirrorPod, ok := m.podManager.GetMirrorPodByPod(pod); ok {
        oldStatus = mirrorPod.Status
    } else {
        oldStatus = pod.Status
    }
    // ...
    // Write the latest status back into the cache
    newStatus := versionedPodStatus{
        status:       status,
        version:      cachedStatus.version + 1,
        podName:      pod.Name,
        podNamespace: pod.Namespace,
    }
    m.podStatuses[pod.UID] = newStatus
    // ...
}
Each update reads the old status from the local cache and then writes the new status back into the same cache, so the status manager never picks up the Ready=False condition that was written to the apiserver and never sends a correcting update. Fixes for this have been submitted in the community, but they were rejected by the committers:
https://github.com/kubernetes/kubernetes/pull/92379/files https://github.com/kubernetes/kubernetes/pull/89155/files
The committers' explanation: fetching the latest pod data on every update would hurt performance and CPU usage, so a mitigation was recommended instead.
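To make the failure mode described above concrete, here is a toy illustration (purely illustrative types and names, not kubelet code): a status manager that only ever compares against its own cache never notices that something else changed the apiserver copy, so it never sends a correcting update.

package main

import "fmt"

type status struct{ ready bool }

// statusManager mimics the podStatuses cache: the "old" status is always read from
// here, and the new status is always written back here.
type statusManager struct {
    cache map[string]status
}

// sync compares the freshly computed status only against the cached copy.
func (m *statusManager) sync(pod string, computed status) (updated bool) {
    if old, ok := m.cache[pod]; ok && old == computed {
        // Nothing changed from the cache's point of view, so nothing is pushed
        // to the apiserver, even if the apiserver copy is different.
        return false
    }
    m.cache[pod] = computed
    return true
}

func main() {
    m := &statusManager{cache: map[string]status{"haproxy": {ready: true}}}

    // The node lifecycle controller marks the pod NotReady directly in the apiserver.
    apiserver := status{ready: false}

    // The kubelet's local view is still ready=true, but sync compares against its
    // own cache, sees no difference, and never overwrites the stale apiserver copy.
    if !m.sync("haproxy", status{ready: true}) {
        fmt.Println("no update sent; apiserver still has ready =", apiserver.ready)
    }
}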
Mitigation: https://github.com/kubernetes/kubernetes/pull/84951
The mitigation logic: at startup, the local podStatuses cache only stores the status of static pods (kube-proxy, kube-apiserver, kube-controller-manager and kube-scheduler); only after the first sync loop has fetched the status of all pods are they all written into podStatuses. This alleviates the problem.
Summary
The conditions needed to trigger this case are quite strict: the node must first go NotReady; the pod must not be evicted and restarted; and then, after the node becomes Ready again, the status update must land within a single 10s status_manager sync interval.
Short-term workarounds
- Run multiple replicas, so that at least one Ready pod remains in the Service endpoints (the problem is not easy to reproduce);
- Add alerting for mismatches between Deployment and ReplicaSet status (a sketch follows this list);
- Upgrade Kubernetes to 1.18+.
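As a sketch of the alerting idea, the check below flags Deployments that report unavailable replicas while their ReplicaSets report all replicas ready, which is exactly the mismatch seen in this case. The kubeconfig path and namespace are assumptions, and the call signatures assume client-go 0.18 or later:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumed kubeconfig path and namespace.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    ns := "xxxx"
    deps, err := cs.AppsV1().Deployments(ns).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for i := range deps.Items {
        d := &deps.Items[i]
        if d.Status.UnavailableReplicas == 0 {
            continue
        }
        // The Deployment claims replicas are unavailable; ask its ReplicaSets.
        sel := metav1.FormatLabelSelector(d.Spec.Selector)
        rss, err := cs.AppsV1().ReplicaSets(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: sel})
        if err != nil {
            panic(err)
        }
        var ready int32
        for j := range rss.Items {
            ready += rss.Items[j].Status.ReadyReplicas
        }
        if d.Spec.Replicas != nil && ready >= *d.Spec.Replicas {
            fmt.Printf("ALERT %s/%s: Deployment reports %d unavailable, but ReplicaSets report %d ready\n",
                ns, d.Name, d.Status.UnavailableReplicas, ready)
        }
    }
}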
Similar issues in the community
Similar issue: https://github.com/kubernetes/kubernetes/issues/85389
The mitigating patch has been included in 1.15.8, 1.16.7, 1.17.4 and later, as well as in 1.18+.