Troubleshooting: Pod Ready condition stuck at False after a Kubernetes Node network failure
Symptom
After a network failure, the Kubernetes Node recovered. The Pods on that Node are still in the Running phase, but their Ready condition is False:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-09-27T02:17:29Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:25Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-11-18T12:11:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://336c6bba9a25bc713e63c4386e42cfa2165ad530c4478be09ec0174475c5489c
    image: haproxy-cmss:v1.7.12
    imageID: docker://sha256:7a5c25e9b4740f54520f172c586f542a3a12fc44a6aa9473783f965bf2ba403b
    lastState: {}
    name: xxxxx
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-11-18T12:11:24Z"
  hostIP: 10.175.149.5
  phase: Running
  podIP: 10.222.197.153
  podIPs:
  - ip: 10.222.197.153
  qosClass: Burstable
  startTime: "2021-11-18T12:11:20Z"
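For reference, the same inconsistency (Ready=False while ContainersReady=True) can be detected programmatically across a namespace. Below is a minimal client-go sketch; the kubeconfig path and namespace are assumptions, and the call signatures assume client-go 0.18 or later:

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// condition returns the status of the given pod condition type, or ConditionUnknown if absent.
func condition(pod *corev1.Pod, t corev1.PodConditionType) corev1.ConditionStatus {
    for _, c := range pod.Status.Conditions {
        if c.Type == t {
            return c.Status
        }
    }
    return corev1.ConditionUnknown
}

func main() {
    // Assumed kubeconfig path and namespace.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    pods, err := cs.CoreV1().Pods("xxxx").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for i := range pods.Items {
        p := &pods.Items[i]
        // Flag pods whose containers are ready but whose Ready condition is stuck at False.
        if condition(p, corev1.ContainersReady) == corev1.ConditionTrue &&
            condition(p, corev1.PodReady) == corev1.ConditionFalse {
            fmt.Printf("%s/%s: Ready=False but ContainersReady=True\n", p.Namespace, p.Name)
        }
    }
}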
Analysis process
kubelet log analysis
A few observations from the kubelet logs:
- The kubelet's node lease PUT requests failed intermittently (10:16:17-10:25:27), and the Node was set to NodeNotReady (a lease-check sketch follows this list);
- After NodeNotReady, several workload Pods were evicted and restarted (not including the Pod above);
- This Pod was not restarted, but its lastTransitionTime was updated.
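To check whether the kubelet is still renewing its node lease, one can read the Lease object in the kube-node-lease namespace (assuming node leases are enabled, as they are by default on these versions). A minimal client-go sketch; the kubeconfig path and node name are assumptions:

package main

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumed kubeconfig path and node name.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    nodeName := "node-10-175-149-5"
    // The kubelet renews a Lease named after the node in the kube-node-lease namespace.
    lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(context.TODO(), nodeName, metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    if lease.Spec.RenewTime != nil {
        age := time.Since(lease.Spec.RenewTime.Time).Round(time.Second)
        fmt.Printf("node %s lease last renewed %s ago\n", nodeName, age)
    }
}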
Checking Deployment, ReplicaSet and Pod status for inconsistencies
Pod conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   True
  PodScheduled      True

$ kubectl get deployment xxxxx -n xxxx -o yaml
status:
  conditions:
  - lastTransitionTime: "2021-07-30T08:38:16Z"
    lastUpdateTime: "2021-07-30T08:38:20Z"
    message: ReplicaSet "haproxy-79f86bb4f4" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-09-27T02:17:29Z"
    lastUpdateTime: "2022-09-27T02:17:29Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
$ kubectl get rs -n xxxxx
NAME                  DESIRED   CURRENT   READY   AGE
xxxxxxxx-58fffcb49d   1         1         1       2d20h
$ kubectl get pod -n xxxxx
NAME                        READY   STATUS    RESTARTS   AGE
blackbox-58fffcb49d-6nlg6   1/1     Running   0          2d20h
The Deployment and the Pod do not agree: the Deployment reports MinimumReplicasUnavailable, while the ReplicaSet and the Pod report all replicas ready.
Analysis
Why is the pod Ready condition in etcd set to 'False'?
The node lifecycle controller in kube-controller-manager periodically checks Node status. If the kubelet stops reporting through its node lease within the lease period, the node lifecycle controller runs its node health monitor logic:
func (nc *Controller) monitorNodeHealth() error {
    // List all nodes from the node lister cache
    nodes, err := nc.nodeLister.List(labels.Everything())
    if err != nil {
        return err
    }
    added, deleted, newZoneRepresentatives := nc.classifyNodes(nodes)
    // Set up a pod evictor for every newly seen zone
    for i := range newZoneRepresentatives {
        nc.addPodEvictorForNewZone(newZoneRepresentatives[i])
    }
    // Handle added nodes
    for i := range added {
        //...
    }
    // Handle deleted nodes
    for i := range deleted {
        // ...
    }
    zoneToNodeConditions := map[string][]*v1.NodeCondition{}
    // Inspect every node
    for i := range nodes {
        var gracePeriod time.Duration
        var observedReadyCondition v1.NodeCondition
        var currentReadyCondition *v1.NodeCondition
        node := nodes[i].DeepCopy()
        // Retry updating node health, re-fetching the latest Node from the apiserver on failure
        if err := wait.PollImmediate(retrySleepTime, retrySleepTime*scheduler.NodeHealthUpdateRetry, func() (bool, error) {
            gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeHealth(node)
            if err == nil {
                return true, nil
            }
            name := node.Name
            node, err = nc.kubeClient.CoreV1().Nodes().Get(name, metav1.GetOptions{})
            if err != nil {
                klog.Errorf("Failed while getting a Node to retry updating node health. Probably Node %s was deleted.", name)
                return false, err
            }
            return false, nil
        }); err != nil {
            klog.Errorf("Update health of Node '%v' from Controller error: %v. "+
                "Skipping - no pods will be evicted.", node.Name, err)
            continue
        }
        // Only include nodes that are not excluded from disruption checks
        if !isNodeExcludedFromDisruptionChecks(node) {
            zoneToNodeConditions[utilnode.GetZoneKey(node)] = append(zoneToNodeConditions[utilnode.GetZoneKey(node)], currentReadyCondition)
        }
        if currentReadyCondition != nil {
            // ...
            // If an error occurred during the node status transition (Ready -> NotReady), mark the node
            // for retry so that MarkPodsNotReady is enforced on the next iteration
            switch {
            case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
                // Report node event only once when status changed.
                nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
                fallthrough
            case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
                if err = nodeutil.MarkPodsNotReady(nc.kubeClient, pods, node.Name); err != nil {
                    utilruntime.HandleError(fmt.Errorf("unable to mark all pods NotReady on node %v: %v; queuing for retry", node.Name, err))
                    nc.nodesToRetry.Store(node.Name, struct{}{})
                    continue
                }
            }
        }
    }
    // ...
    return nil
}
When the Node goes from Ready to NotReady, the controller (directly, or via nodesToRetry on the next iteration) calls MarkPodsNotReady, which sets the Ready condition of every pod on that node to False.
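For reference, a simplified sketch of what MarkPodsNotReady effectively does; this is an approximation rather than the exact upstream implementation, and the UpdateStatus signature follows a recent client-go:

package sketch

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// markPodsNotReady flips the PodReady condition to False in the apiserver for every
// pod bound to the given node. This status write is what later shows up as the stale
// Ready=False condition in etcd, even though the containers never restarted.
func markPodsNotReady(kubeClient kubernetes.Interface, pods []*v1.Pod, nodeName string) error {
    for i := range pods {
        pod := pods[i].DeepCopy()
        if pod.Spec.NodeName != nodeName {
            continue
        }
        for j := range pod.Status.Conditions {
            cond := &pod.Status.Conditions[j]
            if cond.Type != v1.PodReady {
                continue
            }
            cond.Status = v1.ConditionFalse
            cond.Reason = "NodeNotReady"
            cond.LastTransitionTime = metav1.Now()
            if _, err := kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(context.TODO(), pod, metav1.UpdateOptions{}); err != nil {
                return err
            }
            break
        }
    }
    return nil
}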
Why wasn't the pod evicted after it was marked not ready?
The pod was not evicted because it tolerates the relevant taints (tolerationSeconds: 300 defers eviction for 300 seconds after the taint is applied), so it did not need to be evicted and restarted:
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
  operator: Exists
- effect: NoSchedule
  key: qfusion/zone
  operator: Equal
  value: az02
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready # tolerates the not-ready taint
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable # tolerates the unreachable taint
  operator: Exists
  tolerationSeconds: 300
Why wasn't the pod status restored from the kubelet's local state after the Node recovered?
The kubelet's status_manager maintains a podStatuses map:
type manager struct {
    podManager kubepod.Manager
    // Map from pod UID to sync status of the corresponding pod.
    podStatuses      map[types.UID]versionedPodStatus
    podStatusesLock  sync.RWMutex
    podStatusChannel chan podStatusSyncRequest
}
This structure is updated in only two places: a 10-second ticker and the kubelet's pod status change channel:
func (m *manager) Start() {
    syncTicker := time.Tick(10 * time.Second)
    // syncPod and syncBatch share the same go routine to avoid sync races.
    go wait.Forever(func() {
        select {
        case syncRequest := <-m.podStatusChannel:
            klog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
                syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
            m.syncPod(syncRequest.podUID, syncRequest.status)
        case <-syncTicker:
            m.syncBatch()
        }
    }, 0)
}
- In the (10s) status update path, updateStatusInternal writes the latest status into this structure:
func (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {
    // Read the previous status from the local cache
    var oldStatus v1.PodStatus
    cachedStatus, isCached := m.podStatuses[pod.UID] // This is the problem: the old status should not come from the cache, but from `podutil.GetPodCondition(status, conditionType)`
    if isCached {
        oldStatus = cachedStatus.status
    } else if mirrorPod, ok := m.podManager.GetMirrorPodByPod(pod); ok {
        oldStatus = mirrorPod.Status
    } else {
        oldStatus = pod.Status
    }
    // ...
    // Write the latest status back into the cache
    newStatus := versionedPodStatus{
        status:       status,
        version:      cachedStatus.version + 1,
        podName:      pod.Name,
        podNamespace: pod.Namespace,
    }
    m.podStatuses[pod.UID] = newStatus
    // ...
}
Each update reads the old status from the local cache and then writes the new status back into the same cache, so the status manager never picks up the Ready=False condition that was written to the apiserver and never sends a correcting update. Fixes for this have been submitted in the community, but they were rejected by the committers:
https://github.com/kubernetes/kubernetes/pull/92379/files https://github.com/kubernetes/kubernetes/pull/89155/files
The committers' explanation: fetching the latest pod data on every update would hurt performance and CPU usage, so a mitigation was recommended instead.
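To make the failure mode described above concrete, here is a toy illustration (purely illustrative types and names, not kubelet code): a status manager that only ever compares against its own cache never notices that something else changed the apiserver copy, so it never sends a correcting update.

package main

import "fmt"

type status struct{ ready bool }

// statusManager mimics the podStatuses cache: the "old" status is always read from
// here, and the new status is always written back here.
type statusManager struct {
    cache map[string]status
}

// sync compares the freshly computed status only against the cached copy.
func (m *statusManager) sync(pod string, computed status) (updated bool) {
    if old, ok := m.cache[pod]; ok && old == computed {
        // Nothing changed from the cache's point of view, so nothing is pushed
        // to the apiserver, even if the apiserver copy is different.
        return false
    }
    m.cache[pod] = computed
    return true
}

func main() {
    m := &statusManager{cache: map[string]status{"haproxy": {ready: true}}}

    // The node lifecycle controller marks the pod NotReady directly in the apiserver.
    apiserver := status{ready: false}

    // The kubelet's local view is still ready=true, but sync compares against its
    // own cache, sees no difference, and never overwrites the stale apiserver copy.
    if !m.sync("haproxy", status{ready: true}) {
        fmt.Println("no update sent; apiserver still has ready =", apiserver.ready)
    }
}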
Mitigation: https://github.com/kubernetes/kubernetes/pull/84951
The mitigation logic: at startup, the local podStatuses cache only stores the status of static pods (kube-proxy, kube-apiserver, kube-controller-manager and kube-scheduler); only after the first sync loop has fetched the status of all pods are they all written into podStatuses. This alleviates the problem.
Summary
The conditions needed to trigger this case are quite strict: the node must first go NotReady; the pod must not be evicted and restarted; and then, after the node becomes Ready again, the status update must land within a single 10s status_manager sync interval.
Short-term workarounds
- Run multiple replicas, so that at least one Ready pod remains in the Service endpoints (the problem is not easy to reproduce);
- Add alerting for mismatches between Deployment and ReplicaSet status (a sketch follows this list);
- Upgrade Kubernetes to 1.18+.
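As a sketch of the alerting idea, the check below flags Deployments that report unavailable replicas while their ReplicaSets report all replicas ready, which is exactly the mismatch seen in this case. The kubeconfig path and namespace are assumptions, and the call signatures assume client-go 0.18 or later:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumed kubeconfig path and namespace.
    cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    ns := "xxxx"
    deps, err := cs.AppsV1().Deployments(ns).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for i := range deps.Items {
        d := &deps.Items[i]
        if d.Status.UnavailableReplicas == 0 {
            continue
        }
        // The Deployment claims replicas are unavailable; ask its ReplicaSets.
        sel := metav1.FormatLabelSelector(d.Spec.Selector)
        rss, err := cs.AppsV1().ReplicaSets(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: sel})
        if err != nil {
            panic(err)
        }
        var ready int32
        for j := range rss.Items {
            ready += rss.Items[j].Status.ReadyReplicas
        }
        if d.Spec.Replicas != nil && ready >= *d.Spec.Replicas {
            fmt.Printf("ALERT %s/%s: Deployment reports %d unavailable, but ReplicaSets report %d ready\n",
                ns, d.Name, d.Status.UnavailableReplicas, ready)
        }
    }
}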
Similar issues in the community
Similar issue: https://github.com/kubernetes/kubernetes/issues/85389
The mitigating patch has been included in 1.15.8, 1.16.7, 1.17.4 and later, as well as in 1.18+.