Kubernetes APIServer errors: 504 GatewayTimeout
Problem 1: Kubernetes APIServer 503 logging error output: service unavailable
The metrics.k8s.io/v1beta1 api-resource does not exist
I0908 17:58:06.013828 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1?timeout=32s" latency="511.066µs" userAgent="kubectl/v1.21.5 (linux/amd64) kubernetes/83c85a6" srcIP="10.253.101.202:56436" resp=503 statusStack="
goroutine 178079822 [running]:
....
"
addedInfo=`logging error output: "service unavailable"`
The metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod> api-resource does not exist
I0908 17:56:47.382208 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1/namespaces/mysql-system/pods/omp-5cb4ccf866-qxvcx" latency="602.544µs" userAgent="xxxx/v0.0.0 (linux/amd64) kubernetes/$Format" srcIP="172.17.0.1:4446" resp=503 statusStack="
goroutine 177865937 [running]:
.....
"
addedInfo=`logging error output: "service unavailable"`
The two errors above are the same class of problem. Query the cluster:
$ kubectl api-resources
NAME                SHORTNAMES   APIVERSION   NAMESPACED   KIND
bindings                         v1           true         Binding
componentstatuses   cs           v1           false        ComponentStatus
configmaps          cm           v1           true         ConfigMap
endpoints           ep           v1           true         Endpoints
events              ev           v1           true         Event
limitranges         limits       v1           true         LimitRange
namespaces          ns           v1           false        Namespace
...
$ kubectl api-resources | grep metrics
# (no output)
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (NotFound): the server could not find the requested resource
The metrics.k8s.io api-resources are provided by metrics-server, which is not installed in this cluster, so the apiserver returning 503 directly is expected behavior.
Fix: install metrics-server in the cluster, then confirm the installation works with kubectl top node / kubectl top pod.
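A minimal install sketch, assuming the upstream release manifest is reachable and compatible with the cluster version (on clusters with self-signed kubelet certificates the --kubelet-insecure-tls flag may additionally be needed):
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
$ kubectl -n kube-system rollout status deployment/metrics-server
# once the APIService is ready, these should return data instead of 503/NotFound
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1
$ kubectl top node
$ kubectl top pod -A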
Problem 2: Kubernetes APIServer "Fast watcher, slow processing."
I0908 17:58:22.138809 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138843 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138862 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138879 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138893 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138908 1 get.go:260] "Starting watch" path="/api/v1/endpoints" resourceVersion="4864536331" labels="" fields="" timeout="5m54s"
I0908 17:58:22.138911 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.139132 1 get.go:260] "Starting watch" path="/api/v1/namespaces/58a153a7/endpoints" resourceVersion="4864536310" labels="" fields="" timeout="8m29s"
I0908 17:58:22.139370 1 watcher.go:286] "Fast watcher, slow processing. Probably caused by slow dispatching events to watchers" outgoingEvents=100
I0908 17:58:22.139398 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.139491 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.139497 1 get.go:260] "Starting watch" path="/api/v1/namespaces/58a153a7/endpoints" resourceVersion="4864536310" labels="" fields="" timeout="7m51s"
I0908 17:58:22.139882 1 get.go:260] "Starting watch" path="/apis/storage.k8s.io/v1/storageclasses" resourceVersion="4864131731" labels="" fields="" timeout="7m52s"
I0908 17:58:22.140232 1 get.go:260] "Starting watch" path="/api/v1/namespaces/200cec7c/endpoints" resourceVersion="4864536328" labels="" fields="" timeout="7m51s"
I0908 17:58:22.140637 1 get.go:260] "Starting watch" path="/api/v1/namespaces/80e4f068/endpoints" resourceVersion="4864536330" labels="" fields="" timeout="6m48s"
I0908 17:58:22.141129 1 get.go:260] "Starting watch" path="/api/v1/namespaces/58a153a7/endpoints" resourceVersion="4864536331" labels="" fields="" timeout="5m22s"
The core of the full list-watch path is:
watchChan.resultChan -> Cacher.incoming(make(chan *event, 100)) -> cacheWatcher.input
-> Cacher.outgoing(make(chan watch.Event, 100)) -> cacheWatcher.result
For the messages above:
- If incomingEvents is full, the watch is seeing too many changes being synced into the apiserver, and client-go needs to consume them promptly; in actual testing this channel did not get stuck.
- If outgoingEvents is full, there are several possibilities (a quick saturation check is sketched after this list):
  - client-go is hung, or no handler is registered for the corresponding resource type;
  - the apiserver is slow at serializing objects into JSON or YAML (or has too many objects to serialize);
  - the apiserver has too many HTTP/2 requests and is dispatching events too slowly (least likely).
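To judge how saturated the apiserver actually is, its own metrics can be checked; a quick sketch (metric names as exposed around this Kubernetes version, verify against your build's /metrics output):
$ kubectl get --raw /metrics | grep apiserver_current_inflight_requests
$ kubectl get --raw /metrics | grep apiserver_registered_watchers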
Possible fixes:
- --default-watch-cache-size: default 100; size of the cache pool used for list-watch; 1000 or more is recommended.
- --max-mutating-requests-inflight: default 200; in-flight limit for mutating (write) requests; 800 or higher is recommended.
- --max-requests-inflight: default 400; in-flight limit for read requests; 1600 or higher is recommended.
Because APF (API Priority and Fairness) is enabled by default in Kubernetes 1.21, a single apiserver's overall concurrency limit = max-mutating-requests-inflight + max-requests-inflight. A sketch of applying these flags is shown below.
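A sketch for a kubeadm-style control plane, assuming the static pod manifest lives at /etc/kubernetes/manifests/kube-apiserver.yaml (path and exact values are illustrative; the kubelet restarts the apiserver automatically after the manifest changes):
$ vi /etc/kubernetes/manifests/kube-apiserver.yaml
# add or adjust under spec.containers[0].command:
#   - --default-watch-cache-size=1000
#   - --max-requests-inflight=1600
#   - --max-mutating-requests-inflight=800
$ ps -ef | grep [k]ube-apiserver    # confirm the new flags after the restart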
Problem 3: Kubernetes APIServer 504 request did not complete within the allotted timeout
It occurred 374 times in total: 310 from client-go based servers, 30 from backup-set-controller, 17 from kubelet, and 17 from other components.
I0908 17:58:21.588385 1 httplog.go:89] "HTTP" verb="GET" URI="/api/v1/namespaces/9832e9e1/endpoints/a8e506da1df-a31cb?timeout=10s" latency="15.793707045s" userAgent="xxxx/v0.0.0 (linux/amd64) kubernetes/$Format" srcIP="10.253.91.20:2861" resp=504 statusStack="
goroutine 178072212 [running]:
....
addedInfo="logging error output:\\\"{\\\"metadata\":{},\\\"status\\\":\\\"Failure\\\",\\\"message\\\":\\\"Timeout: request did not complete within the allotted timeout\\\",\\\"reason\\\":\\\"Timeout\\\",\\\"details\\\":{},\\\"code\\\":504}\\n\"\n"
What is odd:
....
I0908 17:58:22.119654 1 trace.go:205] Trace[380084804]: "Get" url:/api/v1/namespaces/80e4f068/endpoints/a0bde634855-856b3,user-agent:xxxx/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.253.217.9,accept:application/json, */*,protocol:HTTP/2.0 (08-Sep-2023 17:58:05.395) (total time: 16724ms):
I0908 17:58:22.119675 1 trace.go:205] Trace[1088625249]: "Get" url:/api/v1/namespaces/58a153a7/endpoints/a669f05ab20-2b3be,user-agent:xxxx/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.253.91.21,accept:application/json, */*,protocol:HTTP/2.0 (08-Sep-2023 17:58:05.395) (total time: 16723ms):
I0908 17:58:22.119705 1 trace.go:205] Trace[204849122]: "Get" url:/api/v1/namespaces/1a45bfdc/endpoints/a08bb4ed8db-d407d,user-agent:xxxx/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.253.217.9,accept:application/json, */*,protocol:HTTP/2.0 (08-Sep-2023 17:58:05.389) (total time: 16730ms)
....
Both GET and PUT requests time out at around 15s.
Code review points to client-go over HTTP/2: the request timeout is set, but the HTTP/2 health-check timeout and ping timeout are left at their defaults, which is not reasonable for workloads strongly sensitive to a 10s timeout. Adjusting them is recommended as the first phase.
Fix: adjust these environment variables:
- HTTP2_PING_TIMEOUT_SECONDS: 3 (seconds)
- HTTP2_READ_IDLE_TIMEOUT_SECONDS: 5 (seconds)
HTTP2_READ_IDLE_TIMEOUT_SECONDS is set to 1/2 of the request timeout, and HTTP2_PING_TIMEOUT_SECONDS to 1/4 of the request timeout.
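A minimal sketch of applying them to a client-go based workload, assuming it runs as a Deployment named my-controller (the name is illustrative); the variables are read by the apimachinery HTTP/2 transport setup at process start:
$ kubectl set env deployment/my-controller \
    HTTP2_READ_IDLE_TIMEOUT_SECONDS=5 \
    HTTP2_PING_TIMEOUT_SECONDS=3
$ kubectl set env deployment/my-controller --list | grep HTTP2    # verify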
Go's HTTP/2 implementation has many problematic edge cases, one of which is that dead connections are kept and reused from the connection pool.
Related HTTP/2 issues:
- https://github.com/golang/go/issues/32388
- https://github.com/golang/go/issues/39337
- https://github.com/golang/go/issues/39750
Other references:
- How Kubernetes implements list-watch under the hood
- kubelet timeouts when connecting to the apiserver
- https://github.com/kubernetes/kubernetes/pull/95981
- apiserver load testing