Kubernetes APIServer errors: 504 GatewayTimeout
Problem 1: Kubernetes APIServer 503 logging error output: service unavailable
The metrics.k8s.io/v1beta1 api-resource does not exist
I0908 17:58:06.013828 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1?timeout=32s" latency="511.066µs" userAgent="kubectl/v1.21.5 (linux/amd64) kubernetes/83c85a6" srcIP="10.253.101.202:56436" resp=503 statusStack="
goroutine 178079822 [running]:
....
"
addedInfo=`logging error output: "service unavailable"`
The metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod> api-resource does not exist
I0908 17:56:47.382208 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/metrics.k8s.io/v1beta1/namespaces/mysql-system/pods/omp-5cb4ccf866-qxvcx" latency="602.544µs" userAgent="xxxx/v0.0.0 (linux/amd64) kubernetes/$Format" srcIP="172.17.0.1:4446" resp=503 statusStack="
goroutine 177865937 [running]:
.....
"
addedInfo=`logging error output: "service unavailable"`
The two errors above are the same class of problem. Query the cluster:
$ kubectl api-resources
NAME                SHORTNAMES   APIVERSION   NAMESPACED   KIND
bindings                         v1           true         Binding
componentstatuses   cs           v1           false        ComponentStatus
configmaps          cm           v1           true         ConfigMap
endpoints           ep           v1           true         Endpoints
events              ev           v1           true         Event
limitranges         limits       v1           true         LimitRange
namespaces          ns           v1           false        Namespace
...
$ kubectl api-resources | grep metrics
# (no output)
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (NotFound): the server could not find the requested resource
The metrics.k8s.io api-resources are provided by metrics-server, which is not installed in this cluster, so the apiserver returning 503 directly is expected behavior.
Fix: install metrics-server in the cluster, then confirm the installation works with kubectl top node / kubectl top pod.
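A minimal install sketch, assuming the upstream release manifest is reachable and compatible with the cluster version (on clusters with self-signed kubelet certificates the --kubelet-insecure-tls flag may additionally be needed):
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
$ kubectl -n kube-system rollout status deployment/metrics-server
# once the APIService is ready, these should return data instead of 503/NotFound
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1
$ kubectl top node
$ kubectl top pod -A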
Problem 2: Kubernetes APIServer "Fast watcher, slow processing."
I0908 17:58:22.138809 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138843 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138862 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138879 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138893 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.138908 1 get.go:260] "Starting watch" path="/api/v1/endpoints" resourceVersion="4864536331" labels="" fields="" timeout="5m54s"
I0908 17:58:22.138911 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.139132 1 get.go:260] "Starting watch" path="/api/v1/namespaces/58a153a7/endpoints" resourceVersion="4864536310" labels="" fields="" timeout="8m29s"
I0908 17:58:22.139370 1 watcher.go:286] "Fast watcher, slow processing. Probably caused by slow dispatching events to watchers" outgoingEvents=100
I0908 17:58:22.139398 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.139491 1 watcher.go:405] "Fast watcher, slow processing. Probably caused by slow decoding, user not receiving fast, or other processing logic" incomingEvents=100
I0908 17:58:22.139497 1 get.go:260] "Starting watch" path="/api/v1/namespaces/58a153a7/endpoints" resourceVersion="4864536310" labels="" fields="" timeout="7m51s"
I0908 17:58:22.139882 1 get.go:260] "Starting watch" path="/apis/storage.k8s.io/v1/storageclasses" resourceVersion="4864131731" labels="" fields="" timeout="7m52s"
I0908 17:58:22.140232 1 get.go:260] "Starting watch" path="/api/v1/namespaces/200cec7c/endpoints" resourceVersion="4864536328" labels="" fields="" timeout="7m51s"
I0908 17:58:22.140637 1 get.go:260] "Starting watch" path="/api/v1/namespaces/80e4f068/endpoints" resourceVersion="4864536330" labels="" fields="" timeout="6m48s"
I0908 17:58:22.141129 1 get.go:260] "Starting watch" path="/api/v1/namespaces/58a153a7/endpoints" resourceVersion="4864536331" labels="" fields="" timeout="5m22s"
The core of the full list-watch path is:
watchChan.resultChan -> Cacher.incoming(make(chan *event, 100)) -> cacheWatcher.input
-> Cacher.outgoing(make(chan watch.Event, 100)) -> cacheWatcher.result
For the messages above:
- If incomingEvents is full, the watch is seeing too many changes being synced into the apiserver, and client-go needs to consume them promptly; in actual testing this channel did not get stuck.
- If outgoingEvents is full, there are several possibilities (a quick saturation check is sketched after this list):
  - client-go is hung, or no handler is registered for the corresponding resource type;
  - the apiserver is slow at serializing objects into JSON or YAML (or has too many objects to serialize);
  - the apiserver has too many HTTP/2 requests and is dispatching events too slowly (least likely).
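To judge how saturated the apiserver actually is, its own metrics can be checked; a quick sketch (metric names as exposed around this Kubernetes version, verify against your build's /metrics output):
$ kubectl get --raw /metrics | grep apiserver_current_inflight_requests
$ kubectl get --raw /metrics | grep apiserver_registered_watchers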
Possible fixes:
- --default-watch-cache-size: default 100; size of the cache pool used for list-watch; 1000 or more is recommended.
- --max-mutating-requests-inflight: default 200; in-flight limit for mutating (write) requests; 800 or higher is recommended.
- --max-requests-inflight: default 400; in-flight limit for read requests; 1600 or higher is recommended.
Because APF (API Priority and Fairness) is enabled by default in Kubernetes 1.21, a single apiserver's overall concurrency limit = max-mutating-requests-inflight + max-requests-inflight. A sketch of applying these flags is shown below.
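A sketch for a kubeadm-style control plane, assuming the static pod manifest lives at /etc/kubernetes/manifests/kube-apiserver.yaml (path and exact values are illustrative; the kubelet restarts the apiserver automatically after the manifest changes):
$ vi /etc/kubernetes/manifests/kube-apiserver.yaml
# add or adjust under spec.containers[0].command:
#   - --default-watch-cache-size=1000
#   - --max-requests-inflight=1600
#   - --max-mutating-requests-inflight=800
$ ps -ef | grep [k]ube-apiserver    # confirm the new flags after the restart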
Problem 3: Kubernetes APIServer 504 request did not complete within the allotted timeout
It occurred 374 times in total: 310 from client-go based servers, 30 from backup-set-controller, 17 from kubelet, and 17 from other components.
I0908 17:58:21.588385 1 httplog.go:89] "HTTP" verb="GET" URI="/api/v1/namespaces/9832e9e1/endpoints/a8e506da1df-a31cb?timeout=10s" latency="15.793707045s" userAgent="xxxx/v0.0.0 (linux/amd64) kubernetes/$Format" srcIP="10.253.91.20:2861" resp=504 statusStack="
goroutine 178072212 [running]:
....
addedInfo="logging error output:\\\"{\\\"metadata\":{},\\\"status\\\":\\\"Failure\\\",\\\"message\\\":\\\"Timeout: request did not complete within the allotted timeout\\\",\\\"reason\\\":\\\"Timeout\\\",\\\"details\\\":{},\\\"code\\\":504}\\n\"\n"
What is odd:
....
I0908 17:58:22.119654 1 trace.go:205] Trace[380084804]: "Get" url:/api/v1/namespaces/80e4f068/endpoints/a0bde634855-856b3,user-agent:xxxx/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.253.217.9,accept:application/json, */*,protocol:HTTP/2.0 (08-Sep-2023 17:58:05.395) (total time: 16724ms):
I0908 17:58:22.119675 1 trace.go:205] Trace[1088625249]: "Get" url:/api/v1/namespaces/58a153a7/endpoints/a669f05ab20-2b3be,user-agent:xxxx/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.253.91.21,accept:application/json, */*,protocol:HTTP/2.0 (08-Sep-2023 17:58:05.395) (total time: 16723ms):
I0908 17:58:22.119705 1 trace.go:205] Trace[204849122]: "Get" url:/api/v1/namespaces/1a45bfdc/endpoints/a08bb4ed8db-d407d,user-agent:xxxx/v0.0.0 (linux/amd64) kubernetes/$Format,client:10.253.217.9,accept:application/json, */*,protocol:HTTP/2.0 (08-Sep-2023 17:58:05.389) (total time: 16730ms)
....
Both GET and PUT requests time out at around 15s.
Code review points to client-go over HTTP/2: the request timeout is set, but the HTTP/2 health-check timeout and ping timeout are left at their defaults, which is not reasonable for workloads strongly sensitive to a 10s timeout. Adjusting them is recommended as the first phase.
Fix: adjust these environment variables:
- HTTP2_PING_TIMEOUT_SECONDS: 3 (seconds)
- HTTP2_READ_IDLE_TIMEOUT_SECONDS: 5 (seconds)
HTTP2_READ_IDLE_TIMEOUT_SECONDS is set to 1/2 of the request timeout, and HTTP2_PING_TIMEOUT_SECONDS to 1/4 of the request timeout.
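A minimal sketch of applying them to a client-go based workload, assuming it runs as a Deployment named my-controller (the name is illustrative); the variables are read by the apimachinery HTTP/2 transport setup at process start:
$ kubectl set env deployment/my-controller \
    HTTP2_READ_IDLE_TIMEOUT_SECONDS=5 \
    HTTP2_PING_TIMEOUT_SECONDS=3
$ kubectl set env deployment/my-controller --list | grep HTTP2    # verify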
Go's HTTP/2 implementation has many problematic edge cases, one of which is that dead connections are kept and reused from the connection pool.
Related HTTP/2 issues:
- https://github.com/golang/go/issues/32388
- https://github.com/golang/go/issues/39337
- https://github.com/golang/go/issues/39750
Other references:
- How Kubernetes implements list-watch under the hood
- kubelet timeouts when connecting to the apiserver
- https://github.com/kubernetes/kubernetes/pull/95981
- apiserver load testing