Technical Design: Network Bandwidth Traffic Control for Kubernetes Pods


Posted by 董江 on Wednesday, August 10, 2022


Background

In hybrid-cloud scenarios where business Pods directly interfere with each other, and in online/offline colocation (online and offline services serving users on the same machine), isolating CPU, memory, fd, inode, and PID is not enough: network bandwidth, disk read/write throughput (IOPS), NBD I/O, L3 cache, and memory bandwidth (MBA) all need to be isolated and limited as well.

This chapter therefore walks through the usage and implementation of network bandwidth limiting.

Usage and Implementation in Kubernetes

CNI plugin

When a container is started, the runtime invokes the underlying CNI network plugin to create the virtual network and bind it to the container. Limiting a container's network therefore relies on the CNI plugin: it translates the limit into concrete configuration for the Linux Traffic Control (tc) subsystem. tc provides the mechanisms and operations by which packets are queued for transmission/reception on a network interface; the bandwidth plugin uses the Token Bucket Filter (TBF) to achieve rate limiting.

How CNI drives Linux TC

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",  # must be 0.3.0, currently the highest version of the containernetworking plugins
  "plugins":
    [
      {
        "type": "calico",
        "log_level": "info",
        "datastore_type": "kubernetes",
        "nodename": "127.0.0.1",
        "ipam": { "type": "host-local", "subnet": "usePodCidr" },
        "policy": { "type": "k8s" },
        "kubernetes": { "kubeconfig": "/etc/cni/net.d/calico-kubeconfig" }
      },
      {
        "type": "bandwidth",
        "capabilities": {
          "bandwidth": true   # lets runtimes such as cri-o submit the limits as JSON at runtime
        }
        /* The static settings below are an alternative to capabilities -- use one or the other:
        "ingressRate": 123,
        "ingressBurst": 456,
        "egressRate": 123,
        "egressBurst": 456
        */
      }
    ]
}

The CNI plugin supports this static configuration, and also accepts limits submitted as JSON at runtime by cri-o, containerd, dockershim, and others.

func cmdAdd(args *skel.CmdArgs) error {
	// parse the CNI config passed on stdin
	conf, err := parseConfig(args.StdinData)
	if err != nil {
		return err
	}

	//...

	// read the ingress rate and burst from the config
	if bandwidth.IngressRate > 0 && bandwidth.IngressBurst > 0 {
		// create the rate-limit rule with a TC TBF qdisc
		err = CreateIngressQdisc(bandwidth.IngressRate, bandwidth.IngressBurst, hostInterface.Name)
		if err != nil {
			return err
		}
	}

	// read the egress rate and burst from the config
	if bandwidth.EgressRate > 0 && bandwidth.EgressBurst > 0 {
		// ...

		// set the egress rate-limit rule on the host device via an IFB device
		err = CreateEgressQdisc(bandwidth.EgressRate, bandwidth.EgressBurst, hostInterface.Name, ifbDeviceName)
		if err != nil {
			return err
		}
	}

	return types.PrintResult(result, conf.CNIVersion)
}

Rate-limit configuration via Pod annotations

Limits are configured through Pod annotations:

apiVersion: v1
kind: Pod
metadata:
  name: iperf-slow
  annotations:
    kubernetes.io/ingress-bandwidth: 10M
    kubernetes.io/egress-bandwidth: 10M
...

Kubernetes parses and consumes these pod annotations in code.

The values of kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth must lie in the range 1k-1P; values above 32G require kernel parameter tuning.

// allowed value range: 1k to 1P
var minRsrc = resource.MustParse("1k")  
var maxRsrc = resource.MustParse("1P")

// read the pod annotations; the values are later passed down to the runtime
func ExtractPodBandwidthResources(podAnnotations map[string]string) (ingress, egress *resource.Quantity, err error) {
	if podAnnotations == nil {
		return nil, nil, nil
	}
	str, found := podAnnotations["kubernetes.io/ingress-bandwidth"]
	if found {
		ingressValue, err := resource.ParseQuantity(str)
		if err != nil {
			return nil, nil, err
		}
		ingress = &ingressValue
		if err := validateBandwidthIsReasonable(ingress); err != nil {
			return nil, nil, err
		}
	}
	str, found = podAnnotations["kubernetes.io/egress-bandwidth"]
	if found {
		egressValue, err := resource.ParseQuantity(str)
		if err != nil {
			return nil, nil, err
		}
		egress = &egressValue
		if err := validateBandwidthIsReasonable(egress); err != nil {
			return nil, nil, err
		}
	}
	return ingress, egress, nil
}

Taking containerd as an example: after the kubelet obtains the pod spec, it passes the bandwidth settings to the containerd runtime, which in turn passes them on to the CNI plugin.

func cniNamespaceOpts(id string, config *runtime.PodSandboxConfig) ([]cni.NamespaceOpts, error) {
	opts := []cni.NamespaceOpts{
		cni.WithLabels(toCNILabels(id, config)),
		cni.WithCapability(annotations.PodAnnotations, config.Annotations),
	}

	portMappings := toCNIPortMappings(config.GetPortMappings())
	if len(portMappings) > 0 {
		opts = append(opts, cni.WithCapabilityPortMap(portMappings))
	}

	// read the bandwidth config from the pod annotations and pass it on to CNI
	bandWidth, err := toCNIBandWidth(config.Annotations)
	if err != nil {
		return nil, err
	}
	if bandWidth != nil {
		opts = append(opts, cni.WithCapabilityBandWidth(*bandWidth))
	}
	// ...
}

Verification and Testing

Rate limiting relies on the Linux TC subsystem, so it is currently only supported on Linux Kubernetes clusters.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-server-deployment
  labels:
    app: iperf-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf-server
  template:
    metadata:
      labels:
        app: iperf-server
      # add the bandwidth annotations
      annotations:
        kubernetes.io/ingress-bandwidth: 1M
        kubernetes.io/egress-bandwidth: 1M
    spec:
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
      - name: iperf3-server
        image: dongjiang1989/iperf
        args: ['-s', '-p', '5001']
        ports:
        - containerPort: 5001
          name: server
      terminationGracePeriodSeconds: 0

---
    
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-client
  labels:
    app: iperf-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf-client
  template:
    metadata:
      labels:
        app: iperf-client
    spec:
      containers:
      - name: iperf-client
        image: dongjiang1989/iperf
        command: ['/bin/sh', '-c', 'sleep 1d']
      terminationGracePeriodSeconds: 0

Without the bandwidth annotations:

$ kubectl get pod | grep iperf 
iperf-client-7874c47d95-t7hph              1/1     Running   0               5m58s
iperf-server-deployment-74d94bdd59-dzdl4   1/1     Running   0               5m58s
$ kubectl exec iperf-client-7874c47d95-t7hph -- iperf -c 10.1.0.173 -p 5001 -i 10 -t 100
------------------------------------------------------------
Client connecting to 10.1.0.173, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 10.1.0.172 port 56296 connected with 10.1.0.173 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.00 sec  19.7 GBytes  16.9 Gbits/sec
[  1] 10.00-20.00 sec  18.9 GBytes  16.2 Gbits/sec
[  1] 20.00-30.00 sec  20.0 GBytes  17.2 Gbits/sec
[  1] 30.00-40.00 sec  20.4 GBytes  17.5 Gbits/sec
[  1] 40.00-50.00 sec  18.5 GBytes  15.9 Gbits/sec
[  1] 50.00-60.00 sec  19.3 GBytes  16.5 Gbits/sec
[  1] 60.00-70.00 sec  17.6 GBytes  15.1 Gbits/sec
[  1] 70.00-80.00 sec  17.1 GBytes  14.7 Gbits/sec
[  1] 80.00-90.00 sec  18.4 GBytes  15.8 Gbits/sec
[  1] 90.00-100.00 sec  15.1 GBytes  13.0 Gbits/sec
[  1] 0.00-100.00 sec   185 GBytes  15.9 Gbits/sec

Without rate limiting, bandwidth reaches 15.9 Gbits/sec.

With the bandwidth annotations:

$ kubectl get pod | grep iperf
iperf-clients-rcsh6                        1/1     Running   0          7h7m
iperf-server-deployment-59675c8f78-g52pm   1/1     Running   0          6h52m

$ kubectl exec iperf-clients-rcsh6 -- iperf -c 10.1.0.170 -p 5001 -i 10 -t 100
------------------------------------------------------------
Client connecting to 10.1.0.170, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  1] local 10.1.0.170 port 54652 connected with 10.1.0.170 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.00 sec  3.50 MBytes  2.94 Mbits/sec
[  1] 10.00-20.00 sec  2.25 MBytes  1.89 Mbits/sec
[  1] 20.00-30.00 sec  2.04 MBytes  1.71 Mbits/sec
[  1] 30.00-40.00 sec   892 KBytes   731 Kbits/sec
[  1] 40.00-50.00 sec   954 KBytes   781 Kbits/sec
[  1] 50.00-60.00 sec  1.36 MBytes  1.14 Mbits/sec
[  1] 60.00-70.00 sec  1.18 MBytes   993 Kbits/sec
[  1] 70.00-80.00 sec  87.1 KBytes  71.4 Kbits/sec
[  1] 80.00-90.00 sec  0.000 Bytes  0.000 bits/sec
[  1] 90.00-100.00 sec  2.97 MBytes  2.50 Mbits/sec
[  1] 0.00-100.69 sec  15.5 MBytes  1.29 Mbits/sec

With a 1 Mbits/sec limit, the measured throughput is 1.29 Mbits/sec.

Why does the measured rate slightly exceed the 1 Mbits/sec limit? In Linux, 1M = 1024k, whereas the Resource objects Kubernetes uses treat 1M = 1000k. So a configured 1 Mbits/sec effectively shows up in Linux as 1024*1024 (bits/sec) / (1000*1000) = 1.048 Mbits/sec. In addition, TC is imprecise during the first second, which inflates the measured average.

Summary

    1. docker 1.18 supports passing the limits to the runc runtime as JSON; with containerd as the runtime, version 1.4 or later is required;
    2. calico requires version 2.1, cilium requires 1.12.90, and kube-ovn requires 1.9.0; kube-ovn, however, uses its own annotations:
     `ovn.kubernetes.io/ingress_rate` : rate limit for ingress traffic, in Mbits/s
     `ovn.kubernetes.io/egress_rate` : rate limit for egress traffic, in Mbits/s

    3. The bandwidth limits in the annotations cannot be updated dynamically; after a change the pod must be deleted and recreated.

Therefore, a webhook is needed to align these settings with the semantics of a namespace-level LimitRange, and to support default injection.

Implementation Approach

First, describe the namespace-scoped LimitRange extension with a CRD.

The design is as follows:

apiVersion: custom.xxx.com/v1
kind: CustomLimitRange
metadata:
  name: test-rangelimit
spec:
  limitrange:
    type: pod      # limits pods for now; may later be extended to container, ingress, and service types
    max:           # max and min bound the allowed range; the ValidatingAdmissionWebhook rejects pods whose values fall outside it
      ingress-bandwidth: "1G"
      egress-bandwidth: "1G"
    min:
      ingress-bandwidth: "10M"
      egress-bandwidth: "10M"
    default:          # if default is set and a pod's annotations are empty, the MutatingAdmissionWebhook injects these values; with no default, nothing is injected
      ingress-bandwidth: "128M"
      egress-bandwidth: "128M"

A pod can set `customlimitrange.kubernetes.io/limited: disable` to opt out of the namespace's CustomLimitRange limits.

Note that validation of the CustomLimitRange object itself is essential:

  • max value >= default value >= min value
  • values must lie in [1k, 1P], with units of Kbits/sec, Mbits/sec, Gbits/sec, Tbits/sec, or Pbits/sec
  • type must be one of the allowed enum values
  • max, min, and default may each be omitted
  • internal adaptation: the kube-ovn annotations

Usage

    1. Add the annotations to the Pod or Deployment:
# Pod
apiVersion: v1
kind: Pod
metadata:
  name: xxxx
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
...


# Deployment
...
 spec:
  template:
    metadata:
      # add the bandwidth annotations
      annotations:
        kubernetes.io/ingress-bandwidth: 1M
        kubernetes.io/egress-bandwidth: 1M
...
    2. Define a CustomLimitRange to inject the annotations automatically, as described above.

Next Chapter

Disk I/O: blkio device IOPS traffic control
