TIPS: Troubleshooting Mixed-Up Volcano JobFlow Status

Posted by 董江 on Thursday, March 13, 2025

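Background

When several JobFlows run concurrently and all reference the same JobTemplate, it is fairly common for the status of each JobFlow to show information about multiple JobFlows and multiple VcJobs. The two kinds of information are explained below: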

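  • Multiple JobFlows

    Each concurrent JobFlow has its own identity, execution path, and status. Although they reference the same JobTemplate, at run time they are independent objects with their own lifecycles, so the status records describe each JobFlow separately and each one can be monitored and managed on its own.

  • Multiple VcJobs

    A VcJob is a concrete execution unit (a subtask) of a JobFlow. A JobFlow is usually designed as a series of steps or stages, and each stage maps to one or more VcJobs, so every concurrently running JobFlow may contain several VcJobs.

For example, a JobFlow might have three stages (reading data, processing data, and writing results), each of which is a VcJob. Different JobFlows that reference the same template still run their VcJobs at different times and with different states, so the VcJobs of each JobFlow need to be recorded separately in order to track the overall execution accurately.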

Error output

$ kubectl describe jf test-b
Name:         test-b
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  flow.volcano.sh/v1alpha1
Kind:         JobFlow
Metadata:
  Creation Timestamp:  2025-03-13T09:59:06Z
  Generation:          1
  Resource Version:    8057
  UID:                 0c67033b-f831-487c-8ca2-fb4ee1ff1892
Spec:
  Flows:
    Name:  a
    Depends On:
      Targets:
        a
    Name:  b
    Depends On:
      Targets:
        b
    Name:  c
    Depends On:
      Targets:
        b
    Name:  d
    Depends On:
      Targets:
        c
        d
    Name:             e
  Job Retain Policy:  retain
Status:
  Completed Jobs:     ---> jobs from multiple JobFlows are listed here
    test-a-a
    test-a-c
    test-b-c
    test-b-d
    test-a-e
    test-b-e
    test-b-a
    test-a-b
    test-b-b
    test-a-d
  Conditions:
    Test - A - A:
.....
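
To make the mix-up in Completed Jobs easier to see, the short Go sketch below groups the job names from the output above by the JobFlow that owns them. It assumes, based only on the names visible in this output, that each VcJob is named `<JobFlow name>-<flow name>`. The helper `groupByJobFlow` is hypothetical and not part of Volcano.

```go
package main

import (
	"fmt"
	"strings"
)

// groupByJobFlow buckets VcJob names by the JobFlow whose name prefixes them.
// Assumption: VcJobs are named "<jobflow>-<flow>", as the output above suggests.
func groupByJobFlow(jobs, jobFlows []string) map[string][]string {
	grouped := make(map[string][]string)
	for _, job := range jobs {
		for _, jf := range jobFlows {
			if strings.HasPrefix(job, jf+"-") {
				grouped[jf] = append(grouped[jf], job)
				break
			}
		}
	}
	return grouped
}

func main() {
	// Completed Jobs as reported in the status of test-b.
	completed := []string{
		"test-a-a", "test-a-c", "test-b-c", "test-b-d", "test-a-e",
		"test-b-e", "test-b-a", "test-a-b", "test-b-b", "test-a-d",
	}
	grouped := groupByJobFlow(completed, []string{"test-a", "test-b"})
	for _, jf := range []string{"test-a", "test-b"} {
		fmt.Printf("%s: %d jobs %v\n", jf, len(grouped[jf]), grouped[jf])
	}
}
```

Only five of the ten entries carry the test-b prefix; the other five were created by test-a and should not appear in the status of test-b.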

Reproducing the problem

1. Create the job templates

apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: a
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 60s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 64m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: b
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 40s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 100m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: c
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 30s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 64m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: d
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 10s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 64m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: e
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 10s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 100m
          restartPolicy: OnFailure

2. Create a JobFlow named test-a

apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: test-a
  namespace: default
spec:
  jobRetainPolicy: retain  # retain or delete
  flows:
    - name: a
    - name: b
      dependsOn:
        targets: ['a']
    - name: c
      dependsOn:
        targets: ['b']
    - name: d
      dependsOn:
        targets: ['b']
    - name: e
      dependsOn:
        targets: ['c','d']
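
The dependsOn entries above form a small dependency graph. As a rough illustration of the order in which the flows become runnable, the Go sketch below groups them into waves with a simple level-by-level topological sort. This is not Volcano's scheduling code; the function name `levels` and the map-based representation are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// levels groups flows into waves: every flow in a wave depends only on
// flows from earlier waves. deps maps a flow name to the flows it depends on.
// It returns nil if the dependencies cannot be resolved (cycle or unknown target).
func levels(deps map[string][]string) [][]string {
	done := make(map[string]bool)
	var result [][]string
	for len(done) < len(deps) {
		var wave []string
		for name, targets := range deps {
			if done[name] {
				continue
			}
			ready := true
			for _, t := range targets {
				if !done[t] {
					ready = false
					break
				}
			}
			if ready {
				wave = append(wave, name)
			}
		}
		if len(wave) == 0 {
			return nil // nothing can make progress
		}
		sort.Strings(wave) // map iteration order is random; sort for stable output
		for _, name := range wave {
			done[name] = true
		}
		result = append(result, wave)
	}
	return result
}

func main() {
	// Same dependencies as the test-a JobFlow above.
	deps := map[string][]string{
		"a": nil,
		"b": {"a"},
		"c": {"b"},
		"d": {"b"},
		"e": {"c", "d"},
	}
	fmt.Println(levels(deps)) // [[a] [b] [c d] [e]]
}
```

Under this model, c and d can run in parallel once b has finished, and e waits for both.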

3. At the same time, create a JobFlow named test-b

apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: test-b
  namespace: default
spec:
  jobRetainPolicy: retain  # retain or delete
  flows:
    - name: a
    - name: b
      dependsOn:
        targets: ['a']
    - name: c
      dependsOn:
        targets: ['b']
    - name: d
      dependsOn:
        targets: ['b']
    - name: e
      dependsOn:
        targets: ['c','d']

Symptom: the VcJob lists created by JobFlow A (test-a) and JobFlow B (test-b) are fully or partially mixed together in the status of both JobFlows.

Solution

When several JobFlows live in the same namespace, the controller fetches the wrong VcJob list for each of them.

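 // Collect an identifier for every flow (JobTemplate) referenced by this JobFlow.
 var flowNames []string
 for _, flow := range jobFlow.Spec.Flows {
 	flowNames = append(flowNames, GenerateObjectString(jobFlow.Namespace, flow.Name))
 }
 // Match VcJobs by the createdByJobTemplate label only; nothing here
 // restricts the result to VcJobs created by this particular JobFlow.
 selector := labels.NewSelector()
 r, err := labels.NewRequirement(CreatedByJobTemplate, selection.In, flowNames)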

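When the controller uses the volcano.sh/createdByJobTemplate label to list VcJobs in a namespace, and several of those VcJobs were created from the same JobTemplate but by different JobFlows, the query also returns VcJob instances that belong to the other JobFlows.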
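
The effect of that selector can be reproduced without a cluster. The Go sketch below models VcJobs as plain structs with labels and applies the same "label value is in a set" match. The second filter shows one possible way to scope the result to a single JobFlow. The label key `volcano.sh/createdByJobFlow`, the `<namespace>.<name>` label value format, and all type and function names are assumptions for illustration; they are not necessarily what Volcano uses or how an upstream fix is implemented.

```go
package main

import "fmt"

const (
	createdByJobTemplate = "volcano.sh/createdByJobTemplate" // label named in this article
	createdByJobFlow     = "volcano.sh/createdByJobFlow"     // assumed label key
)

// vcJob is a simplified stand-in for a Volcano Job object.
type vcJob struct {
	Name   string
	Labels map[string]string
}

// selectIn mimics a label selector requirement "key in (values...)".
func selectIn(jobs []vcJob, key string, values ...string) []vcJob {
	allowed := make(map[string]bool, len(values))
	for _, v := range values {
		allowed[v] = true
	}
	var out []vcJob
	for _, j := range jobs {
		if v, ok := j.Labels[key]; ok && allowed[v] {
			out = append(out, j)
		}
	}
	return out
}

// newJob builds the VcJob that a JobFlow would create from a template.
func newJob(jobFlow, template string) vcJob {
	return vcJob{
		Name: jobFlow + "-" + template,
		Labels: map[string]string{
			createdByJobTemplate: "default." + template, // assumed value format
			createdByJobFlow:     "default." + jobFlow,  // assumed value format
		},
	}
}

func main() {
	var jobs []vcJob
	for _, jf := range []string{"test-a", "test-b"} {
		for _, tpl := range []string{"a", "b"} {
			jobs = append(jobs, newJob(jf, tpl))
		}
	}

	// Template-only selection: jobs of both JobFlows match.
	byTemplate := selectIn(jobs, createdByJobTemplate, "default.a", "default.b")
	fmt.Println(len(byTemplate)) // 4

	// Additionally requiring the owning JobFlow keeps only test-b's jobs.
	scoped := selectIn(byTemplate, createdByJobFlow, "default.test-b")
	fmt.Println(len(scoped)) // 2
}
```

Filtering on the owning JobFlow as well as the template is only one option; checking each VcJob's owner reference against the JobFlow would serve the same purpose.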
