Troubleshooting mixed-up status in Volcano JobFlow
Background
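When several JobFlows run concurrently and all reference the same JobTemplate, it is fairly common for each JobFlow's status to show information about multiple JobFlows and multiple VcJobs. The two cases are explained below: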
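- Multiple JobFlow entries
  Each concurrent JobFlow has its own identity, execution path, and status. Although they reference the same JobTemplate, they run as independent objects with their own lifecycles, so the status records list each JobFlow separately, which allows each one to be monitored and managed on its own.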
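- Multiple VcJob entries
  A VcJob is the concrete execution unit behind a JobFlow. A JobFlow is usually designed as a series of steps or stages, and each stage corresponds to one or more VcJobs, so a single JobFlow can own several VcJobs.
  For example, a JobFlow might consist of three stages (read data, process data, write results), each of which is one VcJob. JobFlows that reference the same templates still run their own VcJobs, whose start times and states can differ, so the VcJobs of each JobFlow have to be recorded separately in order to track the overall execution accurately.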
Error output
$ kubectl describe jf test-b
Name:         test-b
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  flow.volcano.sh/v1alpha1
Kind:         JobFlow
Metadata:
  Creation Timestamp:  2025-03-13T09:59:06Z
  Generation:          1
  Resource Version:    8057
  UID:                 0c67033b-f831-487c-8ca2-fb4ee1ff1892
Spec:
  Flows:
    Name:  a
    Depends On:
      Targets:
        a
    Name:  b
    Depends On:
      Targets:
        b
    Name:  c
    Depends On:
      Targets:
        b
    Name:  d
    Depends On:
      Targets:
        c
        d
    Name:  e
  Job Retain Policy:  retain
Status:
  Completed Jobs:    ---> jobs from multiple JobFlows appear here
    test-a-a
    test-a-c
    test-b-c
    test-b-d
    test-a-e
    test-b-e
    test-b-a
    test-a-b
    test-b-b
    test-a-d
  Conditions:
    Test - A - A:
    .....
Reproducing the issue
1. Create the job templates
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: a
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 60s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 64m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: b
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 40s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 100m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: c
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 30s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 64m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: d
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 10s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 64m
          restartPolicy: OnFailure
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: e
spec:
  minAvailable: 1
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
  plugins:
    ssh: []
    env: []
    svc: []
  maxRetry: 5
  queue: default
  tasks:
    - replicas: 1
      name: "default-nginx"
      template:
        metadata:
          name: web
        spec:
          containers:
            - image: nginx:1.14-alpine
              command:
                - sh
                - -c
                - sleep 10s
              imagePullPolicy: IfNotPresent
              name: nginx
              resources:
                requests:
                  cpu: 100m
          restartPolicy: OnFailure
2. Create a JobFlow named test-a
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: test-a
  namespace: default
spec:
  jobRetainPolicy: retain # retain or delete
  flows:
    - name: a
    - name: b
      dependsOn:
        targets: ['a']
    - name: c
      dependsOn:
        targets: ['b']
    - name: d
      dependsOn:
        targets: ['b']
    - name: e
      dependsOn:
        targets: ['c','d']
3. At the same time, create a second JobFlow named test-b
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: test-b
  namespace: default
spec:
  jobRetainPolicy: retain # retain or delete
  flows:
    - name: a
    - name: b
      dependsOn:
        targets: ['a']
    - name: c
      dependsOn:
        targets: ['b']
    - name: d
      dependsOn:
        targets: ['b']
    - name: e
      dependsOn:
        targets: ['c','d']
Symptom:
The VcJobs created by JobFlow B (test-b) and JobFlow A (test-a) show up fully or partially mixed together in the status of both JobFlows.
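Resolution
The controller goes wrong when it lists the VcJobs that belong to different JobFlows in the same namespace. It builds its label selector as follows: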
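// Collect one selector value per flow in this JobFlow.
var flowNames []string
for _, flow := range jobFlow.Spec.Flows {
	// The value is built from the namespace and the flow name only;
	// the name of the JobFlow itself is not part of it.
	flowNames = append(flowNames, GenerateObjectString(jobFlow.Namespace, flow.Name))
}
selector := labels.NewSelector()
// Match every VcJob whose CreatedByJobTemplate label is one of flowNames.
r, err := labels.NewRequirement(CreatedByJobTemplate, selection.In, flowNames)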
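When VcJobs in the same namespace are listed by the volcano.sh/createdByJobTemplate label alone, and several JobFlows have created VcJobs from the same JobTemplates, the result also contains VcJob instances that belong to other JobFlows.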
