-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simpler top level status fields for total, succeeded and failed jobs #545
Comments
Is it not possible to use Conditions for this purpose? We emit a conditon with the job is successful or failed and you could check that. |
The reason we have an array of this is because replicated jobs can all finish at different times so we can't really flatten the array to report jobs without that information. |
In some cases, it is not possible to use array notations and filters for specific conditions. In the case of Argo workflows, they use a label selector to parse the success and failure conditions (which I see only works with Is it possible to have |
The concern is that I think that would be insufficient for all use cases. If you had a lot of replicated jobs you may have some of the jobs finishing while others are not. say JobSet with rep a, rep b, and rep c jobs starting in parallel. User really wants all of rep c to finish to declare the job successful but rep a and rep b finish and user uses a count based approach. This would end up with a case where you would declare the job successfully (if totalJobs is greater than value suggested). If one specified success_jobs = 2. Other counter point is that we don't require certain replicated jobs to finish (say driver/worker where workers finish). In this case our job is successfully but the driver doesn't technically finish. I think the condition is going to be the reliable approach. |
@ahg-g @danielvegamyhre WDYT? |
The goal wouldn't be to remove existing |
Maybe we can have a jobset's own status, so that argo workflow can also use this status to determine whether it is completed. NAME RESTARTS COMPLETED SUSPENDED AGE. Status
failure-policy 3 67m Running
network-jobset 54m Succeed
paralleljobs True 55m. Failed |
Sure.. Having a Jobset's own top level status field simply indicating whether it is Running, Succeeded, Failed (or |
I think a phase field in status would be useful. I was surprised we don't actually have one. WDYT of adding a phase field in JobSet.Status that has the options of "Running, Succeeded, Failed or Suspended"? |
Yeah I agree. This would be useful for higher level tools which use JobSet as a building block, like XPK. |
Nice idea @googs1025! |
Conditions is the new "Phase". But if phase makes it easier to consume, I am fine with that. |
If there is a jobset with multiple jobs and say some of them are running while the others have failed, what would the status of the jobset be: Running? Failed? One benefit of simply having the numbers (totalJobs, succeeded, failed) is that it might slightly less ambiguous and it leaves the caller to decide how to interpret the numbers. And if they care specifically about which job succeeded or failed, they can look at |
It would depend on the success policy and failure policy defined in the JobSet. Defaults are:
These policies are configurable though. |
basically phase will be set once the JobSet reaches a terminal state |
@ahg-g is correct. "Conditions" are the reliable option for most kubernetes controllers. Job uses conditions to detect certain states and other workflows support that. It would be ideal to work with argo-workflows on supporting conditions for extended resources. For JobSet I think a phase is fine but I can't say the same for other projects. Happy to support the jobset change though. |
Just to make this concrete, the new status field will be called
|
@shrinandj Are you planning on merging the jobset integration with argo workflow upstream? I am looking into implementing something similar. |
I haven't been able to get to this yet. If you want to go ahead with the implementation, I can help review it and test it and provide any other feedback. Just to ensure we're on the same page: Argo can create any K8s resources based on a yaml/json file. But it needs a way to easily read back the status of the created object. This issue is to make that possible by having a simple high level status. So, by "jobset integration with argo workflow", you mean adding this status field in the jobset object, correct? |
I am supportive of this change if someone would like to take on this task, and we can track it for 0.7.0 |
@jessicaxiejw Let me know when you have something ready to test. I can certainly take it out for a spin and provide feedback. |
@jessicaxiejw if you want to implement this that would be super helpful! Please go ahead and assign yourself if you plan on working on this, or let us know if not and we can find another owner. |
Sorry about the late response. I am not working on it right now 😅 @shrinandj feel free taking a stab at it. We decided to go with a completely different direction. |
/assign |
I want to try it. // the status of JobSet
const (
// JobSetRunning means the job is running.
JobSetRunning JobSetConditionType = "Running"
// JobSetCompleted means the job has completed its execution.
JobSetCompleted JobSetConditionType = "Completed"
// JobSetFailed means the job has failed its execution.
JobSetFailed JobSetConditionType = "Failed"
// JobSetSuspended means the job is suspended.
JobSetSuspended JobSetConditionType = "Suspended"
) |
What would you like to be added:
Currently, the
status
of a jobset object consists only of two arrays:conditions
andreplicatedJobsStatus
. This is great for providing detailed status of individual conditions and replicated jobs.However, it makes referencing the overall status of the jobset a little harder.
Along with the current
conditions
andreplicatedJobsStatus
fields, it will be useful to have at least 3 additional simple fields in the status indicating the overall status of the jobset, similar to K8s jobs:Why is this needed:
We were trying to run Jobsets via Argo-workflows as part of a distributed training setup. Argo-workflows can create Kubernetes resources. In order to do this correctly, Argo allows users to specify a
successCondition
andfailureCondition
. These conditions need to be simplekey=value
conditions. Referencing array elements, filtering for specific conditions, etc. is not possible.If the new fields are available, adding them to Argo or even using any other K8s tools would become that much easier:
The text was updated successfully, but these errors were encountered: