state store: better handling of job deletion #19609
This looks generally good but it's worth trying to reason through what happens if there are allocations on a client that's disconnected when the job is purged. I'm not sure that it's safe to delete allocations until we know they're terminal?
If these are failed allocs, they remain in
Yeah so this to me is more about the intended behavior of the The behavior introduced in this PR has one unfortunate side effect: if a client is disconnected while the job gets purged, and if it's a failing job (with |
Some small comments, but one general observation is that we usually have DeleteX() and DeleteXTxn() methods. I think it would be nice to keep this pattern here for deployments and allocs.
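For readers unfamiliar with the DeleteX() / DeleteXTxn() pairing mentioned above, here is a minimal sketch of the idea. It is not Nomad's code: `Store`, `Txn`, `WriteTxn`, and `Commit` are hypothetical stand-ins for a transactional state store, and only the method naming follows the comment.

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a toy in-memory store keyed by allocation ID (hypothetical).
type Store struct {
	allocs map[string]bool
}

// Txn is a hypothetical write transaction that buffers deletions
// until Commit is called.
type Txn struct {
	store   *Store
	deleted []string
}

// WriteTxn opens a new write transaction against the store.
func (s *Store) WriteTxn() *Txn { return &Txn{store: s} }

// Commit applies every buffered deletion to the store.
func (t *Txn) Commit() {
	for _, id := range t.deleted {
		delete(t.store.allocs, id)
	}
}

// DeleteAllocTxn does the actual work inside a caller-owned transaction,
// so a larger operation (such as deleting a job) can reuse it.
func (s *Store) DeleteAllocTxn(txn *Txn, id string) error {
	if !s.allocs[id] {
		return errors.New("alloc not found: " + id)
	}
	txn.deleted = append(txn.deleted, id)
	return nil
}

// DeleteAlloc is the convenience wrapper: it opens its own transaction,
// delegates to DeleteAllocTxn, and commits only on success.
func (s *Store) DeleteAlloc(id string) error {
	txn := s.WriteTxn()
	if err := s.DeleteAllocTxn(txn, id); err != nil {
		return err // the uncommitted transaction is simply dropped
	}
	txn.Commit()
	return nil
}

func main() {
	s := &Store{allocs: map[string]bool{"a1": true, "a2": true}}
	fmt.Println(s.DeleteAlloc("a1"), len(s.allocs))
}
```

The benefit of the split is that the delete logic lives in one place: standalone callers use the wrapper, while composite operations pass their own transaction to the Txn variant.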
LGTM once @lgfa29's comments are resolved
Not sure I understand, are you suggesting to rename |
Nope, I was suggesting having:
func (s *StateStore) DeleteJobTxn(index uint64, namespace, jobID string, txn Txn) error {
	// ...
	for _, deployment := range deployments {
		err := s.DeleteDeploymentTxn(txn, deployment)
		// ...
	}
	// ...
	for _, alloc := range allocs {
		err := s.DeleteAllocTxn(txn, alloc)
		// ...
	}
}
That way the deployment and alloc delete logic is shared and consistent across methods (for example, you wouldn't need to update their index table in |
oooh I get it now I think. Implemented in e918e3b, have a look if that's what you meant.
Last bit of nit-picking 😄
There's no need to wait for allocs since #19609; in fact, waiting for allocs to stop will always fail, leading to e2e failures.
When jobs are deleted with -purge, all their deployments and allocations should be deleted from the state store, and the evals' status should be set to complete. Otherwise we end up in a situation where users could re-submit previously failing jobs, but these new jobs would not get deployments allocated unless system gc got called.
Fixes #10502
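The intended purge semantics described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `State`, `Eval`, and `PurgeJob` are hypothetical names invented for the example and do not mirror the real state store schema.

```go
package main

import "fmt"

// Eval is a simplified evaluation record (hypothetical shape).
type Eval struct {
	JobID  string
	Status string
}

// State is a toy stand-in for the state store tables involved in a purge.
type State struct {
	Jobs        map[string]bool
	Deployments map[string]string // deployment ID -> job ID
	Allocs      map[string]string // alloc ID -> job ID
	Evals       map[string]*Eval  // eval ID -> eval
}

// PurgeJob removes the job together with its deployments and allocations,
// and marks the job's evaluations complete so that nothing stale is left
// behind to interfere with a later re-submission under the same ID.
func (s *State) PurgeJob(jobID string) {
	delete(s.Jobs, jobID)
	for id, owner := range s.Deployments {
		if owner == jobID {
			delete(s.Deployments, id)
		}
	}
	for id, owner := range s.Allocs {
		if owner == jobID {
			delete(s.Allocs, id)
		}
	}
	for _, e := range s.Evals {
		if e.JobID == jobID {
			e.Status = "complete"
		}
	}
}

func main() {
	s := &State{
		Jobs:        map[string]bool{"web": true},
		Deployments: map[string]string{"d1": "web"},
		Allocs:      map[string]string{"a1": "web", "a2": "web"},
		Evals:       map[string]*Eval{"e1": {JobID: "web", Status: "pending"}},
	}
	s.PurgeJob("web")
	fmt.Println(len(s.Deployments), len(s.Allocs), s.Evals["e1"].Status)
}
```

In this model, records belonging to other jobs are left untouched, which is the property the purge path needs to preserve.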