Backport of [15090] Ensure no leakage of evaluations for batch jobs. into release/1.4.x #15989
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport
This PR is auto-generated from #15097 to be assessed for backporting due to the inclusion of the label backport/1.4.x.
The below text is copied from the body of the original PR.
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in
false
, which leads to memory leakage as reported here:#15090
#14842
#4532
The comments within the code and tests are self-contradictory. Some state that evals of batch jobs should never be GCed others claim that they should be GCed if a new job version is created (as determined by comparing the indexes). I believe that the latter is the expected state.
Thus, we compare the creation index of the allocation with the modification index of the job to determine whether or not an alloc belongs to the current job version or not.
This pull request specifically:
alloc.Job
which is unsafe given the deserialization of allocs from memDB which may include invalid pointersAny and all feedback welcome.
For more details, please see the issues linked. Specifically #15090 describes the issue in detail.
EDIT:
I am assuming here that an update to the job results in an eval and consequently allocs for the job. If this is not correct than we must take into account the ordering of evals/allocs which has not been done prior.