Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of [15090] Ensure no leakage of evaluations for batch jobs. into release/1.3.x #15988

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #15097 to be assessed for backporting due to the inclusion of the label backport/1.3.x.

WARNING automatic cherry-pick of commits failed. Commits will require human attention.

The below text is copied from the body of the original PR.


Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to memory leakage as reported here:

#15090
#14842
#4532

The comments within the code and tests are self-contradictory. Some state that evals of batch jobs should never be GCed others claim that they should be GCed if a new job version is created (as determined by comparing the indexes). I believe that the latter is the expected state.

Thus, we compare the creation index of the allocation with the modification index of the job to determine whether or not an alloc belongs to the current job version or not.

This pull request specifically:

  1. Removes references to alloc.Job which is unsafe given the deserialization of allocs from memDB which may include invalid pointers
  2. Fixes the logical unity described above (compare creation with modification and not creation with creation)
  3. Fixes the test (the test breaks the assumption of alloc reparenting, thus testing the mock rather than production code)
  4. Adds more tests

Any and all feedback welcome.

For more details, please see the issues linked. Specifically #15090 describes the issue in detail.

EDIT:
I am assuming here that an update to the job results in an eval and consequently allocs for the job. If this is not correct than we must take into account the ordering of evals/allocs which has not been done prior.

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/15090-infinite-memory-growth/terribly-integral-pigeon branch 2 times, most recently from 9289fd1 to f011c23 Compare January 31, 2023 18:34
Prior to 2409f72 the code compared the modification index of a job to
itself. Afterwards, the code compared the creation index of the job to
itself. In either case there should never be a case of re-parenting of allocs
causing the evaluation to trivially always result in false, which leads to
unreclaimable memory.

Prior to this change allocations and evaluations for batch jobs were never
garbage collected until the batch job was explicitly stopped. The new
`batch_eval_gc_threshold` server configuration controls how often they are
collected. The default threshold is `24h`.
@tgross tgross force-pushed the backport/15090-infinite-memory-growth/terribly-integral-pigeon branch from f011c23 to 8945aaf Compare January 31, 2023 18:42
@tgross
Copy link
Member

tgross commented Jan 31, 2023

Fixed the backport by bringing the getThreshold helper back to 1.3.x

@tgross tgross merged commit 691a904 into release/1.3.x Jan 31, 2023
@tgross tgross deleted the backport/15090-infinite-memory-growth/terribly-integral-pigeon branch January 31, 2023 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants