[15090] Ensure no leakage of evaluations for batch jobs. #15097
Conversation
cc: @jrasell
I need to test this a little bit more, but I believe that the current code actually references relatively random memory as a result of dereferencing pointers which are serialized and de-serialized while the underlying object is non-static (that is: relocatable). In particular, the de-serialization proceeds by using the eval ID as an index (https://github.com/hashicorp/nomad/blob/main/nomad/core_sched.go#L294). The GC logic is quite careful not to address pointers in structs except for this one place.
EDIT: It seems that my hypothesis was correct. I have deployed the change to a test cluster and observed that the correct GC timeouts on evals are now applied to batch jobs and we do not start allocations when we shouldn't. I will modify the fix here and open it up for review tomorrow.
… of allocations that belong to batch jobs. Add Eval GC to batch job GC. Add tests.
Any progress on merging this? We're having huge problems in our environment with this!
At this point we're waiting for the Nomad folks to reproduce and reason about the initial rationale for the existing code. Given that it's not uncommon for bugs to hide bugs, I appreciate the scrutiny. I hope we will hear back soon. Take a look at the issue attached to the PR for the discussion so far.
…ows us to control the pace of GC of Batch Evals independent of other evals.
Any chance of this making it into the 1.4.x series as well, in addition to the 1.5.0 release?
Sorry for the delay on this... Luiz is on a well-deserved holiday, so I'm going to pick up the review from here with Seth. We're juggling a couple of other things at the moment, but this is definitely important, so expect we'll follow through soonish. Thanks for your patience. @shantanugadgil this is a bug fix, so it'll get backported two major versions (i.e. if we release it in 1.5.0 it'll get backported to 1.4.x and 1.3.x).
@tgross Thanks for the update. 🙏 Looking forward to the backported releases of 1.4.x. 👍 |
Hi @stswidwinski! Thanks for your patience on this. I've given it a review and I've left a few comments to tighten up the code that'll hopefully make it easier for the next folks who read this 😁
But I feel pretty good about this approach overall. I'm getting towards the end of my week here so I'm going to give this one more pass early next week and hopefully we can wrap it up then. Thanks!
nomad/core_sched_test.go
Outdated
// An EvalGC should reap allocations from jobs with a newer modify index and reap the eval itself
// if all allocs are reaped.
func TestCoreScheduler_EvalGC_Batch_OldVersionReapsEval(t *testing.T) {
These tests have a lot of setup code (as I'm sure you noticed!) and the assertion is only subtly different than the previous one. It might make the whole test more understandable if we collapsed these into a single test and had this one be the final assertion that the eval is GC'd once the conditions are right.
Well, I have actually spent quite a bit of time unravelling these test structures. The reason why collapsing these into a single test is non-trivial is the way in which the time table is used, and the fact that there is no clock whose reading we can set to a particular time. In essence, `time.Now()` is used throughout, which always gives the current time without the ability to set the clock. When we insert time events into the time table, we must do so from oldest to newest for the time table to allow us to select older events (the sorting logic is: "give me the last event that occurred past time X").
As such, I resorted to a forceful reset of the time table for each logical test sharing the setup.
Please see the changes.
Sounds great. Acking the comment and I'll come back to fix up the code early next week.
------------------------------
In nomad/core_sched.go
<#15097 (comment)>:
> +// olderVersionTerminalAllocs returns a tuplie ([]string, bool). The first element is the list of
+// terminal allocations which may be garbage collected for batch jobs. The second element indicates
+// whether or not the allocation itself may be garbage collected.
⬇️ Suggested change
-// olderVersionTerminalAllocs returns a tuplie ([]string, bool). The first element is the list of
-// terminal allocations which may be garbage collected for batch jobs. The second element indicates
-// whether or not the allocation itself may be garbage collected.
+// olderVersionTerminalAllocs returns a tuple ([]string, bool). The first element is the list of
+// terminal allocations which may be garbage collected for batch jobs. The second element indicates
+// whether or not the evaluation itself may be garbage collected.
Also, the caller can determine if it's safe to GC the eval because the length of the allocs parameter has to match the length of the return value, right? And if there are no allocs left, is it safe to just go ahead and GC it without calling this function?
------------------------------
In command/agent/config.go
<#15097 (comment)>:
> EvalGCThreshold string `hcl:"eval_gc_threshold"`
+ // BatchEvalGCThreshold controls how "old" an evaluation must be to be eligible
+ // for GC if the eval belongs to a batch job.
+ BatchEvalGCThreshold string `hcl:"batch_eval_gc_threshold"`
There's a ParseDuration call we'll need to make in command/agent/agent.go as well. See the equivalent for EvalGCThreshold at agent.go#L359-L365
<https://github.com/hashicorp/nomad/blob/v1.4.3/command/agent/agent.go#L359-L365>
for an example.
------------------------------
In nomad/core_sched_test.go
<#15097 (comment)>:
> + if err != nil {
+ t.Fatalf("err: %v", err)
+ }
+ if out == nil {
+ t.Fatalf("bad: %v", out)
+ }
We're migrating over to https://github.com/shoenig/test as we touch old
tests and write new ones. For all the new assertions, let's use that.
⬇️ Suggested change
- if err != nil {
- t.Fatalf("err: %v", err)
- }
- if out == nil {
- t.Fatalf("bad: %v", out)
- }
+ must.NoError(t, err)
+ must.NotNil(t, out)
------------------------------
In nomad/core_sched_test.go
<#15097 (comment)>:
> if err != nil {
t.Fatalf("err: %v", err)
}
- // Update the time tables to make this work
- tt := s1.fsm.TimeTable()
- tt.Witness(2000, time.Now().UTC().Add(-1*s1.config.EvalGCThreshold))
+ // A little helper for assertions
+ assertCorrectEvalAlloc := func(
+ ws memdb.WatchSet,
+ eval *structs.Evaluation,
+ allocsShouldExist []*structs.Allocation,
+ allocsShouldNotExist []*structs.Allocation,
+ ) {
This makes test outputs a lot more legible if a test fails:
⬇️ Suggested change
- ) {
+ ) {
+ t.Helper()
------------------------------
In nomad/core_sched_test.go
<#15097 (comment)>:
>
- alloc3.Job = job2
- alloc3.JobID = job2.ID
+ // Insert allocs with indexes older than job.ModifyIndex. Two cases:
For clarity, because of what we'll actually check:
⬇️ Suggested change
- // Insert allocs with indexes older than job.ModifyIndex. Two cases:
+ // Insert allocs with indexes older than job.ModifyIndex and job.JobModifyIndex. Two cases:
------------------------------
In nomad/core_sched.go
<#15097 (comment)>:
> for _, alloc := range allocs {
- if alloc.Job != nil && alloc.Job.CreateIndex < job.CreateIndex && alloc.TerminalStatus() {
+ if alloc.CreateIndex < job.JobModifyIndex && alloc.ModifyIndex < thresholdIndex && alloc.TerminalStatus() {
I think so. `(*Job).JobModifyIndex` is only updated when the user-facing version of the job is updated (as opposed to `(*Job).ModifyIndex`, which is updated whenever the job is updated in Raft for any reason, like being marked complete).
I've actually found a bug in the logic in which we would apply the regular GC timeout first and not GC evaluations that should be GCed if
evals in face of purge.
Force-pushed from e63cbc1 to 4609d8e.
This sounds great. I cherry-picked the changes on top.
Note: I'll add the changelog entries soon :-)
I've pulled down this branch and re-ran the bench testing I did previously. LGTM! Thanks for your patience in seeing this one through @stswidwinski!
🎆 Thank you for bearing with me!
Thanks for this work @stswidwinski!
I pushed a commit to expand the changelog entry a bit more since this will be a breaking change.
For the upgrade note, I will add it in a separate PR since this one will be backported.
Sorry, just noticed a mistake in my changes 😅 I pushed a commit to fix it.
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. Since allocs are never re-parented, in either case the comparison trivially always evaluates to false, which leads to unreclaimable memory. Prior to this change, allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how old a batch eval must be before it is eligible for collection; the default threshold is `24h`.
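Based on the commit message and the `hcl:"batch_eval_gc_threshold"` tag quoted in the review above, the new option would presumably be set alongside the existing `eval_gc_threshold` in the server block. A hedged sketch of such a configuration (the exact placement is an assumption mirroring the existing option):

```hcl
server {
  enabled = true

  # Existing knob controlling GC eligibility for non-batch evals.
  eval_gc_threshold = "1h"

  # New knob introduced by this PR; defaults to 24h when unset.
  batch_eval_gc_threshold = "24h"
}
```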
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. Since allocs are never re-parented, in either case the comparison trivially always evaluates to `false`, which leads to the memory leakage reported here: #15090, #14842, #4532.
The comments within the code and tests are self-contradictory. Some state that evals of batch jobs should never be GCed; others claim that they should be GCed if a new job version is created (as determined by comparing the indexes). I believe that the latter is the expected behavior.
Thus, we compare the creation index of the allocation with the modification index of the job to determine whether or not an alloc belongs to the current job version.
This pull request specifically:
- removes the use of `alloc.Job`, which is unsafe given that the deserialization of allocs from MemDB may include invalid pointers.

Any and all feedback welcome.
For more details, please see the issues linked. Specifically #15090 describes the issue in detail.
EDIT:
I am assuming here that an update to the job results in an eval and, consequently, allocs for the job. If this is not correct, then we must take into account the ordering of evals/allocs, which has not been done before.