Invoke DELETE on pod prepare-downscale path if any POSTs failed #146
resp, err := client.Do(req)
if err != nil {
	level.Error(logger).Log("msg", fmt.Sprintf("error sending HTTP %s request", method), "err", err)
	return err
This line was absent from this routine before. I'm 90% sure that was a bug.
You mean line 477 above, right?
👍 Ah yeah, it was the one after http.NewRequest, not client.Do.
rollout-operator/pkg/admission/prep_downscale.go
Lines 481 to 484 in ea17193
req, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://"+ep.url, nil)
if err != nil {
	level.Error(logger).Log("msg", "error creating HTTP POST request", "err", err)
}
Let's wait for @pstibrany's review here. He's the expert in this area.
A couple nits, but otherwise LGTM. Thank you!
I'd also like @pstibrany to take a look to make sure it doesn't conflict with anything he's been working on.
logger.SetSpanAndLogTag("url", ep.url)
logger.SetSpanAndLogTag("index", ep.index)
const maxGoroutines = 32
Is this just to keep things in check, or is there a specific reason for 32?
Nothing magic about 32, just to smooth out potentially spiky network IO.
pkg/admission/prep_downscale_test.go
Outdated
	assert.NoError(t, err)
}
if c.lastPostsFail > 0 {
	assert.Greater(t, postCalls.Load(), int32(0))
If I understand correctly, we can't test against a specific value here because once the errGroup context is cancelled we'll stop sending POSTs entirely, and we don't know how many will have been in flight (and return errors) before that happens, right?
I went back in and tightened this up a little more, as we can expect at least endpoints-failures POSTs. But yep, you are right about why we can't do an equality check.
Looks good, thank you.
I checked how the Mimir ingester behaves on the DELETE method when using ingest storage: it simply doesn't support that option (it returns an error) and will stay "prepared for downscale", which means it will unregister from rings and flush storage. When using ingest storage, prepare-for-shutdown should only be called after a period in which the ingester has not been receiving any data. If the pod comes back, it will register back and activate the partition. Overall, I think this is fine.
g, ectx := errgroup.WithContext(ctx)
g.SetLimit(maxGoroutines)
for _, ep := range eps {
	ep := ep
nit: we don't need this in Go 1.22 anymore (we use Go 1.22 in rollout-operator). Same comment in the next for-loop.
This makes me a little paranoid as I know others are using this tool. We specify 1.22 in our go.mod but IIUC you can still build and run with an older compiler. And if you don't run the tests it'll probably be race central. :)
Unless you know something that can lessen my paranoia, I think I'll leave it.
"We specify 1.22 in our go.mod but IIUC you can still build and run with an older compiler."
It used to be like that, but now it's mandatory. From https://go.dev/doc/modules/gomod-ref#go:
"The go directive sets the minimum version of Go required to use this module. Before Go 1.21, the directive was advisory only; now it is a mandatory requirement: Go toolchains refuse to use modules declaring newer Go versions."
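For readers following the ep := ep exchange, here is a small sketch of the pattern in question. The collect function and the pod names are made up for illustration. With the explicit copy, each goroutine sees its own endpoint regardless of which loop variable semantics the module is built with.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect starts one goroutine per endpoint and records which endpoint each
// goroutine saw. The explicit copy makes the result independent of the
// per-iteration loop variable semantics introduced in Go 1.22.
func collect(eps []string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen []string
	)
	for _, ep := range eps {
		ep := ep // redundant from Go 1.22 on, needed with older semantics
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			seen = append(seen, ep)
		}()
	}
	wg.Wait()
	sort.Strings(seen) // goroutine order is not deterministic
	return seen
}

func main() {
	fmt.Println(collect([]string{"pod-0", "pod-1", "pod-2"}))
}
```

Keeping the copy costs nothing at runtime, so leaving it in is a safe choice even where the go directive already guarantees the newer semantics.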
Add a changelog entry for #146, and prepare changelog for v0.16.0. Co-authored-by: Patryk Prus
This addresses a bug in rollout-operator where:
[...] X hosts. [...] X pods for shutdown by sending an HTTP POST to their handler identified by the grafana.com/prepare-downscale-http-path and -port annotations.

This PR adds cleanup logic to issue DELETE requests on all involved pods if any of the POSTs failed. Notes:
- DELETE calls are attempted once.
- DELETE failures are logged but otherwise ignored.
- [...] DELETE on all of the pods involved in the scaledown operation, not just ones that received a POST.

This doesn't fix the similar issue where replica count changing from 10->9->10 leaves that one pod prepared for shutdown. (But that's in the works.)