Invoke DELETE on pod prepare-downscale path if any POSTs failed #146

seizethedave · 2024-05-09T22:28:40Z

This addresses a bug in rollout-operator where:

1. Kubernetes receives a request to downscale a statefulset by X hosts.
2. The prepare-downscale admission webhook attempts to prepare X pods for shutdown by sending an HTTP POST to their handler identified by the grafana.com/prepare-downscale-http-path and -port annotations.
3. At least one of these requests fails. The admission webhook returns an error to Kubernetes, so the downscale is not approved.
4. 💥 But some hosts may have been prepared for downscale. 💥
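
For context, here is a minimal sketch of how a prepare-downscale URL might be assembled from those annotations. The helper name `prepareURL`, the `host` argument, and the full port annotation name (expanded from the "-port" shorthand above) are assumptions made for illustration; the operator's actual lookup code may differ.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

const (
	pathAnnotation = "grafana.com/prepare-downscale-http-path"
	// Assumed expansion of the "-port" shorthand used in the description.
	portAnnotation = "grafana.com/prepare-downscale-http-port"
)

// prepareURL (hypothetical helper) builds the URL that receives the POST
// (prepare) or DELETE (undo) request for one pod.
func prepareURL(host string, annotations map[string]string) (string, error) {
	path, ok := annotations[pathAnnotation]
	if !ok {
		return "", fmt.Errorf("missing annotation %s", pathAnnotation)
	}
	port, ok := annotations[portAnnotation]
	if !ok {
		return "", fmt.Errorf("missing annotation %s", portAnnotation)
	}
	// Tolerate a path written with or without a leading slash.
	return "http://" + net.JoinHostPort(host, port) + "/" + strings.TrimPrefix(path, "/"), nil
}
```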

This PR adds cleanup logic to issue DELETE requests on all involved pods if any of the POSTs failed. Notes:

- DELETE calls are attempted once.
- DELETE failures are logged but otherwise ignored.
- For simplicity, we'll invoke DELETE on all of the pods involved in the scaledown operation, not just ones that received a POST.

This doesn't fix the similar issue where the replica count changing from 10 -> 9 -> 10 leaves that one pod prepared for shutdown. (But that's in the works.)

seizethedave · 2024-05-09T22:31:05Z

pkg/admission/prep_downscale.go

+	resp, err := client.Do(req)
+	if err != nil {
+		level.Error(logger).Log("msg", fmt.Sprintf("error sending HTTP %s request", method), "err", err)
+		return err


this line was absent from this routine before. I'm 90% sure that was a bug.

You mean line 477 above, right?

👍 Ah yeah, it was the one after http.NewRequest, not client.Do.

rollout-operator/pkg/admission/prep_downscale.go

Lines 481 to 484 in ea17193

req, err := http.NewRequestWithContext(ctx, http.MethodPost, "http://"+ep.url, nil)
if err != nil {
	level.Error(logger).Log("msg", "error creating HTTP POST request", "err", err)
}

pracucci · 2024-05-11T06:21:03Z

Let's wait for @pstibrany review here. He's the expert in this area.

pr00se

A couple nits, but otherwise LGTM. Thank you!

I'd also like @pstibrany to take a look to make sure it doesn't conflict with anything he's been working on.

pr00se · 2024-05-14T23:12:21Z

pkg/admission/prep_downscale.go


-			logger.SetSpanAndLogTag("url", ep.url)
-			logger.SetSpanAndLogTag("index", ep.index)
+	const maxGoroutines = 32


Is this just to keep things in check, or is there a specific reason for 32?

Nothing magic about 32, just to smooth out potentially spiky network IO.


pr00se · 2024-05-14T23:43:49Z

pkg/admission/prep_downscale.go

+	resp, err := client.Do(req)
+	if err != nil {
+		level.Error(logger).Log("msg", fmt.Sprintf("error sending HTTP %s request", method), "err", err)
+		return err


You mean line 477 above, right?

pr00se · 2024-05-14T23:51:06Z

pkg/admission/prep_downscale_test.go

+				assert.NoError(t, err)
+			}
+			if c.lastPostsFail > 0 {
+				assert.Greater(t, postCalls.Load(), int32(0))


If I understand correctly, we can't test against a specific value here because once the errGroup context is cancelled we'll stop sending POSTs entirely, and we don't know how many will have been in flight (and return errors) before that happens, right?

I went back in and tightened this up a little more, as we can expect at least endpoints-failures POSTs. But yep you are right about why we can't do an equality check.


pstibrany

Looks good, thank you.

I checked how the Mimir ingester behaves on the DELETE method when using ingest storage: it simply doesn't support that option (it returns an error) and will stay "prepared for downscale", which means it will unregister from rings and flush storage. When using ingest storage, prepare-for-shutdown should only be called after a period in which the ingester has not been receiving any data. If the pod comes back, it will register again and activate the partition. Overall, I think this is fine.


pstibrany · 2024-05-15T09:30:21Z

pkg/admission/prep_downscale.go

+	g, ectx := errgroup.WithContext(ctx)
+	g.SetLimit(maxGoroutines)
+	for _, ep := range eps {
+		ep := ep


nit: we don't need this in go 1.22 anymore (we use go 1.22 in rollout-operator). same comment in next for-loop.

This makes me a little paranoid as I know others are using this tool. We specify 1.22 in our go.mod but IIUC you can still build and run with an older compiler. And if you don't run the tests it'll probably be race central. :)
Unless you know something that can lessen my paranoia, I think I'll leave it.

> We specify 1.22 in our go.mod but IIUC you can still build and run with an older compiler.

It used to be like that, but now it's mandatory. From https://go.dev/doc/modules/gomod-ref#go

> The go directive sets the minimum version of Go required to use this module. Before Go 1.21, the directive was advisory only; now it is a mandatory requirement: Go toolchains refuse to use modules declaring newer Go versions.
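
For readers unfamiliar with the idiom under discussion, here is a minimal sketch of the per-iteration copy. Since Go 1.22, modules whose go directive is 1.22 or newer get a fresh loop variable on every iteration, so the copy is redundant there; it remains harmless either way. `collect` is a hypothetical function written for this example.

```go
package main

import (
	"sort"
	"sync"
)

// collect gathers every endpoint from its own goroutine and returns the
// values sorted, so the result does not depend on goroutine scheduling.
func collect(eps []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
	)
	for _, ep := range eps {
		ep := ep // per-iteration copy; only needed when built with pre-1.22 loop semantics
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			out = append(out, ep)
		}()
	}
	wg.Wait()
	sort.Strings(out)
	return out
}
```

Without the copy, under the older semantics every goroutine would share one `ep` variable and could observe whichever value the loop had reached by the time it ran.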

Add a changelog entry for #146, and prepare changelog for v0.16.0. Co-authored-by: Patryk Prus

…s to store (#151) Fix a snag found in #146 where if the "downscaled" annotation/configmap fails to persist, the scale operation is denied, but the pods are not informed via DELETE that they should no longer shutdown.

seizethedave added 3 commits May 9, 2024 14:43

Some readme updates.

06b247b

Undo failed prepare-shutdown calls.

b3b9572

Add some tests for canceling prepare-downscale calls when one fails.

781fe03

seizethedave requested a review from a team as a code owner May 9, 2024 22:28

seizethedave commented May 9, 2024


seizethedave added 2 commits May 10, 2024 09:16

Move comment around; const-ize the max goroutines.

3169750

Update span name

4b445c2

pracucci requested a review from pstibrany May 11, 2024 06:20

pr00se approved these changes May 15, 2024


pstibrany reviewed May 15, 2024


seizethedave added 2 commits May 15, 2024 11:42

Review feedback.

f60279c

Add a comment about >=.

630bb1e

seizethedave merged commit 33c4fcf into grafana:main May 15, 2024
6 checks passed

seizethedave deleted the davidgrant/scaledown-fail-delete branch May 15, 2024 19:47

seizethedave added a commit that referenced this pull request May 22, 2024

Add a changelog entry for #146.

deb93de

seizethedave mentioned this pull request May 22, 2024

Adjust changelog for v0.16.0 #147

Merged

seizethedave added a commit that referenced this pull request May 22, 2024

Adjust changelog for v0.16.0 (#147)

1a50079


seizethedave mentioned this pull request Jun 1, 2024

Admission webhook: Undo prepare-shutdown calls if last-downscale fails to store #151

Merged
