fix: fail fast may cause Serial spec or cleanup Node interrupted #1178

cvvz · 2023-04-04T04:38:00Z

Problem

When using fail-fast, ginkgo start a goroutine to poll from server every 500ms (ABORT_POLLING_INTERVAL) to see if current spec should be interrupted by other process.

There is a chance that before Serial Spec or cleanup Node start to run, the interruption status is still unset, so they won't be skipped and start to run. However, it may get interrupted by other process during running.

Please refer to https://github.com/cvvz/go-playground/tree/master/ginkgo/sample for the steps of reproduction.

Resolution

Wait for ABORT_POLLING_INTERVAL before Serial Spec start to run, this will make sure the interruption status is the latest.
Ignore interruption by other process when running cleanup Nodes.

onsi · 2023-04-04T15:27:24Z

hey there - thanks for taking the time to put together a reproducer and for proposing a fix. this part of the codebase is some of the most complex and I appreciate the effort!

I agree that cleanup nodes should not be interrupted when a different process fails during --fail-fast. That's a great catch and I think your fix is correct.

~~I'm not sure I follow the issue around Serial specs.~~ It's true there is a window of time in which they may start to run and then be interrupted ~~- but I'm not sure I follow why this is problematic~~. I'd prefer to not add a 500ms delay for all suites in the case of Serial specs running. If this window of time is, indeed, problematic I'd prefer a solution where we teach InterruptHandler how to CheckForCrossProcessInterrupts() and call that in suite.go instead of sleeping. ~~But first - can you share some more around why this window of time is proving problematic for you?~~ (edit: never mind... after some thought I think I agree now that we should fix this gap)

Also - we'll need to add tests for these in internal_integration. Particularly for the After* cleanup nodes not aborting. I'd be happy to do that when I pull this in. In fact, I'm happy to take it from here if you'd prefer - or if you'd like to rework the InterruptHandler portion of the PR please feel free to. Let me know either way.

cvvz · 2023-04-06T12:17:38Z

If this window of time is, indeed, problematic I'd prefer a solution where we teach InterruptHandler how to CheckForCrossProcessInterrupts() and call that in suite.go instead of sleeping.

Hi @onsi , I'm not so sure how to call CheckForCrossProcessInterrupts() in suite.go instead of sleeping, could you share more detailes? Or you could just patch on this PR directly, I'm ok with that. Thanks.

onsi · 2023-04-06T13:44:52Z

no worries - i'll try to work on this in the next couple of days.

onsi · 2023-05-03T03:16:38Z

hey - sorry, i had another project take priority over this. it's still on my radar, but it's going to be a few weeks before i can get to it

1. inter-process aborts should not interrupt cleanup nodes 2. whenever we fetch interrupt status, check and see if an abort has happened. if it has ensure we return the latest, correct, abort state. this allows us to avoid accidentally starting the next spec because the ABORT_POLLING_INTERVAL hasn't fired yet

onsi · 2023-05-03T19:54:04Z

hey - i ended up prioritizing this today to get it done. CI is running now but assuming it succeeds i'll be pulling this in and cutting a release. i added tests to cover the case of abort interrupting cleanup nodes.

i also came up with a different approach for closing the gap between when an abort occurs and when the other ginkgo processes notice. now instead of just polling we also check whenever interrupt Status is requested. this ensures that nothing that shouldn't run ever runs (not just serial specs)

cvvz mentioned this pull request Apr 4, 2023

test: speed up e2e test by running in parallel kubernetes-sigs/blob-csi-driver#861

Merged

4 tasks

cvvz and others added 2 commits May 3, 2023 09:06

fix: fail fast may cause Serial spec or cleanup Node interrupted

c8bcfb0

onsi force-pushed the fix-parallel-fail-fast branch from 2a8be0e to 15b4871 Compare May 3, 2023 19:51

onsi added 2 commits May 3, 2023 14:00

fix race condition in internal integration suite

df4693a

fix internal test race condition

e291129

onsi merged commit 8dea88b into onsi:master May 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: fail fast may cause Serial spec or cleanup Node interrupted #1178

fix: fail fast may cause Serial spec or cleanup Node interrupted #1178

cvvz commented Apr 4, 2023 •

edited

Loading

onsi commented Apr 4, 2023 •

edited

Loading

cvvz commented Apr 6, 2023

onsi commented Apr 6, 2023

onsi commented May 3, 2023

onsi commented May 3, 2023

fix: fail fast may cause Serial spec or cleanup Node interrupted #1178

fix: fail fast may cause Serial spec or cleanup Node interrupted #1178

Conversation

cvvz commented Apr 4, 2023 • edited Loading

Problem

Resolution

onsi commented Apr 4, 2023 • edited Loading

cvvz commented Apr 6, 2023

onsi commented Apr 6, 2023

onsi commented May 3, 2023

onsi commented May 3, 2023

cvvz commented Apr 4, 2023 •

edited

Loading

onsi commented Apr 4, 2023 •

edited

Loading