-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(citest): Make integration test instance preemptiable. #2
Conversation
This is surprising to me, given that the tests are already flaky enough (particularly recently) and this appears to be adding another potential source of flakiness (albeit a low probability one). The daily cleanup script should be cleaning these up, avoiding the need to rely on this 24-hour time window. (And if I understand correctly, when these are terminated after 24 hours, they are not actually deleted...the instance and disk are still around, they're just stopped, so we'd still be leaking resources.) |
Is there any monitoring or logging around how often the VMs running the integration tests get pre-empted so we can see how this change has fared? The GCE docs state:
which seems like more flakiness than we'd want in our tests (particularly given that the docs also state that newly-started instances are more likely to be selected for pre-emption). Also, what will it look like in Jenkins if the machine running the CI tests gets pre-empted? I think this would be really really hard to debug, and I'm worried about adding a new (completely avoidable) failure mode at a time when we're trying to onboard new people to an already hard-to-learn system. It seems like this would lead to either: (1) a lot of engineering time trying to dig into these failures, or (2) another class of things where we just restart and hope it passes again. |
As I was curious, I looked at the Stackdriver logs for the project, and it looks like the VMs running the tests have been pre-empted pretty frequently. Here's the query in case that link stops working:
This shows that there have been 20 pre-emptions in the last 2 weeks, which is more than one per day (each of which would almost certainly lead to a failed test). To check whether these are happening shortly after startup (ie, when the tests were running) as opposed to as a final cleanup, I ran this query to spot-check the last few pre-emptions. The query is:
This shows the creation and pre-emption of each of the last 5 pre-emptions. The time from starting to pre-emption was:
So in one case (24h) this caught a machine that was likely leaked, but in the other cases almost certainly caused a flake in the tests. (One could argue about whether the 44 min affected the tests, but the 3 min definitely did.) So it seems like this has significantly added to the flakiness of the tests, and should probably be reverted. |
This should make it so that VMs stood up for integration testing don't live longer than 24 hours, with a small chance they'll get killed mid-testing. If we find that these VMs are getting preempted a lot, we can remove this setting.