
fix(citest): Make integration test instance preemptiable. #2

Merged: 1 commit merged into spinnaker:master on Feb 12, 2020

Conversation

@ttomsu (Member) commented Feb 7, 2020

This should make it so that VMs stood up for integration testing don't live longer than 24 hours, at the cost of a small chance that they'll get preempted mid-test. If we find that these VMs are getting preempted a lot, we can remove this setting.
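For reference (not part of this PR's diff), a minimal sketch of what the preemptible scheduling setting looks like through the GCE API's Python client; the project, zone, image, and instance name below are placeholders, not the values the CI scripts actually use:

from googleapiclient import discovery

# Hypothetical sketch: create a preemptible instance. All names and values
# here are placeholders; the real CI scripts configure this differently.
compute = discovery.build("compute", "v1")

body = {
    "name": "citest-example",
    "machineType": "zones/us-west2-c/machineTypes/n1-standard-4",
    "scheduling": {
        "preemptible": True,        # lives at most 24h, may be preempted sooner
        "automaticRestart": False,  # must be False for preemptible instances
    },
    "disks": [{
        "boot": True,
        "autoDelete": True,
        "initializeParams": {
            "sourceImage": "projects/debian-cloud/global/images/family/debian-10",
        },
    }],
    "networkInterfaces": [{"network": "global/networks/default"}],
}

compute.instances().insert(
    project="example-project", zone="us-west2-c", body=body
).execute()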

@ttomsu ttomsu merged commit 414bc35 into spinnaker:master Feb 12, 2020
@ttomsu ttomsu deleted the premptable branch February 12, 2020 15:31
@ezimanyi (Contributor) commented Mar 4, 2020

This is surprising to me, given that the tests are already flaky enough (particularly recently) and this appears to add another potential source of flakiness (albeit a low-probability one). The daily cleanup script should already be cleaning these up, avoiding the need to rely on this 24-hour window. (And if I understand correctly, when these instances are terminated after 24 hours they are not actually deleted; the instance and disk are still around, just stopped, so we'd still be leaking resources.)

@ezimanyi (Contributor) commented Mar 4, 2020

Is there any monitoring or logging around how often the VMs running the integration tests get preempted, so we can see how this change has fared? The GCE docs state:

For reference, we've observed from historical data that the average preemption rate varies between 5% and 15% per day per project, on a seven-day average, occasionally spiking higher depending on time and zone.

which seems like more flakiness than we'd want in our tests (particularly given that the docs also state that newly started instances are more likely to be selected for preemption).

Also, what will it look like in Jenkins if the machine running the CI tests gets preempted? I think this would be really hard to debug, and I'm worried about adding a new (completely avoidable) failure mode at a time when we're trying to onboard new people to an already hard-to-learn system. It seems like this would lead to either: (1) a lot of engineering time spent digging into these failures, or (2) another class of things where we just restart and hope it passes again.

@ezimanyi (Contributor) commented Mar 4, 2020

Out of curiosity, I looked at the Stackdriver logs for the project, and it looks like the VMs running the tests have been preempted pretty frequently. Here's the query, in case that link stops working:

resource.type="gce_instance"
operation.producer="compute.instances.preempted"

This shows 20 preemptions in the last 2 weeks, which is more than one per day (each of which would almost certainly lead to a failed test).
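For anyone re-running this later, a rough sketch of pulling the same count with the google-cloud-logging Python client (assuming application-default credentials with access to the project; the timestamp cutoff is only an example):

from google.cloud import logging

# Rough sketch: count recent preemption events using the same filter as above.
client = logging.Client(project="spinnaker-community")
preemption_filter = (
    'resource.type="gce_instance" '
    'operation.producer="compute.instances.preempted" '
    'timestamp>="2020-02-18T00:00:00Z"'  # example cutoff, roughly 2 weeks back
)
entries = list(client.list_entries(filter_=preemption_filter))
print(len(entries), "preemptions")
for entry in entries:
    print(entry.timestamp, entry.resource.labels.get("instance_id"))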

To check whether these are happening shortly after startup (i.e., while the tests were running) as opposed to at the final cleanup, I ran this query to spot-check the last few preemptions:

resource.type="gce_instance"
operation.last=true
(protoPayload.methodName="v1.compute.instances.insert" OR operation.producer="compute.instances.preempted")
(protoPayload.resourceName="projects/spinnaker-community/zones/us-west2-c/instances/jenkins-validate-bom-k8sk-4238"
OR protoPayload.resourceName="projects/spinnaker-community/zones/us-west2-c/instances/jenkins-validate-bom-vmg-4229"
OR protoPayload.resourceName="projects/spinnaker-community/zones/us-west2-c/instances/jenkins-validate-bom-vmg-4216"
OR protoPayload.resourceName="projects/spinnaker-community/zones/us-west2-c/instances/jenkins-validate-bom-k8sk-4216"
OR protoPayload.resourceName="projects/spinnaker-community/zones/us-west2-c/instances/jenkins-validate-bom-k8sk-4215")

This shows the creation and preemption events for each of the last 5 preempted instances. The time from startup to preemption was:

  • 3 min
  • 3 min
  • 24 h
  • 3 min
  • 44 min

So in one case (the 24 h one) this caught a machine that was likely leaked, but in the other cases it almost certainly caused a test flake. (One could argue about whether the 44 min instance affected the tests, but the 3 min ones definitely did.)
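(The lifetimes above are just the difference between each instance's insert timestamp and its preempted timestamp; a trivial sketch with made-up timestamps:)

from datetime import datetime, timezone

# Made-up timestamps standing in for one instance's insert and preempted
# log entries from the query above.
inserted = datetime(2020, 3, 3, 14, 2, 11, tzinfo=timezone.utc)
preempted = datetime(2020, 3, 3, 14, 5, 20, tzinfo=timezone.utc)
print(preempted - inserted)  # 0:03:09, i.e. roughly a 3-minute lifetime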

So it seems like this has significantly added to the flakiness of the tests, and should probably be reverted.
