Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295

Merged
merged 14 commits into from
Feb 20, 2020

Conversation

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 7, 2020
@openshift-ci-robot
Copy link

@enxebre: This pull request references Bugzilla bug 1796595, which is invalid:

  • expected the bug to target the "4.2.z" release, but it targets "---" instead
  • expected Bugzilla bug 1796595 to depend on a bug in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1796595: Mitigate aws eventual consistency issues

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 7, 2020
@enxebre enxebre changed the title Bug 1796595: Mitigate aws eventual consistency issues Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues Feb 7, 2020
@enxebre
Copy link
Member Author

enxebre commented Feb 7, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 7, 2020
@openshift-ci-robot
Copy link

@enxebre: This pull request references Bugzilla bug 1796595, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 7, 2020
@openshift-ci-robot
Copy link

@enxebre: This pull request references Bugzilla bug 1796595, which is valid.

In response to this:

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alexander-demicev
Copy link

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
Copy link

@bison bison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@enxebre
Copy link
Member Author

enxebre commented Feb 10, 2020

cc @mfojtik

Copy link

@bison bison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bison

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2020
@ashcrow
Copy link
Member

ashcrow commented Feb 10, 2020

I'm having a bit of a hard time following this one. The link BZ's upstream fix that was verified seems different (with a subset of commits). Can I get a quick rundown in this thread so I can better help approve?

@enxebre
Copy link
Member Author

enxebre commented Feb 11, 2020

Hey @ashcrow thanks for having a look.

#269 was included here https://bugzilla.redhat.com/show_bug.cgi?id=1772163 which was then mark as dup of https://bugzilla.redhat.com/show_bug.cgi?id=1761882

#267, #272, #268 and #270 are comments and updates that were introduced as part of the normal release feature development, so has no BZ but are relevant for the context and help fix the problem "Mitigate aws eventual consistency issues".

#280 Fixed a regression introduced by these fixes so it belongs to this PR.

d4912fb Is a transparent change necessary to reduce the conflicts hassle and keep code between releases aligned for easing farther back ports.

@bparees
Copy link

bparees commented Feb 19, 2020

@enxebre

https://bugzilla.redhat.com/show_bug.cgi?id=1796595 depends on a 4.4.0 BZ, it needs to depend on a 4.3.z BZ so we can confirm this was fixed in 4.3.z before merging the change into 4.2.z.

(the bot should have told you this, i'm looking into why it did not)

@bparees bparees added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Feb 19, 2020
@enxebre
Copy link
Member Author

enxebre commented Feb 19, 2020

/bugzilla refresh
@bparees sorted the bz wrong referencing. PTAL

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Feb 19, 2020
@openshift-ci-robot
Copy link

@enxebre: This pull request references Bugzilla bug 1796595, which is valid.

In response to this:

/bugzilla refresh
@bparees sorted the bz wrong referencing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 19, 2020
@bparees bparees added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Feb 19, 2020
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2020
enxebre and others added 14 commits February 20, 2020 09:47
This is to satisfy contract with the machine controller so it can react appropriately and signal a machine as failed with the given error message when an InvalidMachineConfiguration error is reported, main cases are:
- The spec is invalid:
	Unable to be unmarshalled
	Missing expected labels by design
	Referencing non existing kubernetes resources
- The input values provoke a client error 4xx response from the cloud call to create a machine.
This changes the behavior of the actuator's `getMachineInstances`
method to first look for a machine's backing instance by the instance
ID in the provider status.  If the instance ID is not set, it falls
back to the previous method of filtering by tags.  This should help
prevent issues where AWS creates the instance, but takes some time to
propagate tag values, which can result in Exists() wrongly returning
false.
This attemps to mitigate https://bugzilla.redhat.com/show_bug.cgi?id=1772163. Instead of telling the API request to filter by state, we fetch all and discriminate only terminated instances
…tances

getInstances has taken an instanceStateFilter argument since it landed
as GetInstances in 72b4823 (Migrate cluster operator implementation
of the aws actuator, 2018-08-04,
openshift#2).
…ancesOutput

So we can create pending, terminated, etc. instances for testing.
This should scale better than
stub{State}InstanceDescribeInstancesOutput from
openshift/cluster-api-provider-aws@c8e341f22f (Stop retieving
instances by state on API call, 2019-11-14,
openshift#269).
…ceByID

openshift/cluster-api-provider-aws@c8e341f22f (Stop retieving
instances by state on API call, 2019-11-14,
openshift#269) did part of this.  This commit:

* Restores the instanceStateFilter argument to getInstanceByID, to
  avoid having two definitions (existingInstanceStates and a local !=
  Terminated in getExistingInstanceByID) that define what it means to
  exist.  This will also give us logged error messages like:

    instance i-123 state terminated is not in running, pending, stopped, stopping, shutting-down

  which will make it easier to find and fix omissions in
  existingInstanceStates.

* Adds a -running suffix to has-status-search-by-id, so it's easier to
  distinguish from the terminated case.

* Converts to the hyphenated has-status-search-by-id-terminated,
  instead of mixing hyphens and spaces in a case name.

* Uses inline argument for the DescribeInstances mocks, so we don't
  have to use request and request2.  These aren't so big that inlining
  them makes the mock setup unreadable.

* Returns an empty set of instances in the instance-listing fallback
  call in has-status-search-by-id-terminated (which we now require to
  happen after the by-ID call).  The list call is filtered by state,
  so it wouldn't return a terminated instance.
Like openshift/cluster-api-provider-aws@e4e94b1e08
(pkg/actuators/machine/utils: Client-side state filtering in
getInstanceByID, 2019-11-13, openshift#268),
but for getInstances.  The cluster/name filtering is already enough to
limit the list response to a reasonable size, so we can perform the
state filtering locally in order to produce useful error messages.

Also fix a bug in TestGetMachineInstances about the expected return
type of getMachineInstances; it's a slice, not a single instance.

Also, fix a few "Running" -> InstanceStateNameRunning bugs; the AWS
string is "running", but we shouldn't care either way.
… comment

The updateProviderID comment was probably a copy/paste error when
setMachineCloudProviderSpecifics landed in 716083f (Set additional
machine annotations/labels to get pretty machine output, 2019-07-23,
openshift#242).  I haven't read through the
function carefully enough to be able to propose a comment that adds
much beyond the function name, which is fairly clear on its own.  But
I know the updateProviderID comment is not a good fit ;).
Avoid [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep -B5 'ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine'
  2019-11-18T01:08:14.686879882Z I1118 01:08:14.686811       1 actuator.go:552] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Found instance by id: i-04e60273f38857eed
  2019-11-18T01:08:14.686879882Z I1118 01:08:14.686839       1 actuator.go:502] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Instance exists as "i-04e60273f38857eed"
  2019-11-18T01:08:14.686879882Z I1118 01:08:14.686853       1 controller.go:284] Reconciling machine "ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696" triggers idempotent update
  2019-11-18T01:08:14.686879882Z I1118 01:08:14.686862       1 actuator.go:406] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: updating machine
  2019-11-18T01:08:14.68696548Z I1118 01:08:14.686938       1 actuator.go:414] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: obtaining EC2 client for region
  2019-11-18T01:08:14.741237517Z I1118 01:08:14.741189       1 actuator.go:430] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine

by waiting a bit for AWS to resolve any eventual-consistency issues
[2] before giving up on a machine and claiming it failed.  With this
commit, we'll try again until requeueAfterSeconds have passed since
the last status update when we have a machine ID to look for (from
creation).

The efficiency of the retry request is weakened a bit by the AWS
provider / machine controller pairing apparently ignoring requeue
delays [3].  And if we respected retries like that, we probably
wouldn't hit the guard this commit is adding.  But a busy loop here is
still better than abandoning the machine, and we don't have a better
fix for the ignored requeue delay in the pipe.

Personally, I'd rather tie this guard to the Provisioning and
Provisioned machine phases, or the MachineCreationSucceeded condition,
or some such to avoid excessive delays if folks were bumping the
machine status and LastUpdated for other reasons.  But Alberto points
out that machine phases are supposed to be a higher-level abstraction
that doesn't leak down into the provider [4] and he prefers switching
on ProviderID [5], so that's what we have in this commit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c11
[2]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1772192
[4]: openshift#271 (comment)
[5]: openshift#271 (comment)
Same guard Update grew in
openshift/cluster-api-provider-aws@09c1fe9cee9e68
(pkg/actuators/machine/actuator: Eventual-consistency wait for
machines, 2019-11-18, openshift#271), but now
for Exists and Describe.  This should protect us from flows like [1]:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/pods/machine-api-controllers-9bd7457fc-tfdmg/machine-controller/machine-controller/logs/current.log | grep ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz
  ...
  2019-11-22T17:39:34.642709291Z I1122 17:39:34.642651       1 actuator.go:194] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: ProviderID updated at machine spec: aws:///us-east-1a/i-08b09536e36303450
  ...
  2019-11-22T17:39:34.660245157Z I1122 17:39:34.660202       1 actuator.go:668] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance state still pending, returning an error to requeue
  2019-11-22T17:39:34.660327796Z W1122 17:39:34.660285       1 controller.go:321] Failed to create machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz": requeue in: 20s
  2019-11-22T17:39:34.660426954Z I1122 17:39:34.660380       1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz"
  ...
  2019-11-22T17:39:34.660426954Z I1122 17:39:34.660416       1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists
  2019-11-22T17:39:34.732221813Z I1122 17:39:34.732177       1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist
  2019-11-22T17:39:34.732221813Z I1122 17:39:34.732202       1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Failed"
  ...

Also, because the code was basically the same between Exists and
Describe, I've refactored Exists to be a thin wrapper around Describe.
It's unfortunate that Describe does not take a Context argument, but
there's no regression because Exists was ignoring its Context argument
before anyway ;).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1761882#c15
If the Spec.ProviderID field is present but an empty string,
we may encounter issues.  This is particularly the case in
DR scenarios when a user is attempting to copy an existing
machine in attempt to create one.
@openshift-ci-robot openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 20, 2020
@enxebre
Copy link
Member Author

enxebre commented Feb 20, 2020

Rebased after #294 made it in. Should be good now if tests say so.

Copy link

@alexander-demicev alexander-demicev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 20, 2020
@openshift-merge-robot openshift-merge-robot merged commit 75ed797 into openshift:release-4.2 Feb 20, 2020
@openshift-ci-robot
Copy link

@enxebre: All pull requests linked via external trackers have merged. Bugzilla bug 1796595 has been moved to the MODIFIED state.

In response to this:

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants