Bug 1744245: fix e2e failure #1001

Merged

Conversation

tkashem
Collaborator

@tkashem tkashem commented Aug 20, 2019

Deleting a Subscription object and immediately re-creating it causes the
operator install to fail.

How to reproduce:

  • Create a CatalogSource object.
  • Create a Subscription that references the CatalogSource above.
  • Wait for the operator to install successfully.
  • Update the CatalogSource.
  • Wait for the CatalogSource to become healthy.
  • Delete the Subscription object (from above).
  • Re-create the Subscription object immediately, with no delay
    between delete and create. The two calls can be issued one after
    another; they do not need to be concurrent.

The operator install fails, and the Subscription status reports the error
condition ReferencedInstallPlanNotFound. The new InstallPlan object
created by OLM is deleted by the garbage collector (GC).

Root cause:

  • OLM uses a lister to get the Subscriptions in a given namespace and
    sets the relevant Subscriptions found in that list as owners of the
    InstallPlan object(s).
  • Because the lister reads from a cache, it can return a deleted
    Subscription until the cache catches up.
  • The new InstallPlan object may therefore get an owner reference that
    points to the deleted Subscription.
  • GC finds that the InstallPlan's owner no longer exists and
    consequently deletes the new InstallPlan.
  • The Subscription reconciler reports that the new InstallPlan object
    is missing and moves the Subscription to a Failed state.
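
The sequence above can be modelled in a few lines. The sketch below is a toy simulation, not OLM code: `FakeAPIServer`, `CachedLister`, `create_install_plan`, and the `uid-N` values are all hypothetical names invented for illustration, and the explicit `sync()` call stands in for a real informer cache, which is updated from watch events rather than on demand. It assumes only what the root-cause list states plus the Kubernetes rule that owner references are matched by UID.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Subscription:
    name: str
    uid: str


@dataclass
class InstallPlan:
    name: str
    owner_uids: list


class FakeAPIServer:
    """Toy stand-in for the API server: the source of truth (hypothetical)."""

    def __init__(self):
        self._uids = count(1)
        self.subscriptions = {}   # name -> Subscription
        self.install_plans = {}   # name -> InstallPlan

    def create_subscription(self, name):
        # Every created object gets a fresh UID, even if the name is reused.
        sub = Subscription(name, f"uid-{next(self._uids)}")
        self.subscriptions[name] = sub
        return sub

    def delete_subscription(self, name):
        del self.subscriptions[name]

    def list_subscriptions(self):
        return list(self.subscriptions.values())

    def collect_garbage(self):
        """Delete dependents none of whose owner UIDs still exist."""
        live = {s.uid for s in self.subscriptions.values()}
        for name, plan in list(self.install_plans.items()):
            if not any(uid in live for uid in plan.owner_uids):
                del self.install_plans[name]


class CachedLister:
    """Serves a snapshot that is refreshed only when sync() is called."""

    def __init__(self, api):
        self._api = api
        self._cache = []

    def sync(self):
        self._cache = self._api.list_subscriptions()

    def list(self):
        return list(self._cache)


def create_install_plan(api, name, owners):
    # Record the UID of every owner as an owner reference.
    plan = InstallPlan(name, [s.uid for s in owners])
    api.install_plans[name] = plan
    return plan


def reproduce():
    api = FakeAPIServer()
    lister = CachedLister(api)

    old = api.create_subscription("my-operator")
    lister.sync()                                  # cache holds the original

    api.delete_subscription("my-operator")
    new = api.create_subscription("my-operator")   # same name, new UID

    # The cache has not caught up, so the owner is the deleted object.
    plan = create_install_plan(api, "install-1", lister.list())
    api.collect_garbage()
    return old, new, plan, api
```

After `reproduce()` runs, the re-created Subscription still exists but `install-1` is gone: its only owner reference carried the UID of the deleted object, so the simulated GC removed it.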

The API audit log contains entries confirming that GC is correctly
deleting the new InstallPlan object.

Fix:

  • For now, use a direct, non-cached client to retrieve the list of
    Subscriptions.
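
To illustrate why reading from the API server directly avoids the problem, here is a minimal sketch comparing the two lookups. It is an assumption-laden model rather than the change made in this PR: `owners_for_install_plan`, `survives_gc`, and the `uid-1`/`uid-2` values are hypothetical, and the two lambdas merely stand in for a cached lister and a direct client.

```python
def owners_for_install_plan(list_subscriptions):
    """Return the UIDs to record as owner references (hypothetical helper)."""
    return [sub["uid"] for sub in list_subscriptions()]


def survives_gc(owner_uids, live_uids):
    """A dependent survives while at least one owner UID still exists."""
    return any(uid in live_uids for uid in owner_uids)


# Source of truth after the delete and re-create: only the new object exists.
api_state = [{"name": "my-operator", "uid": "uid-2"}]
# Snapshot taken before the delete: it still holds the old object.
stale_cache = [{"name": "my-operator", "uid": "uid-1"}]

live_uids = {sub["uid"] for sub in api_state}

# Cached path: owners come from the stale snapshot.
cached_owners = owners_for_install_plan(lambda: list(stale_cache))
# Direct path: owners come from the source of truth.
direct_owners = owners_for_install_plan(lambda: list(api_state))

cached_survives = survives_gc(cached_owners, live_uids)
direct_survives = survives_gc(direct_owners, live_uids)
```

With the direct lookup the owner reference carries the UID of the live Subscription, so the InstallPlan is kept. The likely cost is an extra request to the API server on each lookup, which may be why the fix is described as "for now".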

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2019
@openshift-ci-robot
Collaborator

@tkashem: This pull request references Bugzilla bug 1737081, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[WIP] Bug 1737081: delete and recreate of subscription should not fai…

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 20, 2019
@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 20, 2019
@tkashem tkashem force-pushed the fix-missing-ip branch 2 times, most recently from 7f61d32 to 07cbe84 Compare August 20, 2019 23:28
@tkashem tkashem changed the title [WIP] Bug 1737081: delete and recreate of subscription should not fai… Bug 1737081: fix e2e failure Aug 20, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2019
@ecordell
Member

Can we pull in @galletti94 's test case and get both in this PR?

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1744245
Jira: https://jira.coreos.com/browse/OLM-1245
@tkashem
Collaborator Author

tkashem commented Aug 21, 2019

Can we pull in @galletti94 's test case and get both in this PR?

Do you want both fixes in the same PR? That might delay other PRs that are blocked on this one. Can we get this merged to unblock folks, and then focus on the issue @galletti94 is working on?

@ecordell
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 21, 2019
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2019
@ecordell ecordell changed the title Bug 1737081: fix e2e failure Bug 1744245: fix e2e failure Aug 21, 2019
@openshift-ci-robot
Collaborator

@tkashem: This pull request references Bugzilla bug 1744245, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1744245: fix e2e failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ecordell
Member

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 21, 2019
@openshift-ci-robot
Collaborator

@ecordell: This pull request references Bugzilla bug 1744245, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 21, 2019
@openshift-merge-robot openshift-merge-robot merged commit 53524d1 into operator-framework:master Aug 21, 2019
@openshift-ci-robot
Collaborator

@tkashem: All pull requests linked via external trackers have merged. Bugzilla bug 1744245 has been moved to the MODIFIED state.

In response to this:

Bug 1744245: fix e2e failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tkashem tkashem deleted the fix-missing-ip branch August 21, 2019 18:14
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.