
Bug 1752088: Refactor isDeleteAllowed to remove most logic #49

Conversation

michaelgugino

@michaelgugino michaelgugino commented Jul 10, 2019

Currently, it is impossible to delete a machine from the
cluster if the machine-controller is running on said
machine. This is mostly an artifact of upstream's
inability to smartly detect that there is only one master
running (in order to prevent deletion of the cluster),
and it is not generally desirable.

To support more automated remediation, we should not treat
any particular node specially. Since we drain first,
we will not actually delete the underlying machine-object
until we are successfully started on a new host, preventing
the deletion of the final master.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1752088
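
The change can be pictured as dropping the self-hosting check from isDeleteAllowed. This is a minimal sketch, not the actual controller code; the type and function names are hypothetical stand-ins:

```go
package main

import "fmt"

// Machine is a hypothetical stand-in for the machine-api Machine object.
type Machine struct {
	Name     string
	NodeName string
}

// isDeleteAllowedOld sketches the previous behaviour: deletion was refused
// when the machine-controller's own node matched the machine being deleted.
func isDeleteAllowedOld(m *Machine, controllerNodeName string) bool {
	return m.NodeName != controllerNodeName
}

// isDeleteAllowedNew sketches the refactored behaviour: no machine is treated
// specially; draining first ensures the controller is rescheduled elsewhere
// before the final delete happens.
func isDeleteAllowedNew(m *Machine) bool {
	return true
}

func main() {
	m := &Machine{Name: "master-0", NodeName: "master-0"}
	fmt.Println(isDeleteAllowedOld(m, "master-0")) // false: old logic blocks self-delete
	fmt.Println(isDeleteAllowedNew(m))             // true: safety now comes from draining
}
```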

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 10, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vikaschoudhary16
You can assign the PR to them by writing /assign @vikaschoudhary16 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@michaelgugino michaelgugino force-pushed the modify-delete-protection branch 3 times, most recently from 39246e6 to c7457d3 Compare July 10, 2019 20:23
@vikaschoudhary16

Isn't this PR a duplicate of #48?

Why we need this:

This commit adds the ability to prevent processing of deletion
of machine-objects when the annotation is present.  This
is particularly useful when an automated remediation mechanism
is implemented to serve as a way for administrators to
indicate they do not wish a particular machine to be
remediated for whatever reason.
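
The gate described in that commit amounts to a simple annotation check before processing a deletion. A sketch under stated assumptions: the annotation key and function name here are illustrative, not the project's actual identifiers:

```go
package main

import "fmt"

// preserveAnnotation is an illustrative key; the PR discusses an annotation
// of this kind but the exact name may differ.
const preserveAnnotation = "machine.openshift.io/preserve-instance"

// deletionBlocked reports whether the machine's annotations ask the
// controller to skip remediation/deletion of this machine.
func deletionBlocked(annotations map[string]string) bool {
	_, ok := annotations[preserveAnnotation]
	return ok
}

func main() {
	fmt.Println(deletionBlocked(map[string]string{preserveAnnotation: ""})) // true
	fmt.Println(deletionBlocked(nil))                                       // false
}
```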
Refactor isDeleteAllowed to remove most logic

@michaelgugino michaelgugino force-pushed the modify-delete-protection branch from c7457d3 to ae2a0ed Compare September 13, 2019 17:24
```
@@ -37,6 +37,10 @@ const (
	// MachineClusterIDLabel is the label that a machine must have to identify the
	// cluster to which it belongs.
	MachineClusterIDLabel = "machine.openshift.io/cluster-api-cluster"

	// PreserveInstanceAnnotation prevents a VM from being deleted by the
```

Is there a justification for why we need this atm? If not, I'd prefer not to introduce it in this PR.

@enxebre enxebre changed the title Refactor isDeleteAllowed to remove most logic Bug 1752088: Refactor isDeleteAllowed to remove most logic Sep 16, 2019
@openshift-ci-robot

@michaelgugino: This pull request references Bugzilla bug 1752088, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1752088: Refactor isDeleteAllowed to remove most logic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 16, 2019
@enxebre

enxebre commented Sep 16, 2019

Has this been tested by deleting the machine where the controller is running? Could we have an e2e test for it?

@ingvagabund

/test goimports

@ingvagabund

ingvagabund commented Sep 16, 2019

In 1-master deployments, the machine.openshift.io/cluster-api-cluster annotation must be set. In multi-master deployments, as long as quorum exists, the machine-api plane will get rescheduled to another master node, as Mike describes:

  1. The master machine with the machine-api plane running gets deleted (let's call the machine A).
  2. The master node gets drained.
  3. The machine-api plane gets deleted and re-created on another master (let's call it B).
  4. The machine-controller starts to run on master machine B, notices the original master machine is being deleted, so it reconciles the machine and completes the node draining.
  5. Master machine A's underlying instance gets deleted, then the machine A object gets deleted.
The workflow requires that only a single machine be deleted at a given time. Also, a master machine cannot be deleted while a new master is being created, which is at the moment solely up to an educated and aware admin.
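
The ordering in that workflow can be sketched as a drain-then-delete sequence; this is a toy illustration (names hypothetical), not the reconciler itself:

```go
package main

import "fmt"

// reconcileDeletion sketches the ordering described above: the node is fully
// drained before the cloud instance and the Machine object are removed, so
// the controller has time to come back up on another master first.
func reconcileDeletion(machine string, drained map[string]bool) []string {
	var steps []string
	if !drained[machine] {
		// Draining must complete before any destructive step runs.
		steps = append(steps, "drain node "+machine)
		drained[machine] = true
	}
	steps = append(steps, "delete cloud instance for "+machine)
	steps = append(steps, "remove Machine object "+machine)
	return steps
}

func main() {
	drained := map[string]bool{}
	for _, s := range reconcileDeletion("master-A", drained) {
		fmt.Println(s)
	}
}
```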

@enxebre

enxebre commented Sep 16, 2019

I gave it a run successfully. Other than introducing a new annotation here, I'm ok moving along with this if no one else has concerns. cc @bison @ingvagabund @smarterclayton

@smarterclayton

smarterclayton commented Sep 16, 2019

Interesting question on the 1-master case. Ultimately this needs to be automatic, with a core operator handling the details of how clusters at minimal quorum can upgrade (2 master -> 3 master, 1 master upgrading). This annotation may not be sufficient on its own, because a higher-level integration with actual workloads needs to happen (not breaking quorum, except when we have to make progress, requires slightly more smarts than I think the machine API should have).

As long as the annotation is clearly experimental (we don't support 1-master clusters), I'm ok with that (in doc / comments).

@enxebre

enxebre commented Sep 17, 2019

Thanks @michaelgugino, closing now in favour of #72. I'll create the follow-ups to fetch across providers.

@enxebre enxebre closed this Sep 17, 2019