Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Digital Ocean] Implement KOPS validate cluster #9476

Merged
merged 6 commits into from
Jul 15, 2020

Conversation

srikiz
Copy link
Contributor

@srikiz srikiz commented Jul 2, 2020

Implementation of KOPS validate cluster for Digital Ocean KOPS provider.
Tested with ./kops validate cluster and it returns the state of your KOPS cluster as expected.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @srikiz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 2, 2020
@k8s-ci-robot k8s-ci-robot added the area/provider/digitalocean Issues or PRs related to digitalocean provider label Jul 2, 2020
@hakman
Copy link
Member

hakman commented Jul 2, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 2, 2020
@hakman
Copy link
Member

hakman commented Jul 2, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2020
@rifelpet
Copy link
Member

rifelpet commented Jul 2, 2020

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rifelpet, srikiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 2, 2020
for _, member := range g.Members {

// DO doesn't have a notion of Auto Scaling Group - use the same config Type for both current config and new config.
err := cg.NewCloudInstanceGroupMember(member, g.GroupType, g.GroupType, nodeMap)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means DO won't be able to know when to do a rolling update. All rolling updates would have to use --force.

Copy link
Member

@rifelpet rifelpet Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the DO Cloud, it doesn't (yet) appear to support deleting instances, so I'm guessing rolling updates don't work at all. Perhaps @srikiz can confirm?

// DeleteInstance is not implemented yet, is func needs to delete a DO instance.
func (c *Cloud) DeleteInstance(i *cloudinstances.CloudInstanceGroupMember) error {
klog.V(8).Info("digitalocean cloud provider DeleteInstance not implemented yet")
return fmt.Errorf("digital ocean cloud provider does not support deleting cloud instances at this time")
}
// DetachInstance is not implemented yet. It needs to cause a cloud instance to no longer be counted against the group's size limits.
func (c *Cloud) DetachInstance(i *cloudinstances.CloudInstanceGroupMember) error {
klog.V(8).Info("digitalocean cloud provider DetachInstance not implemented yet")
return fmt.Errorf("digital ocean cloud provider does not support surging")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes @rifelpet - that is something I want to check next.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks like DO doesn't support rolling updates. So that's not a blocker for this PR.

The comment is a bit off-point—the second and third parameters just need to be unequal whenever any field of the godo.DropletCreateRequest used to create the instance was different than what would be used to create a new instance. This can be implemented as either a generation id or a hash of the fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, thanks @johngmyers - I'll keep a note of this when I implement rolling update feature.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@@ -38,6 +39,8 @@ import (

const TagKubernetesClusterIndex = "k8s-index"
const TagKubernetesClusterNamePrefix = "KubernetesCluster"
const TagKubernetesInstanceGroup = "kops-instancegroup"
const DOInstanceGroupConfig = "do-instancegroup-config"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this referenced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, removed DOInstanceGroupConfig

@hakman
Copy link
Member

hakman commented Jul 3, 2020

Let's wait a bit before merging this.
/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2020
@@ -38,6 +39,8 @@ import (

const TagKubernetesClusterIndex = "k8s-index"
const TagKubernetesClusterNamePrefix = "KubernetesCluster"
const TagKubernetesInstanceGroup = "kops-instancegroup"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have a duplicate of do.TagKubernetesClusterInstanceGroupPrefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johngmyers - I think this needs refactoring. I tried to use it and it looks like we have a cyclic dependency here. Ideally, I should have kept this logic (and few other methods) in the do package, but that's a little more work that I thought of.
Would it be okay if we get this merged, and I take up the clean up activity as a separate PR? Please suggest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps move TagKubernetesClusterInstanceGroupPrefix here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed TagKubernetesClusterInstanceGroupPrefix and instead used the above const.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved

@johngmyers
Copy link
Member

The DO code is a bit oversimplified, apparently because DO doesn't support autoscaling groups. But kops doesn't care (much) about the autoscaling aspect of autoscaling groups.

The existing DropletBuilder tags all instances that don't have master role as being in the "nodes" instance group. This ignores the possibility of the admin creating additional instance groups and causes this code to always identify non-master nodes as being in the "nodes" instance group, regardless of which instance group actually created them.

For validation, GetCloudGroups needs to identify:

  • The instance groups deployed (from whatever ASG-equivalent concept the provider code maps instance groups to)
  • The target size of the ASG-equivalent
  • The extant instances, and for each:
    ** Which instance group caused the instance to be created
    ** Which k8s node the instance corresponds to, if the instance has joined the cluster

For rolling update, it needs to identify:
** Whether the instance was created with the version of the spec that is the most current for the ASG-equivalent
** Whether the instance is detached (if the provider supports surging)

For rolling update, it also needs DeleteInstance to terminate the instance and cause a replacement instance to be created. (Except, if it supports surging, to not replace the instance if it was previously detached.)

@johngmyers
Copy link
Member

To further clarify what I wrote, validation will throw failures if there aren't enough instances in a group to cover what that group reports as its target size or if any instances don't have a corresponding node (haven't joined the cluster). Additionally it will throw a failure if GetCloudGroups fails to return a group for an extant InstanceGroup.

Rolling update needs GetCloudGroups to identify which instances are out of date, which NewCloudInstanceGroupMember determines by whether the second and third parameters compare equal. It also needs to be able to terminate instances. It needs validation to eventually succeed after termination, which means that after termination the group needs to eventually have enough instances to match its reported target size.

@srikiz
Copy link
Contributor Author

srikiz commented Jul 3, 2020

To further clarify what I wrote, validation will throw failures if there aren't enough instances in a group to cover what that group reports as its target size or if any instances don't have a corresponding node (haven't joined the cluster). Additionally it will throw a failure if GetCloudGroups fails to return a group for an extant InstanceGroup.

Rolling update needs GetCloudGroups to identify which instances are out of date, which NewCloudInstanceGroupMember determines by whether the second and third parameters compare equal. It also needs to be able to terminate instances. It needs validation to eventually succeed after termination, which means that after termination the group needs to eventually have enough instances to match its reported target size.

  • Thanks, this helps. I need to discuss with @timoreimann if we can support Rolling update for DO. I'll be doing that next week. In the meanwhile, I would like to get this current PR going, if you think it's okay. Below is the validate cluster response -

`Validating cluster dev5.k8s.local

INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
master-tor1-1 Master s-2vcpu-4gb 1 1 tor1
master-tor1-2 Master s-2vcpu-4gb 1 1 tor1
master-tor1-3 Master s-2vcpu-4gb 1 1 tor1
nodes Node s-2vcpu-4gb 2 2 tor1

NODE STATUS
NAME ROLE READY
10.137.176.248 master True
10.137.240.102 master True
10.137.240.248 master True
10.137.240.3 node True
10.137.8.136 node True

Your cluster dev5.k8s.local is ready`


for _, member := range g.Members {

// DO doesn't have a notion of Auto Scaling Group - use the same config Type for both current config and new config.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// DO doesn't have a notion of Auto Scaling Group - use the same config Type for both current config and new config.
// TODO use a hash of the godo.DropletCreateRequest fields for second and third parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, updated.

@johngmyers
Copy link
Member

@srikiz Rolling update support does not block this PR. In my mind the thing that blocks this PR is its identifying all non-master instances as being in the "nodes" instance group. The droplet creation code should tag instances with the name of the instance group that created them so this code can assign them to the right group.

The code for assigning the group of the masters seems unnecessarily complex. If there's a reason the masters need to have special case string manipulation, please explain.

@johngmyers
Copy link
Member

I believe there is scope to share code between this and dotasks.Droplet.Find(). dotasks.Droplet is really a DropletGroup.

@srikiz
Copy link
Contributor Author

srikiz commented Jul 7, 2020

@srikiz Rolling update support does not block this PR. In my mind the thing that blocks this PR is its identifying all non-master instances as being in the "nodes" instance group. The droplet creation code should tag instances with the name of the instance group that created them so this code can assign them to the right group.

The code for assigning the group of the masters seems unnecessarily complex. If there's a reason the masters need to have special case string manipulation, please explain.

Thanks for the inputes - I updated code to also check for instance group called "nodes" if it doesn't match the master. While creating the droplet, we currently tag the master with 3 different tags.

  • kubernetescluster: {clusterName}
  • Kops-instancegroup: master-{region}
  • k8s-index: 1 (will always be 1 for a single master cluster). For a multi-master node, it can be K8s-Index:2, or k8s-index:3
    The instance group itself is a combination of kops-instancegroup and the k8s-index.
    I will keep a note to find if I could update kops-instancegroup itself to the correct value (instead of concatenating with k8s-index).

For the worker nodes we have the below tags.

  • Kubernetescluster: {clusterName}
  • Kops-instancegroup: nodes

I updated code to also check if the matching "kops-instancegroup: nodes" tag is available on the droplet.

@johngmyers
Copy link
Member

Why don't you put the instancegroup's ObjectMeta.Name in kops-instancegroup?

It looks like the code will only return at most one CloudInstanceGroup of non-master type. That means cluster validation will always fail if there are two or more non-master instance groups.

@srikiz
Copy link
Contributor Author

srikiz commented Jul 7, 2020

Why don't you put the instancegroup's ObjectMeta.Name in kops-instancegroup?

It looks like the code will only return at most one CloudInstanceGroup of non-master type. That means cluster validation will always fail if there are two or more non-master instance groups.

I don't know if there is a way in DO to add more non-master instance groups. DO doesn't support any new instance group creation. I am currently managing with the help of tags. I tag droplets that act as worker nodes with "kops-instancegroup:nodes" tag. If there are more than 1 DO KOPS cluster, I currently retrieve all droplets that match the KOPS cluster name here - https://github.com/kubernetes/kops/pull/9476/files#diff-09d5ce1959bcc4dffcb0d3d88a1ffb55R332

I believe DO currently don't have support for adding additional instance groups.

@srikiz
Copy link
Contributor Author

srikiz commented Jul 8, 2020

I believe there is scope to share code between this and dotasks.Droplet.Find(). dotasks.Droplet is really a DropletGroup.

Yeah. I moved the consts from the current package (upup/pkg/fi/cloudup/do) into pkg/resources/digitialocean. It I had to modify droplets.go, api_loadbalancer.go, master_volumes.go, etc. I thought it would be safer to make this change as a separate PR. Hope that's okay.

@johngmyers
Copy link
Member

dotasks.Droplet is essentially a (non-automatic) scaling group. Its RenderDO() handles instance groups by creating enough droplets to satisfy the instancegroup's expected count. I don't see anything that would prevent Droplet from creating droplets for multiple worker instance groups.

Droplet puts an ASG name, which is derived from the instancegroup and cluster names, into the Name field of each droplet it creates. This can be used to determine which instancegroup a given droplet belongs to. Droplet.Find() appears to essentially do this, though it filters for the instancegroup it was instantiated for.

@srikiz
Copy link
Contributor Author

srikiz commented Jul 9, 2020

dotasks.Droplet is essentially a (non-automatic) scaling group. Its RenderDO() handles instance groups by creating enough droplets to satisfy the instancegroup's expected count. I don't see anything that would prevent Droplet from creating droplets for multiple worker instance groups.

Droplet puts an ASG name, which is derived from the instancegroup and cluster names, into the Name field of each droplet it creates. This can be used to determine which instancegroup a given droplet belongs to. Droplet.Find() appears to essentially do this, though it filters for the instancegroup it was instantiated for.

I see, got it. I will put Objectmeta.Name to the kops-instancegroup tag and verify it. Will update the PR shortly.

@srikiz
Copy link
Contributor Author

srikiz commented Jul 14, 2020

@johngmyers - I have incorporated as per your comments - that really helped !
Please do have a look when you get a chance. Thanks !

@hakman
Copy link
Member

hakman commented Jul 14, 2020

/retest

1 similar comment
@srikiz
Copy link
Contributor Author

srikiz commented Jul 14, 2020

/retest

Copy link
Member

@johngmyers johngmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better. I'm putting on a hold in case you want to use the standard type for GroupType. Feel free to cancel the hold if you want it to go in as-is.
/hold
/lgtm

type DOInstanceGroup struct {
ClusterName string
InstanceGroupName string
GroupType string // will be either "master" or "worker"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use pkg/apis/kops/InstanceGroupRole as the type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep a note to update this when I touch the rolling update feature. I'll add a new tag when I create a droplet that holds this information, so I can extract it from the droplet tag.
Thanks for your help in reviewing this @johngmyers !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get the group role from the instancegroup spec of the ig that was in the tag.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 14, 2020
@srikiz
Copy link
Contributor Author

srikiz commented Jul 15, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2020
@k8s-ci-robot k8s-ci-robot merged commit 0658248 into kubernetes:master Jul 15, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 15, 2020
@srikiz srikiz deleted the DO-implement-validate-cluster branch October 25, 2021 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/digitalocean Issues or PRs related to digitalocean provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants