
roachprod: azure gc cleans up empty clusters #109438

Merged 1 commit into cockroachdb:master on Aug 31, 2023

Conversation

DarrylWong
Contributor

Azure allows for clusters to exist with no attached VM, which roachprod assumes is not possible. This would occur if azure.create failed after creating a resource group but before creating a VM. Roachprod GC only searches for VMs when searching for clusters, so these VM-less clusters would never be found and cleaned up.

This change adds the ability to list empty clusters by querying resource groups instead of VMs. This is used by GC and destroy jobs to correctly identify and clean up these dangling resources.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-3373
Release note: None
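
As context for how that resource-group query might look, here is a minimal sketch (not the PR's actual code) using the legacy Azure SDK resources.GroupsClient and the vm.TagRoachprod tag that appear in snippets later in this thread; the function name listEmptyClusters and the foundClusters parameter are illustrative only:

// listEmptyClusters is a hypothetical helper: it lists roachprod-tagged
// resource groups and returns the ones for which no VM was found, so they
// can be surfaced to GC and destroy as empty clusters.
// Assumed imports: context, the legacy Azure SDK "resources" package, and
// roachprod's "vm" package.
func listEmptyClusters(
	ctx context.Context, subscriptionID string, foundClusters map[string]bool,
) ([]string, error) {
	groupsClient := resources.NewGroupsClient(subscriptionID)
	it, err := groupsClient.ListComplete(ctx, "" /* filter */, nil /* top */)
	if err != nil {
		return nil, err
	}
	var empty []string
	for it.NotDone() {
		rg := it.Value()
		// Skip resource groups that were not created by roachprod.
		if _, ok := rg.Tags[vm.TagRoachprod]; ok {
			// A roachprod resource group with no listed VMs is a dangling,
			// empty cluster.
			if name := *rg.Name; !foundClusters[name] {
				empty = append(empty, name)
			}
		}
		if err := it.NextWithContext(ctx); err != nil {
			return nil, err
		}
	}
	return empty, nil
}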

@cockroach-teamcity
Member

This change is Reviewable

@DarrylWong DarrylWong marked this pull request as ready for review August 24, 2023 16:53
@DarrylWong DarrylWong requested a review from a team as a code owner August 24, 2023 16:53
@DarrylWong DarrylWong requested review from srosenberg and smg260 and removed request for a team August 24, 2023 16:53
@DarrylWong DarrylWong linked an issue Aug 24, 2023 that may be closed by this pull request
Contributor

@renatolabs renatolabs left a comment

Nice work! Thanks for fixing this longstanding issue 🎉

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @smg260, and @srosenberg)


-- commits line 16 at r1:
Should we also include:

Fixes: #58668

?


pkg/roachprod/cloud/cluster_cloud.go line 237 at r1 (raw file):

	for _, c := range cloud.Clusters {
		if len(c.VMs) == 0 {
			return nil, errors.Errorf("found no VMs in Cluster %s", c.Name)

Nittest of nits: cluster (no need for capitalization)


pkg/roachprod/cloud/gc.go line 44 at r1 (raw file):

	exp := c.ExpiresAt()
	// Clusters without VMs shouldn't exist and are likely dangling resources.
	if c.VMs[0].EmptyCluster {

It wouldn't be a bad idea to make this check a method on Cluster to contain the knowledge of VMs[0].EmptyCluster. It's a bit of a workaround approach, so if we want to change how we deal with this situation later, the encapsulation will help.
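
For illustration, the suggested encapsulation could be as small as the following (the method name IsEmptyCluster is hypothetical, not necessarily what the PR ended up using):

// IsEmptyCluster reports whether this cluster consists only of the fake
// placeholder VM that providers create for VM-less resource groups.
func (c *Cluster) IsEmptyCluster() bool {
	return len(c.VMs) > 0 && c.VMs[0].EmptyCluster
}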


pkg/roachprod/vm/vm.go line 127 at r1 (raw file):

	// EmptyCluster indicates that the VM does not exist. Certain providers allow for empty
	// clusters, but roachprod does not allow VM-less clusters except when deleting them.
	// A fake VM may be used in this scenario.

Some comments on the doc:

  • would be good to call out that this only applies to Azure (instead of "certain providers")
  • a fake VM will be used in this case.
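
Putting both points together, the field doc might read roughly as follows (wording is only a suggestion):

// EmptyCluster indicates that the VM does not exist. Azure allows resource
// groups (clusters) to exist with no VMs, but roachprod does not allow
// VM-less clusters except when deleting them; a fake VM will be used in
// that scenario.
EmptyCluster bool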

pkg/roachprod/vm/azure/azure.go line 554 at r1 (raw file):

		// Clusters that have a VM were already found above. We just want
		// to add the VM-less clusters.
		foundClusters := make(map[string]bool)

Might as well keep track of these clusters while we're iterating over VMs in the loop above. The (very) small memory footprint is justified by the removal of this loop, IMO.
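
As a rough sketch of that suggestion (variable names illustrative), the existing VM loop could populate the map directly:

foundClusters := make(map[string]bool)
for _, v := range vms {
	// ... existing per-VM handling ...
	if clusterName, err := v.ClusterName(); err == nil {
		// Remember which clusters already have a real VM so the resource
		// group pass below only needs to add the VM-less ones.
		foundClusters[clusterName] = true
	}
}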


pkg/roachprod/vm/azure/azure.go line 558 at r1 (raw file):

			clusterName, err := v.ClusterName()
			if err != nil {
				v.Errors = append(v.Errors, vm.ErrInvalidName)

If you do decide to keep this loop, I don't think we need to do this; if anything, this is already done by the caller (ListCloud)


pkg/roachprod/vm/azure/azure.go line 564 at r1 (raw file):

		for it.NotDone() {
			found := it.Value()

Nit: maybe a better name would be resourceGroup? I see that the loop above calls it found, but I kept thinking "what was found again?" when reviewing the code.


pkg/roachprod/vm/azure/azure.go line 576 at r1 (raw file):

			}

			if _, ok := found.Tags[vm.TagRoachprod]; !ok {

We should probably move this check to the very top of the loop.


pkg/roachprod/vm/azure/azure.go line 585 at r1 (raw file):

			tags := make(map[string]string)
			for key, value := range found.Tags {
				tags[key] = *value

Is there any code that relies on these tags existing?


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

			} else {
				m.Errors = append(m.Errors, vm.ErrNoExpiration)
			}

In the case of IncludeEmptyClusters, these parse errors look to me like they should be hard errors (i.e., returned to the caller), since we'll rely on them during GC, no?

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


-- commits line 16 at r1:

Previously, renatolabs (Renato Costa) wrote…

Should we also include:

Fixes: #58668

?

Done.


pkg/roachprod/cloud/cluster_cloud.go line 237 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nittest of nits: cluster (no need for capitalization)

Done.


pkg/roachprod/cloud/gc.go line 44 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

It wouldn't be a bad idea to make this check a method on Cluster to contain the knowledge of VMs[0].EmptyCluster. It's a bit of a workaround approach, so if we want to change how we deal with this situation later, the encapsulation will help.

Done.


pkg/roachprod/vm/vm.go line 127 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Some comments on the doc:

  • would be good to call out that this only applies to Azure (instead of "certain providers")
  • a fake VM will be used in this case.

Done.


pkg/roachprod/vm/azure/azure.go line 554 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Might as well keep track of these clusters while we're iterating over VMs in the loop above. The (very) small memory footprint is justified by the removal of this loop, IMO.

Done.


pkg/roachprod/vm/azure/azure.go line 558 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

If you do decide to keep this loop, I don't think we need to do this; if anything, this is already done by the caller (ListCloud)

Done.


pkg/roachprod/vm/azure/azure.go line 576 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

We should probably move this check to the very top of the loop.

Done.


pkg/roachprod/vm/azure/azure.go line 585 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Is there any code that relies on these tags existing?

Nope, removed.


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

In the case of IncludeEmptyClusters, these parse errors look to me like they should be hard errors (i.e., returned to the caller), since we'll rely on them during GC, no?

If we make it a hard error won't GC error out too (since it calls ListCloud) and be unable to clear up the empty clusters anyway? It would be unable to clean up any clusters in general even if they have a correctly parsed expiration time just because one empty cluster doesn't. It would also remove the ability for the user to just call roachprod destroy directly on the cluster.

Would logging a warning telling the user that they might have to manually destroy the cluster be a better alternative?

Contributor

@renatolabs renatolabs left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

If we make it a hard error won't GC error out too (since it calls ListCloud) and be unable to clear up the empty clusters anyway? It would be unable to clean up any clusters in general even if they have a correctly parsed expiration time just because one empty cluster doesn't. It would also remove the ability for the user to just call roachprod destroy directly on the cluster.

Would logging a warning telling the user that they might have to manually destroy the cluster be a better alternative?

Yeah, you're right, there wouldn't be much we could do in this case. If we were indeed to hit this case, c.CreatedAt would be empty and GC would destroy the resource group (as it should).

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Yeah, you're right, there wouldn't be much we could do in this case. If we were indeed to hit this case, c.CreatedAt would be empty and GC would destroy the resource group (as it should).

Ahh that is something I didn't think to test, how GC would deal with malformed tags/vm names. Just tested it and it tries to delete each VM individually which obviously doesn't work in this case since the VM is fake for empty clusters. Ends up trying to delete a VM that doesn't exist and erroring out. Adding some logic to treat these as clusters.

Contributor

@renatolabs renatolabs left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

Ahh that is something I didn't think to test, how GC would deal with malformed tags/vm names. Just tested it and it tries to delete each VM individually which obviously doesn't work in this case since the VM is fake for empty clusters. Ends up trying to delete a VM that doesn't exist and erroring out. Adding some logic to treat these as clusters.

Not sure I get it -- why do we need the new badClusters logic? Wouldn't it be sufficient to not add the VM to badVMs when vm.EmptyCluster?

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Not sure I get it -- why do we need the new badClusters logic? Wouldn't it be sufficient to not add the VM to badVMs when vm.EmptyCluster?

We call [i] on badVMs, where p.Delete for azure is [ii].

When it calls client.Delete it errors out for empty clusters, since it expects a VM name to be passed in. Our VM in the case of an empty cluster is fake, so Azure won't find anything and will return an error.

I suppose the logic of checking if a VM is fake and creating a resource group client to delete it could be moved into azure.Delete if that makes it easier to understand. So maybe something like [iii]. This way we can just add the empty VMs to badVMs and have Delete deal with it.

I think the reason why this is only a badVMs GC issue is that this is the only place that tries to delete a single VM rather than the entire cluster.

Code snippet (i):

if len(badVMs) > 0 {
	// Destroy bad VMs.
	err := vm.FanOut(badVMs, func(p vm.Provider, vms vm.List) error {
		return p.Delete(l, vms)
	})
	if err != nil {
		postError(l, client, channel, err)
	}
}

Where p.Delete in azure is:

Code snippet (ii):

func (p *Provider) Delete(l *logger.Logger, vms vm.List) error {
	...
	client := compute.NewVirtualMachinesClient(*sub.ID)
	...
	for _, vm := range vms {
		...
		future, err := client.Delete(ctx, parts.resourceGroup, parts.resourceName, nil)
		...
	}
}

Code snippet (iii):

func (p *Provider) Delete(l *logger.Logger, vms vm.List) error {
	...
	client := compute.NewVirtualMachinesClient(*sub.ID)
	groupsClient := resources.NewGroupsClient(*sub.ID)
	...
	for _, vm := range vms {
		...
		if vm.EmptyCluster {
			future, err := groupsClient.Delete(ctx, resourceGroup)
		} else {
			future, err := client.Delete(ctx, parts.resourceGroup, parts.resourceName, nil)
		}
		...
	}
}

Contributor

@renatolabs renatolabs left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

We call [i] on badVMs, where p.Delete for azure is [ii].

When it calls client.Delete it errors out for empty clusters, since it expects a VM name to be passed in. Our VM in the case of an empty cluster is fake, so Azure won't find anything and will return an error.

I suppose the logic of checking if a VM is fake and creating a resource group client to delete it could be moved into azure.Delete if that makes it easier to understand. So maybe something like [iii]. This way we can just add the empty VMs to badVMs and have Delete deal with it.

I think the reason why this is only a badVMs GC issue is that this is the only place that tries to delete a single VM rather than the entire cluster.

I think I see what you're saying, but I'm still not sure I know the answer to my original question: if we just don't add the fake VM to the list of badVMs, wouldn't that be enough? We wouldn't attempt to delete that non-existent VM, and the resource group would ultimately be deleted by the existing logic of adding the cluster/resource group to the destroy list in *status.

E.g.:

	var badVMs vm.List
	for _, vm := range cloud.BadInstances {
		// We skip fake VMs and only delete "bad vms" if they were created more than 1h ago.
		if !vm.EmptyCluster && now.Sub(vm.CreatedAt) >= time.Hour {
			badVMs = append(badVMs, vm)
		}
	}

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I think I see what you're saying, but I'm still not sure I know the answer to my original question: if we just don't add the fake VM to the list of badVMs, wouldn't that be enough? We wouldn't attempt to delete that non-existent VM, and the resource group would ultimately be deleted by the existing logic of adding the cluster/resource group to the destroy list in *status.

E.g.:

	var badVMs vm.List
	for _, vm := range cloud.BadInstances {
		// We skip fake VMs and only delete "bad vms" if they were created more than 1h ago.
		if !vm.EmptyCluster && now.Sub(vm.CreatedAt) >= time.Hour {
			badVMs = append(badVMs, vm)
		}
	}

Ooooops completely missed the "not" in your last comment. Not adding it to the badVMs list does get rid of the error but it still never deletes it.

In ListCloud, when we find a VM with an error, we continue, which means it will never get added to the "good VMs" list, only the BadInstances one (i). When we add clusters to the destroy list (ii), we only add clusters from cloud.Clusters, the "good VMs" list.

I think the easiest way to solve this would probably just be to throw another !vm.EmptyCluster in there like (iii). It seems to do the trick in my limited testing, but it also defeats the purpose of appending those errors in the first place. But maybe that's fine since we're gonna delete it anyway and we can remove the parsing error check?

Code snippet (i):

// Anything with an error gets tossed into the BadInstances slice, and we'll correct
// the problem later on.
if len(v.Errors) > 0 {
	cloud.BadInstances = append(cloud.BadInstances, v)
	continue
}

Code snippet (ii):

for _, name := range names {
	c := cloud.Clusters[name]
	u := users[c.User]
	if u == nil {
		u = &status{}
		users[c.User] = u
	}
	s.add(c, now)
	u.add(c, now)
}

Code snippet (iii):

// Anything with an error gets tossed into the BadInstances slice, and we'll correct
// the problem later on.
if len(v.Errors) > 0 && !v.EmptyCluster {
	cloud.BadInstances = append(cloud.BadInstances, v)
	continue
}

Contributor

@renatolabs renatolabs left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

Ooooops completely missed the "not" in your last comment. Not adding it to the badVMs list does get rid of the error but it still never deletes it.

In ListCloud, when we find a VM with an error, we continue, which means it will never get added to the "good VMs" list, only the BadInstances one (i). When we add clusters to the destroy list (ii), we only add clusters from cloud.Clusters, the "good VMs" list.

I think the easiest way to solve this would probably just be to throw another !vm.EmptyCluster in there like (iii). It seems to do the trick in my limited testing, but it also defeats the purpose of appending those errors in the first place. But maybe that's fine since we're gonna delete it anyway and we can remove the parsing error check?

Aha! I see (sorry for the back and forth, it's the first time I'm reading GC and Azure code).

I think your suggestion makes sense -- avoid adding an error to the fake VM (with a comment explaining why) and skip fake VMs when aggregating badVMs. This definitely feels like a shoehorning exercise 😄

That said, the current badClusters approach is also fine by me, so feel free to keep as-is if you prefer.

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Aha! I see (sorry for the back and forth, it's the first time I'm reading GC and Azure code).

I think your suggestion makes sense -- avoid adding an error to the fake VM (with a comment explaining why) and skip fake VMs when aggregating badVMs. This definitely feels like a shoehorning exercise 😄

That said, the current badClusters approach is also fine by me, so feel free to keep as-is if you prefer.

No worries on the back and forth, helped expose a bug I completely missed 😃

After some more testing, I think the "avoid adding errors" approach is what I'm gonna go for, since it also lets you directly call roachprod destroy on the cluster, whereas with the badClusters approach you can't, since the empty cluster gets placed in the badInstances slice. At least not without more confusing fake VM logic, which I figure we wanna minimize.

Azure allows for clusters to exist with no attached VM, which
roachprod assumes is not possible. This would occur if azure.create
failed after creating a resource group but before creating a
VM. Roachprod GC only searches for VMs when searching for clusters,
so these VM-less clusters would never be found and deleted.

This change adds the ability to list empty clusters by querying
resource groups instead of VMs. This is used by Azure GC and
Azure destroy jobs to correctly identify and clean up these dangling
resources.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-3373
Release note: None
Fixes: cockroachdb#58668
Contributor

@renatolabs renatolabs left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @smg260 and @srosenberg)

@DarrylWong
Contributor Author

bors r=renatolabs

@craig
Contributor

craig bot commented Aug 31, 2023

Build succeeded:

@craig craig bot merged commit 06a506c into cockroachdb:master Aug 31, 2023
@DarrylWong
Contributor Author

blathers backport 23.1 22.2

@blathers-crl

blathers-crl bot commented Sep 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 8f90f93 to blathers/backport-release-23.1-109438: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


error creating merge commit from 8f90f93 to blathers/backport-release-22.2-109438: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
