
roachprod: azure gc cleans up empty clusters #109438

Merged 1 commit into cockroachdb:master on Aug 31, 2023

Conversation

DarrylWong
Contributor

Azure allows for clusters to exist with no attached VM, which roachprod assumes is not possible. This would occur if azure.create failed after creating a resource group but before creating a VM. Roachprod GC only searches for VMs when searching for clusters, so these VM-less clusters would never be found and cleaned up.

This change adds the ability to list empty clusters by querying resource groups instead of VMs. This is used by GC and destroy jobs to correctly identify and clean up these dangling resources.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-3373
Release note: None
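
As context for how that resource-group query might look, here is a minimal sketch (not the PR's actual code) using the legacy Azure SDK resources.GroupsClient and the vm.TagRoachprod tag that appear in snippets later in this thread; the function name listEmptyClusters and the foundClusters parameter are illustrative only:

// listEmptyClusters is a hypothetical helper: it lists roachprod-tagged
// resource groups and returns the ones for which no VM was found, so they
// can be surfaced to GC and destroy as empty clusters.
// Assumed imports: context, the legacy Azure SDK "resources" package, and
// roachprod's "vm" package.
func listEmptyClusters(
	ctx context.Context, subscriptionID string, foundClusters map[string]bool,
) ([]string, error) {
	groupsClient := resources.NewGroupsClient(subscriptionID)
	it, err := groupsClient.ListComplete(ctx, "" /* filter */, nil /* top */)
	if err != nil {
		return nil, err
	}
	var empty []string
	for it.NotDone() {
		rg := it.Value()
		// Skip resource groups that were not created by roachprod.
		if _, ok := rg.Tags[vm.TagRoachprod]; ok {
			// A roachprod resource group with no listed VMs is a dangling,
			// empty cluster.
			if name := *rg.Name; !foundClusters[name] {
				empty = append(empty, name)
			}
		}
		if err := it.NextWithContext(ctx); err != nil {
			return nil, err
		}
	}
	return empty, nil
}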

@cockroach-teamcity
Member

This change is Reviewable

@DarrylWong DarrylWong marked this pull request as ready for review August 24, 2023 16:53
@DarrylWong DarrylWong requested a review from a team as a code owner August 24, 2023 16:53
@DarrylWong DarrylWong requested review from srosenberg and smg260 and removed request for a team August 24, 2023 16:53
@DarrylWong DarrylWong linked an issue Aug 24, 2023 that may be closed by this pull request
Contributor

@renatolabs renatolabs left a comment

Nice work! Thanks for fixing this longstanding issue 🎉

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @smg260, and @srosenberg)


-- commits line 16 at r1:
Should we also include:

Fixes: #58668

?


pkg/roachprod/cloud/cluster_cloud.go line 237 at r1 (raw file):

	for _, c := range cloud.Clusters {
		if len(c.VMs) == 0 {
			return nil, errors.Errorf("found no VMs in Cluster %s", c.Name)

Nittest of nits: cluster (no need for capitalization)


pkg/roachprod/cloud/gc.go line 44 at r1 (raw file):

	exp := c.ExpiresAt()
	// Clusters without VMs shouldn't exist and are likely dangling resources.
	if c.VMs[0].EmptyCluster {

It wouldn't be a bad idea to make this check a method on Cluster to contain the knowledge of VMs[0].EmptyCluster. It's a bit of a workaround approach, so if we want to change how we deal with this situation later, the encapsulation will help.
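
For illustration, the suggested encapsulation could be as small as the following (the method name IsEmptyCluster is hypothetical, not necessarily what the PR ended up using):

// IsEmptyCluster reports whether this cluster consists only of the fake
// placeholder VM that providers create for VM-less resource groups.
func (c *Cluster) IsEmptyCluster() bool {
	return len(c.VMs) > 0 && c.VMs[0].EmptyCluster
}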


pkg/roachprod/vm/vm.go line 127 at r1 (raw file):

	// EmptyCluster indicates that the VM does not exist. Certain providers allow for empty
	// clusters, but roachprod does not allow VM-less clusters except when deleting them.
	// A fake VM may be used in this scenario.

Some comments on the doc:

  • would be good to call out that this only applies to Azure (instead of "certain providers")
  • a fake VM will be used in this case.
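
Putting both points together, the field doc might read roughly as follows (wording is only a suggestion):

// EmptyCluster indicates that the VM does not exist. Azure allows resource
// groups (clusters) to exist with no VMs, but roachprod does not allow
// VM-less clusters except when deleting them; a fake VM will be used in
// that scenario.
EmptyCluster bool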

pkg/roachprod/vm/azure/azure.go line 554 at r1 (raw file):

		// Clusters that have a VM were already found above. We just want
		// to add the VM-less clusters.
		foundClusters := make(map[string]bool)

Might as well keep track of these clusters while we're iterating over VMs in the loop above. The (very) small memory footprint is justified by the removal of this loop, IMO.
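
As a rough sketch of that suggestion (variable names illustrative), the existing VM loop could populate the map directly:

foundClusters := make(map[string]bool)
for _, v := range vms {
	// ... existing per-VM handling ...
	if clusterName, err := v.ClusterName(); err == nil {
		// Remember which clusters already have a real VM so the resource
		// group pass below only needs to add the VM-less ones.
		foundClusters[clusterName] = true
	}
}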


pkg/roachprod/vm/azure/azure.go line 558 at r1 (raw file):

			clusterName, err := v.ClusterName()
			if err != nil {
				v.Errors = append(v.Errors, vm.ErrInvalidName)

If you do decide to keep this loop, I don't think we need to do this; if anything, this is already done by the caller (ListCloud)


pkg/roachprod/vm/azure/azure.go line 564 at r1 (raw file):

		for it.NotDone() {
			found := it.Value()

Nit: maybe a better name would be resourceGroup? I see that the loop above calls it found, but I kept thinking "what was found again?" when reviewing the code.


pkg/roachprod/vm/azure/azure.go line 576 at r1 (raw file):

			}

			if _, ok := found.Tags[vm.TagRoachprod]; !ok {

We should probably move this check to the very top of the loop.


pkg/roachprod/vm/azure/azure.go line 585 at r1 (raw file):

			tags := make(map[string]string)
			for key, value := range found.Tags {
				tags[key] = *value

Is there any code that relies on these tags existing?


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

			} else {
				m.Errors = append(m.Errors, vm.ErrNoExpiration)
			}

In the case of IncludeEmptyClusters, these parse errors look to me like they should be hard errors (i.e., returned to the caller), since we'll rely on them during GC, no?

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


-- commits line 16 at r1:

Previously, renatolabs (Renato Costa) wrote…

Should we also include:

Fixes: #58668

?

Done.


pkg/roachprod/cloud/cluster_cloud.go line 237 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nittest of nits: cluster (no need for capitalization)

Done.


pkg/roachprod/cloud/gc.go line 44 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

It wouldn't be a bad idea to make this check a method on Cluster to contain the knowledge of VMs[0].EmptyCluster. It's a bit of a workaround approach, so if we want to change how we deal with this situation later, the encapsulation will help.

Done.


pkg/roachprod/vm/vm.go line 127 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Some comments on the doc:

  • would be good to call out that this only applies to Azure (instead of "certain providers")
  • a fake VM will be used in this case.

Done.


pkg/roachprod/vm/azure/azure.go line 554 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Might as well keep track of these clusters while we're iterating over VMs in the loop above. The (very) small memory footprint is justified by the removal of this loop, IMO.

Done.


pkg/roachprod/vm/azure/azure.go line 558 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

If you do decide to keep this loop, I don't think we need to do this; if anything, this is already done by the caller (ListCloud)

Done.


pkg/roachprod/vm/azure/azure.go line 576 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

We should probably move this check to the very top of the loop.

Done.


pkg/roachprod/vm/azure/azure.go line 585 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Is there any code that relies on these tags existing?

Nope, removed.


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

In the case of IncludeEmptyClusters, these parse errors look to me like they should be hard errors (i.e., returned to the caller), since we'll rely on them during GC, no?

If we make it a hard error won't GC error out too (since it calls ListCloud) and be unable to clear up the empty clusters anyway? It would be unable to clean up any clusters in general even if they have a correctly parsed expiration time just because one empty cluster doesn't. It would also remove the ability for the user to just call roachprod destroy directly on the cluster.

Would logging a warning telling the user that they might have to manually destroy the cluster be a better alternative?

Contributor

@renatolabs renatolabs left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

If we make it a hard error won't GC error out too (since it calls ListCloud) and be unable to clear up the empty clusters anyway? It would be unable to clean up any clusters in general even if they have a correctly parsed expiration time just because one empty cluster doesn't. It would also remove the ability for the user to just call roachprod destroy directly on the cluster.

Would logging a warning telling the user that they might have to manually destroy the cluster be a better alternative?

Yeah, you're right, there wouldn't be much we could do in this case. If we were indeed to hit this case, c.CreatedAt would be empty and GC would destroy the resource group (as it should).

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Yeah, you're right, there wouldn't be much we could do in this case. If we were indeed to hit this case, c.CreatedAt would be empty and GC would destroy the resource group (as it should).

Ahh that is something I didn't think to test, how GC would deal with malformed tags/vm names. Just tested it and it tries to delete each VM individually which obviously doesn't work in this case since the VM is fake for empty clusters. Ends up trying to delete a VM that doesn't exist and erroring out. Adding some logic to treat these as clusters.

Contributor

@renatolabs renatolabs left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

Ahh that is something I didn't think to test, how GC would deal with malformed tags/vm names. Just tested it and it tries to delete each VM individually which obviously doesn't work in this case since the VM is fake for empty clusters. Ends up trying to delete a VM that doesn't exist and erroring out. Adding some logic to treat these as clusters.

Not sure I get it -- why do we need the new badClusters logic? Wouldn't it be sufficient to not add the VM to badVMs when vm.EmptyCluster?

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Not sure I get it -- why do we need the new badClusters logic? Wouldn't it be sufficient to not add the VM to badVMs when vm.EmptyCluster?

We call [i] on badVMs, where p.Delete for azure is [ii].

When it calls client.Delete it errors out for empty clusters, since it expects a VM name to be passed in. Our VM in the case of an empty cluster is fake, so Azure won't find anything and will return an error.

I suppose the logic of checking if a VM is fake and creating a resource group client to delete it could be moved into azure.Delete if that makes it easier to understand. So maybe something like [iii]. This way we can just add the empty VMs to badVMs and have Delete deal with it.

I think the reason why this is only a badVMs GC issue is that this is the only place that tries to delete a single VM rather than the entire cluster.

Code snippet (i):

if len(badVMs) > 0 {
	// Destroy bad VMs.
	err := vm.FanOut(badVMs, func(p vm.Provider, vms vm.List) error {
		return p.Delete(l, vms)
	})
	if err != nil {
		postError(l, client, channel, err)
	}
}

Where p.Delete in azure is:

Code snippet (ii):

func (p *Provider) Delete(l *logger.Logger, vms vm.List) error {
	...
	client := compute.NewVirtualMachinesClient(*sub.ID)
	...
	for _, vm := range vms {
		...
		future, err := client.Delete(ctx, parts.resourceGroup, parts.resourceName, nil)
		...
	}
}

Code snippet (iii):

func (p *Provider) Delete(l *logger.Logger, vms vm.List) error {
	...
	client := compute.NewVirtualMachinesClient(*sub.ID)
	groupsClient := resources.NewGroupsClient(*sub.ID)
	...
	for _, vm := range vms {
		...
		if vm.EmptyCluster {
			future, err := groupsClient.Delete(ctx, resourceGroup)
		} else {
			future, err := client.Delete(ctx, parts.resourceGroup, parts.resourceName, nil)
		}
		...
	}
}

Contributor

@renatolabs renatolabs left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

We call [i] on badVMs, where p.Delete for azure is [ii].

When it calls client.Delete it errors out for empty clusters, since it expects a VM name to be passed in. Our VM in the case of an empty cluster is fake, so Azure won't find anything and will return an error.

I suppose the logic of checking if a VM is fake and creating a resource group client to delete it could be moved into azure.Delete if that makes it easier to understand. So maybe something like [iii]. This way we can just add the empty VMs to badVMs and have Delete deal with it.

I think the reason why this is only a badVMs GC issue is that this is the only place that tries to delete a single VM rather than the entire cluster.

I think I see what you're saying, but I'm still not sure I know the answer to my original question: if we just don't add the fake VM to the list of badVMs, wouldn't that be enough? We wouldn't attempt to delete that non-existent VM, and the resource group would ultimately be deleted by the existing logic of adding the cluster/resource group to the destroy list in *status.

E.g.:

	var badVMs vm.List
	for _, vm := range cloud.BadInstances {
		// We skip fake VMs and only delete "bad vms" if they were created more than 1h ago.
		if !vm.EmptyCluster && now.Sub(vm.CreatedAt) >= time.Hour {
			badVMs = append(badVMs, vm)
		}
	}

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I think I see what you're saying, but I'm still not sure I know the answer to my original question: if we just don't add the fake VM to the list of badVMs, wouldn't that be enough? We wouldn't attempt to delete that non-existent VM, and the resource group would ultimately be deleted by the existing logic of adding the cluster/resource group to the destroy list in *status.

E.g.:

	var badVMs vm.List
	for _, vm := range cloud.BadInstances {
		// We skip fake VMs and only delete "bad vms" if they were created more than 1h ago.
		if !vm.EmptyCluster && now.Sub(vm.CreatedAt) >= time.Hour {
			badVMs = append(badVMs, vm)
		}
	}

Ooooops completely missed the "not" in your last comment. Not adding it to the badVMs list does get rid of the error but it still never deletes it.

In ListCloud, when we find a VM with an error, we continue, which means it will never get added to the "good VMs" list, only the BadInstances one (i). When we add clusters to the destroy list (ii), we only add clusters from cloud.Clusters, the "good VMs" list.

I think the easiest way to solve this would probably just be to throw another !vm.EmptyCluster in there like (iii). It seems to do the trick in my limited testing, but it also defeats the purpose of appending those errors in the first place. But maybe that's fine since we're gonna delete it anyway and we can remove the parsing error check?

Code snippet (i):

// Anything with an error gets tossed into the BadInstances slice, and we'll correct
// the problem later on.
if len(v.Errors) > 0 {
	cloud.BadInstances = append(cloud.BadInstances, v)
	continue
}

Code snippet (ii):

for _, name := range names {
	c := cloud.Clusters[name]
	u := users[c.User]
	if u == nil {
		u = &status{}
		users[c.User] = u
	}
	s.add(c, now)
	u.add(c, now)
}

Code snippet (iii):

// Anything with an error gets tossed into the BadInstances slice, and we'll correct
// the problem later on.
if len(v.Errors) > 0 && !v.EmptyCluster {
	cloud.BadInstances = append(cloud.BadInstances, v)
	continue
}

Contributor

@renatolabs renatolabs left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DarrylWong, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, DarrylWong wrote…

Ooooops completely missed the "not" in your last comment. Not adding it to the badVMs list does get rid of the error but it still never deletes it.

In ListCloud, when we find a VM with an error, we continue, which means it will never get added to the "good VMs" list, only the BadInstances one (i). When we add clusters to the destroy list (ii), we only add clusters from cloud.Clusters, the "good VMs" list.

I think the easiest way to solve this would probably just be to throw another !vm.EmptyCluster in there like (iii). It seems to do the trick in my limited testing, but it also defeats the purpose of appending those errors in the first place. But maybe that's fine since we're gonna delete it anyway and we can remove the parsing error check?

Aha! I see (sorry for the back and forth, it's the first time I'm reading GC and Azure code).

I think your suggestion makes sense -- avoid adding an error to the fake VM (with a comment explaining why) and skip fake VMs when aggregating badVMs. This definitely feels like a shoehorning exercise 😄

That said, the current badClusters approach is also fine by me, so feel free to keep as-is if you prefer.

Contributor Author

@DarrylWong DarrylWong left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/roachprod/vm/azure/azure.go line 620 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Aha! I see (sorry for the back and forth, it's the first time I'm reading GC and Azure code).

I think your suggestion makes sense -- avoid adding an error to the fake VM (with a comment explaining why) and skip fake VMs when aggregating badVMs. This definitely feels like a shoehorning exercise 😄

That said, the current badClusters approach is also fine by me, so feel free to keep as-is if you prefer.

No worries on the back and forth, helped expose a bug I completely missed 😃

After some more testing, I think the "avoid adding errors" approach is what I'm gonna go for, since it also lets you directly call roachprod destroy on the cluster, whereas with the badClusters approach you can't, since the empty cluster gets placed in the badInstances slice. At least not without more confusing fake VM logic, which I figure we wanna minimize.

Azure allows for clusters to exist with no attached VM, which
roachprod assumes is not possible. This would occur if azure.create
failed after creating a resource group but before creating a
VM. Roachprod GC only searches for VMs when searching for clusters,
so these VM-less clusters would never be found and deleted.

This change adds the ability to list empty clusters by querying
resource groups instead of VMs. This is used by Azure GC and
Azure destroy jobs to correctly identify and clean up these dangling
resources.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-3373
Release note: None
Fixes: cockroachdb#58668
Contributor

@renatolabs renatolabs left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @smg260 and @srosenberg)

@DarrylWong
Contributor Author

bors r=renatolabs

@craig
Contributor

craig bot commented Aug 31, 2023

Build succeeded:

@craig craig bot merged commit 06a506c into cockroachdb:master Aug 31, 2023
@DarrylWong
Contributor Author

blathers backport 23.1 22.2

@blathers-crl

blathers-crl bot commented Sep 1, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 8f90f93 to blathers/backport-release-23.1-109438: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


error creating merge commit from 8f90f93 to blathers/backport-release-22.2-109438: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
