-
Notifications
You must be signed in to change notification settings - Fork 41
Improve release name uniqueness and ownership #57
Improve release name uniqueness and ownership #57
Conversation
helm-app-operator/pkg/helm/helm.go
Outdated
if err != nil { | ||
return r, needsUpdate, fmt.Errorf("install error: %s", err) | ||
} | ||
needsUpdate = true | ||
logrus.Infof("Installed release for %s release=%s", ResourceString(r), updatedRelease.GetName()) | ||
|
||
} else if status.Release == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joelanford I'm wondering if this scenario is possible:
- There is no deployed release, so we install the release
c.installRelease()
- We fail to update the
status.Release
- Retry the reconcile loop and this time
c.storageBackend.Deployed()
returnslatestRelease, nil
. status.Release==nil
from our last failed status update and we return an error thinking this release is not owned by the CR(even though it actually is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that definitely sounds possible. Off the top of my head, I can think of a few things we can try:
- If we fail to update, try to rollback the release to whatever the previous
status.Release
was, or uninstall if it was nil. Depending on the error, the rollback may also fail. - Rather than using
c.storageBackend.Deployed()
as the latestRelease, usestatus.Release
. - Keep an in-memory map of resource UUID to release name.
- See if there's a way to store the resource UUID in the release (maybe with the chart values), such that
c.storageBackend.Deployed()
would "know" its owner.
Each of these may cause other interesting failure scenarios though, so I'd have to experiment to know whether any of these are feasible.
Any other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See if there's a way to store the resource UUID in the release (maybe with the chart values), such that c.storageBackend.Deployed() would "know" its owner.
Turns out it's pretty easy to add an arbitrary config value to the release, which gets saved with the release in the storage backend. It seems like this is probably more robust than the other options. The latest commit changes the ownership checks to use this method instead.
helm-app-operator/pkg/helm/helm.go
Outdated
return fmt.Sprintf("%s-%s", operatorName, r.GetName()) | ||
func getReleaseName(r *unstructured.Unstructured) string { | ||
status := v1alpha1.StatusFor(r) | ||
if status.Release != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joelanford The implementation seems fine to me but I'm thinking we should try our best to avoid relying on Status for our control logic. That seems to be the upstream convention as well.
https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#spec-and-status
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/principles.md#api
Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation.
The valueResourceUID
is useful in letting us verify the correct owner of a pre-existing release.
But we still end up storing the release name in the Status to look up again in the future.
Right now with this implementation what happens if:
- We deploy a new release with no release name, which generates a new unique release name e.g "foo"
c.installRelease(tiller, r.GetNamespace(), "", chart, config)
- Fail to update the status with
updatedRelease=foo
- Retry the reconcile. See that
status.Release
is empty from our last failed attempt. - Retry deploying the release with no release name
c.installRelease(tiller, r.GetNamespace(), "", chart, config)
- Does that give us a new deployed release with a new unique name
updatedRelease=bar
?
Ideally we could make the UUID part of the release name e.g cr.name+UUID
to make it fully unique which avoids the problem of saving the release name since we can reconstruct it just by looking at the CR. And that solves the ownership problem.
But there's the issue of the 53 char limit. I know UUID's are 36 chars so we could stay within the limit but I'm wondering how human readable do release names need to be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right. That would result in a separate new release.
Thanks for linking the conventions and principles. That's good info!
I did think about using UUIDs in the release name, but there are some downsides:
- As you point out, UUIDs would take up 36 of the 53 characters, leaving only 17 if we wanted to include the CR name. That may be cutting it a little too close for comfort.
- Release names end up in the generated resource names for all helm chart resources, so users would end up with really long names for everything managed by the helm operator.
We could do something like cr.Name + last_N_chars(cr.UUID)
, where N is < 36. Assuming the UUIDs are uniformally random hex characters, the probability of collision would likely be sufficiently low at N=8.
If we want better human readability, another option could be that we add the release name to the CR (somewhere in metadata
or spec
maybe?) prior to any attempt to invoke the release installation, update, etc. That way a CR never results in any deployed release until it has been updated with a release name.
Increases uniqueness of release name while also minimizing complexity that would arise with generating and tracking release name ownership across CR updates and reconciliation. Removes release name annotations, since it would be unnecessarily complex to handle a release name change due to a new or updated annotation value.
Reverted ownership checks and updated I also removed the annotations since it would be difficult to handle release name changes in the case where the annotations are added or updated after a release has been made. |
helm-app-operator/pkg/helm/helm.go
Outdated
func releaseName(r *unstructured.Unstructured) string { | ||
return fmt.Sprintf("%s-%s", operatorName, r.GetName()) | ||
func getReleaseName(r *unstructured.Unstructured) string { | ||
shortUID := strings.Replace(string(r.GetUID()), "-", "", -1)[:8] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although perhaps unlikely we could still get collisions with just the last 8 chars of the UUID.
I think some parts of the UUID are usually constant based on the implementation.
We could reduce those odds by base64 encoding the UUID from 32 chars to 22 chars.
E.g https://gist.github.com/hasbro17/201d727ca6402c067dd399a8a059c32c
based on https://medium.com/@ablab/url-safe-uuid-encoding-in-go-4ba3b4ce4206
Or hashing the 32 chars(minus 4 hyphens) down to 8 or 10 chars. That might be a little safer than just lopping off the last 8 chars.
So we can explore the best option but at least add a TODO for now on how we could improve the short UID generation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like Kubernetes uses UUID v1, which uses the a combination of the clock and MAC address of the host:
- https://github.com/kubernetes/apimachinery/blob/master/pkg/util/uuid/uuid.go#L33
- https://github.com/pborman/uuid/blob/master/version1.go#L11-L23
I spun up a deployment with two replicas and compared their UIDs:
$ kubectl get pods -o jsonpath={..metadata.uid} -l release=app1-bf7534ac | tr " " "\n"
bf852ded-d7ea-11e8-ab5b-88c3b95d461f
bf85debb-d7ea-11e8-ab5b-88c3b95d461f
Only the first 8 chars differ, so my initial stab (actually using the first 8, not last 8) was complete luck. I'm thinking this is worth fixing before it's integrated into the SDK so we don't have to deal with users' release names changing when they upgrade their base helm operator image.
I like base64 idea, so I'll give that a go. That would give users 31 characters for the CR name, which seems like enough. I'll test what happens with a long CR name and add an additional check if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out k8s doesn't like uppercase letters in resource names, so the base64 approach didn't work:
ERRO[0003] failed to reconcile release for apiVersion=apache.org/v1alpha1 kind=Tomcat name=default/app1: install error: release app1-v3U0rNfqEeirW4jDuV1GHw failed: Service "app1-v3U0rNfqEeirW4jDuV1GHw-tomcat" is invalid: metadata.name: Invalid value: "app1-v3U0rNfqEeirW4jDuV1GHw-tomcat": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')
I switched to a base36 encoding (lowercase letters and digits 0-9), which works. It increases the length to 25 characters, so still not too bad.
INFO[0235] Reconciled release for apiVersion=apache.org/v1alpha1 kind=Tomcat name=default/app1 release=app1-41arrw2nar998e3vumtsue9lr
The resulting resource names are a bit long, but maybe still within the realm of reasonableness:
$ kubectl get all -l release=app1-41arrw2nar998e3vumtsue9lr
NAME READY STATUS RESTARTS AGE
pod/app1-41arrw2nar998e3vumtsue9lr-tomcat-75b55c8855-9frcr 2/2 Running 0 5m
pod/app1-41arrw2nar998e3vumtsue9lr-tomcat-75b55c8855-qv966 0/2 Pending 0 5m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/app1-41arrw2nar998e3vumtsue9lr-tomcat ClusterIP 10.109.202.233 <none> 80/TCP 5m
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.apps/app1-41arrw2nar998e3vumtsue9lr-tomcat 2 2 2 1 5m
NAME DESIRED CURRENT READY AGE
replicaset.apps/app1-41arrw2nar998e3vumtsue9lr-tomcat-75b55c8855 2 2 1 5m
For release names that are too long, the operator prints a helpful error message:
ERRO[0000] failed to reconcile release for apiVersion=apache.org/v1alpha1 kind=Tomcat name=default/my-super-long-name-for-a-tomcat: install error: release name "my-super-long-name-for-a-tomcat-86jzwwlaodwvoiarjie4v87bz" exceeds max length of 53
As a separate bit of work, I don't think we're actually updating the CR status when we fail, so I'll add an issue for that.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes let's follow up on the status update in a separate PR/issue.
And there might be a way to shorten the UUID further maybe by hashing it. But this is good enough for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after nit
Description of change
This PR updates the way the release name is determined for a CR, first by checking new annotations that allow users to customize the release name. If no annotations are set, it falls back to using a tiller-provided release name, which is guaranteed to be unique.
The PR also adds checks to ensure that a CR "owns" a release name before making changes for that release. This prevents two CRs from updating each others releases and prevents a CR from uninstalling another CR's release.
Motivation for change
See #49 for details