Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bool in HelmChartSpec.items.spec.set can stall controller queue processing #172

Closed
brandond opened this issue Jan 26, 2023 · 8 comments
Closed

Comments

@brandond
Copy link
Member

From @Stargeras in rancher/rke2#3849:

I updated a set in one of my manifests to have a boolean value (true). It's weird that it affected updates to all of the HelmChart resources and not just the one with the boolean.
I do not see the Job getting updated when the HelmChart resource is updated. I don't see any changes in the kube-controller-manager pod logs. I do see something interesting in the rke2 server logs though. I see the event for the updated manifest but then I see the following error:

Jan 26 16:15:45 pserver24 rke2[30949]: I0126 16:15:45.104127 30949 event.go:294] "Event occurred" object="kube-system/rancher-chart" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="AppliedManifest" message="Applied manifest at "/var/lib/rancher/rke2/server/manifests/rancher-chart.yaml""
Jan 26 16:16:05 pserver24 rke2[30949]: W0126 16:16:05.361337 30949 reflector.go:324] k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.HelmChart: json: cannot unmarshal bool into Go struct field HelmChartSpec.items.spec.set of type int32
Jan 26 16:16:05 pserver24 rke2[30949]: E0126 16:16:05.361389 30949 reflector.go:138] k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.HelmChart: failed to list *v1.HelmChart: json: cannot unmarshal bool into Go struct field HelmChartSpec.items.spec.set of type int32
@dattito
Copy link

dattito commented Apr 16, 2023

Does anyone now how do unstall the controller? It can't do any other thing because of the stalling. Annoying. How can't this be fixed in such a long time?

@brandond
Copy link
Member Author

brandond commented Apr 17, 2023

Uninstall it from what?

It is not fixed because it is triggered by bad input from the user; the fix is to correct your manifest so that it does not have invalid field content. Adding an admission webhook to validate resources when they are created is more work than we current want to take on.

@gbonnefille
Copy link
Contributor

@brandond you are right, this is an issue from the user. Nevertheless, as far as I remember, the problem is that the whole HelmChart processing stop when such « syntax error » occurs, for any HelmChart. And the error message is accessible only on the /var/log/k3s.log.

For a user, it would be helpful if the error is reported on the HelmChart itself, via the Status or an event.

@brandond
Copy link
Member Author

brandond commented Apr 18, 2023

The error itself that stalls the queue is deep inside the Kubernetes code that handles unmarshalling the list/watch event from the channel; I suspect that fixing it on that side would be quite involved. The proper fix is probably to validate the resources before they're admitted to the cluster but that is also a lot of work.

it would be helpful if the error is reported on the HelmChart itself, via the Status or an event.

Yeah but we can't do that, as the resource is invalid. It couldn't be decoded so we don't know which one it is to attach an event to it.

@brandond
Copy link
Member Author

brandond commented Apr 18, 2023

Honestly, just having the controller generate CRDs a proper OpenAPI field spec instead of using x-kubernetes-preserve-unknown-fields: true would probably be the correct way to handle this.

@brandond
Copy link
Member Author

It looks like the standalone controller already did that, it's just K3s and RKE2 that register the CRDs without schema. I have a PR open to fix that.

@brandond
Copy link
Member Author

Closing here, as the standalone controller already enables schema validation via its CRDs. It is just that K3s/RKE2 do not create the CRDs with schema, so they allow invalid resources to be created. I have opened an issue and PR to address that.

@gbonnefille
Copy link
Contributor

Thanks @brandond for this analysis and report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants