
[kube-prometheus-stack] update kps to v0.65.1 #3267

Closed
wants to merge 15 commits

Conversation

QuentinBisson
Member

@QuentinBisson QuentinBisson commented Apr 20, 2023

What this PR does / why we need it

Upgrades prometheus-operator to v0.65.1

Which issue this PR fixes

(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged)

Special notes for your reviewer

Checklist

  • DCO signed
  • Chart Version bumped
  • Title of the PR starts with chart name (e.g. [prometheus-couchdb-exporter])

@monotek and @GMartinez-Sisti, as this is my first major release, I'd like it if you could take a look, especially as this one adds a new CRD.

Signed-off-by: QuentinBisson <[email protected]>
Signed-off-by: QuentinBisson <[email protected]>
@QuentinBisson
Member Author

Well this will not be fun to fix:

Error: INSTALLATION FAILED: create: failed to create: Secret "sh.helm.release.v1.kube-prometheus-stack-l5sr2p6xtu.v1" is invalid: data: Too long: must have at most 1048576 bytes

@QuentinBisson
Member Author

So the issue is coming from the newer PrometheusAgent CRD, which makes the helm release secret too big.

The only solution I can think of is to have the CRDs outside the chart, but maybe you can think of a better solution?

If that is fine with you, I was thinking of doing something similar to https://github.com/cowboysysop/charts/tree/master/charts/vertical-pod-autoscaler/templates/crds: a job that pulls and applies the CRDs, and would then also keep them updated.
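For illustration, here is a minimal sketch of what such a CRD-apply hook Job could look like (the image, the service account name, and the single prometheusagents URL are illustrative assumptions, not the final implementation):

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-upgrade-crds
  annotations:
    # Run before the chart's own resources are applied so the CRDs already exist.
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      restartPolicy: OnFailure
      # This service account needs RBAC allowing create/patch on customresourcedefinitions.
      serviceAccountName: {{ .Release.Name }}-upgrade-crds
      containers:
        - name: kubectl
          image: docker.io/bitnami/kubectl:1.27   # illustrative image
          args:
            - apply
            - --server-side
            - -f
            - https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.65.1/example/prometheus-operator-crd/monitoring.coreos.com_prometheusagents.yaml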

@zeritti
Member

zeritti commented Apr 20, 2023

The only solution I can think of is to have the CRDs outside the chart, but maybe you can think of a better solution?

An immediate/short-term solution might be using the stripped-down CRDs released by the Prometheus Operator team. These do not include description fields, so the manifests are considerably smaller.
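For reference, a hedged example of applying those manifests out of band, assuming the stripped-down bundle published with the v0.65.1 release keeps its usual asset name:

kubectl apply --server-side -f https://github.com/prometheus-operator/prometheus-operator/releases/download/v0.65.1/stripped-down-crds.yaml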

@QuentinBisson
Member Author

QuentinBisson commented Apr 20, 2023

That could work, but it would reduce functionality, as users would not be able to use kubectl explain on the CRDs, and, as you mention, it is only short-term. With the upcoming ScrapeConfig and maybe Silence CRDs in prometheus-operator, I know this issue will happen again "soon", so I would rather we have a proper fix :)

I also tend to think the job will help, as CRD upgrades will then be handled by the helm chart :)

@zeritti
Member

zeritti commented Apr 20, 2023

Yes, a job would do. Keeping the CRDs out of the chart seems to be the only fix in this respect.

@GMartinez-Sisti
Member

Related #1876

Member

@monotek monotek left a comment


Can you please check the failing CI?

@QuentinBisson
Member Author

@monotek I'm trying to fix it, but the issue is that the new PrometheusAgent CRD makes us go over the 1 MB secret size limit of Kubernetes.

@QuentinBisson
Member Author

@GMartinez-Sisti and @monotek I tested this on existing clusters, but this is quite a big change, so I won't merge without newer approvals.

Signed-off-by: Quentin Bisson <[email protected]>
- --server-side
- --force-conflicts={{ $.Values.crds.forceConflicts }}
- -f
- https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/{{ $appVersion }}/example/prometheus-operator-crd/monitoring.coreos.com_{{ . }}.yaml
Member


I would also make this URL customisable, as some environments might not have external internet access, and these manifests can easily be checked into an internal artefact repository.
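A hedged sketch of how the values could expose that; only crds.forceConflicts exists in this PR, the other keys are hypothetical:

crds:
  enabled: true
  forceConflicts: false
  # Hypothetical key: base URL the hook job downloads the CRD manifests from.
  # Point this at an internal artefact repository in air-gapped environments.
  sourceBaseUrl: https://raw.githubusercontent.com/prometheus-operator/prometheus-operator

The job argument above would then be rendered from it, e.g. {{ $.Values.crds.sourceBaseUrl }}/{{ $appVersion }}/example/prometheus-operator-crd/monitoring.coreos.com_{{ . }}.yaml.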

@GMartinez-Sisti
Member

@GMartinez-Sisti and @monotek I tested this on existing clusters, but this is quite a big change, so I won't merge without newer approvals.

Ideally we would have CI automation to deploy the version on main and then the PR one over it. I can test this, but it would take a couple of days. The logic looks good though; I like this approach 💪.

Left some comments.

Signed-off-by: QuentinBisson <[email protected]>
## This defaults to non root user with uid 2000 and gid 2000. *v1.PodSecurityContext false
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
##
securityContext:
Contributor


If this could be made compatible with the restricted Pod Security profile by default, that would be great.
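A minimal sketch of pod- and container-level settings that satisfy the restricted Pod Security Standard; the values and the containerSecurityContext key name are illustrative, not what this PR ships:

securityContext:            # pod-level (v1.PodSecurityContext)
  runAsNonRoot: true
  runAsUser: 2000
  runAsGroup: 2000
  seccompProfile:
    type: RuntimeDefault
containerSecurityContext:   # container-level, hypothetical key name
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL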

Signed-off-by: QuentinBisson <[email protected]>
Signed-off-by: Quentin Bisson <[email protected]>
@QuentinBisson
Member Author

We really need to reach a consensus here because we need to bump the major version to avoid issues with prometheus-operator, which was upgraded in another PR.

Either we:

  • Keep the CRD folder but do not add the newer CRDs, as their absence does not cause prometheus-operator to fail
  • Delete the CRD folder (fix the test) and explain to users that they need to install the CRDs themselves
  • Add the CRD job present in this branch
  • Move something else into another chart (dashboards, rules, ...)?
  • Find another way forward.

@monotek I know you do not like that approach, but do you have another one we can go with? I would really like some help to move this forward :(

@rouke-broersma
Contributor

We really need to reach a consensus here because we need to bump the major version to avoid issues with prometheus-operator, which was upgraded in another PR.

Personally, I'm fine with any solution because I already don't use this chart for CRDs. I would say that moving the dashboards out of this chart is probably not an alternative but rather a necessary next step, because I imagine that sooner or later this limit will be hit again if the dashboards remain in this chart.

@zeritti
Member

zeritti commented May 16, 2023

I'd suggest adding thanosrulers/status to the resources in the operator's clusterrole template; the subresource is new in 0.65.0.
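For context, the relevant rule in the operator's clusterrole would then look roughly like this (other monitoring.coreos.com resources omitted):

- apiGroups:
    - monitoring.coreos.com
  resources:
    - thanosrulers
    - thanosrulers/finalizers
    - thanosrulers/status   # status subresource, new in prometheus-operator 0.65.0
  verbs:
    - '*'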

@onedr0p
Contributor

onedr0p commented May 16, 2023

One thing to note is that Argo doesn't implement helm properly, meaning the crds option currently present in this PR is something that Argo folks cannot use, because the CRDs are applied with a helm hook. I'm not using Argo, so I'm not sure how important that is, or whether Argo users have a different means of managing CRDs that can keep up to date with this helm chart.

@rouke-broersma
Contributor

One thing to note is that Argo doesn't implement helm properly, meaning the crds option currently present in this PR is something that Argo folks cannot use, because the CRDs are applied with a helm hook. I'm not using Argo, so I'm not sure how important that is, or whether Argo users have a different means of managing CRDs that can keep up to date with this helm chart.

Argo converts helm hooks to Argo hooks: pre-install and pre-upgrade become 'PreSync', which should be fine in this case, I believe.

@xiu
Contributor

xiu commented May 19, 2023

We really need to reach a consensus here because we need to bump the major version to avoid issues with prometheus-operator, which was upgraded in another PR.

Keep the CRD folder but do not add the newer CRDs, as their absence does not cause prometheus-operator to fail

This could be an option, as prometheus-operator doesn't need either ScrapeConfig or PrometheusAgent to start. It just pushes the problem further into the future, though.

Delete the CRD folder (fix the test) and explain to users that they need to install the CRDs themselves

This would be my preferred option to keep things simple. People can use the prometheus-operator-crds chart to install the CRDs.
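For example, a hedged sketch of installing the CRDs separately with that chart (repo alias assumed):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-operator-crds prometheus-community/prometheus-operator-crds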

@jkroepke
Member

I'm for deleting the CRDs from kube-prometheus-stack and shifting the maintenance to the prometheus-operator-crds helm chart. This also resolves the issue that kubectl apply needs to be run manually on major updates.

However, please keep in mind that removing the CRDs from the crds/ folder does not remove the CRDs from the cluster.

This can cause a conflict where prometheus-operator-crds can't be installed on existing clusters. We have to add documentation on how to "migrate" the CRDs from kube-prometheus-stack to prometheus-operator-crds, for example that the CRDs need to be annotated with meta.helm.sh/release-name: <RELEASE_NAME> and meta.helm.sh/release-namespace: <RELEASE_NAMESPACE>
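A hedged example of that migration for a single CRD; the release name and namespace below are placeholders:

kubectl annotate crd prometheuses.monitoring.coreos.com \
  meta.helm.sh/release-name=prometheus-operator-crds \
  meta.helm.sh/release-namespace=monitoring --overwrite
kubectl label crd prometheuses.monitoring.coreos.com \
  app.kubernetes.io/managed-by=Helm --overwrite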

@onedr0p
Contributor

onedr0p commented May 22, 2023

@jkroepke I am not happy with that solution because of what I mentioned here[1] and my preceding 3 comments. Flux is able to manage the CRDs just fine in the special crds folder.

[1] #3267 (comment)

@jkroepke
Member

jkroepke commented May 22, 2023

And what about non-Flux users? I would personally not care about ArgoCD or Flux here. It is nice to see that the GitOps operators have some special features to just fix helm, but we should look out for typical helm users. As far as I know, ArgoCD solves the CRD issue with helm template | kubectl apply.

@onedr0p you can create an umbrella chart to bundle both helm charts to get the old behavior back.

@onedr0p
Contributor

onedr0p commented May 22, 2023

And what about non-Flux users?

The documentation stays the same: kubectl apply the CRDs on upgrade. I think we've found the major issue here, which is that the Grafana dashboards take up the largest share of the size[1].

[1] #3267 (comment)

Even if we take out the CRDs, leaving the dashboards in this chart will only lead to this issue again down the road as more things are added.

@jkroepke
Member

I think we've found the major issue here, which is that the Grafana dashboards take up the largest share of the size[1].

Yeah. I created #3416 as a POC which minifies all the dashboard JSON files.

From

1.6M    ../templates/grafana/dashboards-1.14

down to

688K    ../templates/grafana/dashboards-1.14

which potentially saves about 1 MB of data (or 1 MB of pure whitespace)...

@rouke-broersma
Contributor

I think we've found the major issue here, which is that the Grafana dashboards take up the largest share of the size[1].

Yeah. I created #3416 as a POC which minifies all the dashboard JSON files.

From

1.6M    ../templates/grafana/dashboards-1.14

down to

688K    ../templates/grafana/dashboards-1.14

which potentially saves about 1 MB of data (or 1 MB of pure whitespace)...

#3267 (comment)

Helm already compresses all manifests inside the release secret, so I'm not sure whether minifying the JSON helps at all. It needs to be properly tested.

@jkroepke
Member

Helm already compresses all manifests inside the release secret, so I'm not sure whether minifying the JSON helps at all. It needs to be properly tested.

Everyone is invited to test #3416; the CI is at least green.

@rouke-broersma
Contributor

Helm already compresses all manifests inside the release secret, so I'm not sure whether minifying the JSON helps at all. It needs to be properly tested.

Everyone is invited to test #3416; the CI is at least green.

Ah, sorry, I missed that you had posted the secret sizes; looks good! Though 979K is still dangerously close to the limit: a couple more CRDs (which some have commented are already planned) and we're back there.

@jkroepke
Member

I agree that this is not a long-term solution. We may need to discuss how to proceed. However, there is a short-term fix for now so we can proceed as usual.

I do not really understand why CRDs are stored in a helm release when helm does not manage them. The helm guys are crazy.

@jkroepke
Member

By the way, on another test branch:

Minifying the CRDs (converting them from YAML to JSON) would also reduce the size to 811K in total.

% kubectl get secrets sh.helm.release.v1.kube-prometheus-stack.v1 -o jsonpath='{.data.release}' | base64 -d > release
% ls -lah release                                                                                                    
-rw-r--r--@ 1 jok  staff   811K May 23 09:11 release

While this is fine from the Helm/Kubernetes point of view, I'm not aware of the downsides of delivering JSON manifests.
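A minimal sketch of how such a conversion could be scripted with mikefarah's yq; the crds/ path is an assumption:

for f in charts/kube-prometheus-stack/crds/*.yaml; do
  # Emit compact JSON (indent 0) next to the YAML source, then drop the YAML file.
  yq -o=json -I=0 "$f" > "${f%.yaml}.json" && rm "$f"
done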

@QuentinBisson
Member Author

This will be closed in favor of #3416, but I will keep it open until this is turned into an issue to define the way forward.

Member

@monotek monotek left a comment


set to "request changes" to block accidental merging it as changes are already implemented in #3416

@jkroepke
Member

jkroepke commented Jun 2, 2023

Hi there, would it make sense to create an issue instead of discussing this in a PR? It feels like this topic is in sleep mode for now. I'm interested in finding a solution before we hit the same issue with new CRDs again.

@QuentinBisson
Member Author

@jkroepke yes it would, but I did not have the bandwidth to write the issue down yet; you can do it if you want :)

@jkroepke
Member

jkroepke commented Jul 3, 2023

@jkroepke yes it would, but I did not have the bandwidth to write the issue down yet; you can do it if you want :)

#3548 created. @QuentinBisson

@jkroepke jkroepke closed this Jul 3, 2023
@QuentinBisson
Member Author

Thank you @jkroepke :)


Successfully merging this pull request may close these issues.

[kube-prometheus-stack] update to 0.64
10 participants