Prow monitoring through kube-prometheus #928

Open · wants to merge 8 commits into main from lentzi90/prow-kube-prometheus
Conversation

@lentzi90 (Member) commented Dec 13, 2024

This adds the kube-prometheus stack to our prow cluster and configures it to monitor prowjobs on top of all that it does out of the box.

I have purposefully not squashed the commits to make it easier to follow the process, but I can of course squash if that is preferred. The most interesting part is probably to compare the initial commit, which has example.jsonnet unmodified, with the changes that come later.

Another thing to note when reviewing: everything under manifests is rendered. I include it in the repository because it is what we apply in the cluster and because it is far easier to understand than the jsonnet code. It also shows clearly what the changes to the jsonnet code do when comparing commits.

We should probably discuss if we want to keep the jsonnet code at all or if we just want to continue from the rendered manifests. We would then no longer be able to consume the upstream kubernetes-mixins and kube-prometheus, but we get one less tool and language to keep track of. It looks like k8s.io has chosen this route.
In any case, I think we can merge it with the jsonnet first, before we decide how to handle it. It will be a good historic record at least that shows where it came from.
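
As a rough sketch of how such a jsonnet entry point looks (this is modeled on the upstream kube-prometheus example.jsonnet; the names and namespace here are illustrative and may differ from the actual code in this PR):

```jsonnet
// Hypothetical sketch based on the upstream kube-prometheus example.jsonnet.
local kp = (import 'kube-prometheus/main.libsonnet') + {
  values+:: {
    common+: { namespace: 'monitoring' },
  },
};

// Render components into one manifest map; the build script typically
// writes each entry out as manifests/<name>.yaml.
{ ['setup/' + name]: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ [name]: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ [name]: kp.grafana[name] for name in std.objectFields(kp.grafana) }
```

Dropping the jsonnet would mean editing the rendered YAML under manifests directly instead of regenerating it from an entry point like this.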

A couple of "controversial" or notable points:

  • All in-cluster traffic is allowed on the OpenStack level. We did not need this before, so it was disabled. Now prometheus needs to reach node-exporter on every node. We can limit this further if necessary, but I do believe this is quite a common policy.
  • Grafana is configured with anonymous auth and everyone has the viewer role. This is the same as k8s.io. The point is to make grafana practically read-only.
  • Despite the above point, I have not yet exposed grafana. Should I?
  • The prowjob dashboard is a bit outdated, but it is the same that is used in the k8s.io instances. We have a newer version of grafana though, that complains about it. Let's address this in the future.
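
For reference, the Grafana settings described above (anonymous viewer access, no signup or login forms, no gravatar) typically map to grafana.ini options like these; a hedged sketch, since the PR sets them through the jsonnet config rather than a raw ini file:

```ini
; Sketch of the relevant grafana.ini sections; exact values in the PR may differ.
[auth.anonymous]
enabled = true
org_role = Viewer

[auth]
disable_login_form = true

[users]
allow_sign_up = false

[security]
disable_gravatar = true
```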

Fixes: #896

Add initial kube-prometheus setup based on example.jsonnet.

Signed-off-by: Lennart Jern <[email protected]>
We have no use for it at the moment, so let's simplify and keep things
minimal.

Signed-off-by: Lennart Jern <[email protected]>
Make all monitoring components run on infra nodes where possible.
Allow all in-cluster traffic so that prometheus can scrape node-exporter
on other nodes than where prometheus itself is running.

Signed-off-by: Lennart Jern <[email protected]>
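
Pinning components to infra nodes is usually done with a nodeSelector and matching tolerations if the nodes are tainted; a minimal sketch, assuming an illustrative `node-role.kubernetes.io/infra` label and taint (the actual names in the cluster may differ):

```yaml
# Hypothetical pod template fragment; label and taint names are illustrative.
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
        - key: node-role.kubernetes.io/infra
          operator: Exists
          effect: NoSchedule
```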
Configure prometheus to retain data for 30 days and give it persistent
volumes to do so.

Signed-off-by: Lennart Jern <[email protected]>
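
On the Prometheus custom resource this maps to the `retention` and `storage` fields; a minimal sketch (the name, storage class, and volume size here are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s           # illustrative; kube-prometheus's default name
  namespace: monitoring
spec:
  retention: 30d      # keep data for 30 days
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard  # illustrative
        resources:
          requests:
            storage: 50Gi           # illustrative
```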
This adds prometheusrules for scraping (and labeling) prowjob metrics.
It also adds the necessary labels to the allow list for
kube-state-metrics and adds a dashboard for the prowjob metrics.

Signed-off-by: Lennart Jern <[email protected]>
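
The kube-state-metrics allow list mentioned here refers to its `--metric-labels-allowlist` flag, which controls which pod labels are exported as metric labels; a hedged sketch of the container argument (the prowjob label names are illustrative):

```yaml
# Fragment of the kube-state-metrics container spec; label names are illustrative.
args:
  - --metric-labels-allowlist=pods=[created-by-prow,prow.k8s.io/job,prow.k8s.io/type]
```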
Make it easier to build the manifests by running the build script in a
container and by automatically installing the needed go packages.

Signed-off-by: Lennart Jern <[email protected]>
Make everyone viewer, disable signup and login forms.
Also disable gravatar.

Signed-off-by: Lennart Jern <[email protected]>
@metal3-io-bot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lentzi90. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Dec 13, 2024
Add short readme for the monitoring stack and link to it from the main
prow readme.

Signed-off-by: Lennart Jern <[email protected]>
@lentzi90 force-pushed the lentzi90/prow-kube-prometheus branch from 1685fae to 204a910 on December 13, 2024 at 09:20
@lentzi90 (Member, Author)

/cc @tuminoid

@tuminoid (Member)

> /cc @tuminoid

Thanks @lentzi90, I'll take a look. 99 files changed, might take a while :)

> This adds the kube-prometheus stack to our prow cluster and configures it to monitor prowjobs on top of all that it does out of the box.

> I have purposefully not squashed the commits to make it easier to follow the process but I can of course squash if that is preferred. The most interesting part is probably to compare the initial commit, that has the example.jsonnet unmodified, with the changes that comes later.

This is certainly a case where we don't want squashed commits. It is indeed very helpful to have them in the way you committed them.

> Another thing to note when reviewing: everything under manifests is rendered. I include it in the repository because it is what we apply in the cluster and because it is far easier to understand than the jsonnet code. It also shows clearly what the changes to the jsonnet code does when comparing commits.

> We should probably discuss if we want to keep the jsonnet code at all or if we just want to continue from the rendered manifests. We would then no longer be able to consume the upstream kubernetes-mixins and kube-prometheus, but we get one less tool and language to keep track of. It looks like k8s.io has chosen this route. In any case, I think we can merge it with the jsonnet first, before we decide how to handle it. It will be a good historic record at least that shows where it came from.

I agree, we need the historic portion at least. Initially I dislike completely detaching, but I'll comment further on this after reading the actual code. I can see the point of going with the rendered manifests, though.

> A couple of "controversial" or notable points:
>
>   • All in-cluster traffic is allowed on the OpenStack level. We did not need this before so it was disabled. Now prometheus needs to reach node-exporter on every node. We can limit further if necessary but I do believe this is a quite common policy.

I think we can add some NetworkPolicies to enforce this as a simple hardening?

>   • Grafana is configured with anonymous auth and everyone has the viewer role. This is the same as k8s.io. The point is to make grafana practically read-only.

This is a good approach. We're open source and our Prow is generally accessible, so what would hiding it really gain? We have the ability to configure it on the backend side anyway.

>   • Despite the above point, I have not yet exposed grafana. Should I?

Maybe we can do that in a separate PR after this merges, to make it a separate decision and self-documenting in case it needs to be removed in the future?

@lentzi90 (Member, Author)

> I think we can add some NetworkPolicies to enforce this as a simple hardening?

Yes we can definitely do that!
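
A NetworkPolicy along these lines could restrict node-exporter ingress to Prometheus only; a minimal sketch, assuming the standard kube-prometheus labels and namespace (the actual selectors would need to match what the rendered manifests use):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: node-exporter-allow-prometheus  # illustrative name
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - port: 9100   # node-exporter's default metrics port
          protocol: TCP
```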

> This is good approach. We're open source, our Prow is generally accessible, what would hiding it gain really? We have the ability to configure it on the backend side anyways.

Someone could DDoS it I guess, but they could already do that with Prow so 🤷
Loading dashboards in grafana also queries prometheus, so there is a certain load and some risk that this could be used to gain deeper access in the cluster I guess. Generally this does not seem to be a big concern though, since there are loads of public grafana instances out there.

@tuminoid (Member)

> Someone could DDoS it I guess, but they could already do that with Prow so 🤷 Loading dashboards in grafana also queries prometheus, so there is a certain load and some risk that this could be used to gain deeper access in the cluster I guess. Generally this does not seem to be a big concern though, since there are loads of public grafana instances out there.

Indeed, our tiny Prow can be DoS'd, no need to DDoS even. Grafana and Prometheus are not abandonware, so if we keep them up to date, and do some basic hardening on the config, we should be fine.

Labels
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Prow: Monitoring
3 participants