Prow monitoring through kube-prometheus #928

Open · wants to merge 8 commits into main from lentzi90/prow-kube-prometheus
Conversation

@lentzi90 (Member) commented Dec 13, 2024

This adds the kube-prometheus stack to our prow cluster and configures it to monitor prowjobs on top of all that it does out of the box.

I have purposefully not squashed the commits to make it easier to follow the process, but I can of course squash if that is preferred. The most interesting part is probably to compare the initial commit, which has example.jsonnet unmodified, with the changes that come later.

Another thing to note when reviewing: everything under manifests is rendered. I include it in the repository because it is what we apply in the cluster and because it is far easier to understand than the jsonnet code. It also shows clearly what the changes to the jsonnet code do when comparing commits.

We should probably discuss if we want to keep the jsonnet code at all or if we just want to continue from the rendered manifests. We would then no longer be able to consume the upstream kubernetes-mixins and kube-prometheus, but we get one less tool and language to keep track of. It looks like k8s.io has chosen this route.
In any case, I think we can merge it with the jsonnet first, before we decide how to handle it. It will be a good historic record at least that shows where it came from.
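
As a rough sketch of how such a jsonnet entry point looks (this is modeled on the upstream kube-prometheus example.jsonnet; the names and namespace here are illustrative and may differ from the actual code in this PR):

```jsonnet
// Hypothetical sketch based on the upstream kube-prometheus example.jsonnet.
local kp = (import 'kube-prometheus/main.libsonnet') + {
  values+:: {
    common+: { namespace: 'monitoring' },
  },
};

// Render components into one manifest map; the build script typically
// writes each entry out as manifests/<name>.yaml.
{ ['setup/' + name]: kp.kubePrometheus[name] for name in std.objectFields(kp.kubePrometheus) } +
{ [name]: kp.prometheus[name] for name in std.objectFields(kp.prometheus) } +
{ [name]: kp.grafana[name] for name in std.objectFields(kp.grafana) }
```

Dropping the jsonnet would mean editing the rendered YAML under manifests directly instead of regenerating it from an entry point like this.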

A couple of "controversial" or notable points:

  • All in-cluster traffic is allowed on the OpenStack level. We did not need this before, so it was disabled. Now prometheus needs to reach node-exporter on every node. We can limit this further if necessary, but I do believe this is quite a common policy.
  • Grafana is configured with anonymous auth and everyone has the viewer role. This is the same as k8s.io. The point is to make grafana practically read-only.
  • Despite the above point, I have not yet exposed grafana. Should I?
  • The prowjob dashboard is a bit outdated, but it is the same that is used in the k8s.io instances. We have a newer version of grafana though, that complains about it. Let's address this in the future.
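
For reference, the Grafana settings described above (anonymous viewer access, no signup or login forms, no gravatar) typically map to grafana.ini options like these; a hedged sketch, since the PR sets them through the jsonnet config rather than a raw ini file:

```ini
; Sketch of the relevant grafana.ini sections; exact values in the PR may differ.
[auth.anonymous]
enabled = true
org_role = Viewer

[auth]
disable_login_form = true

[users]
allow_sign_up = false

[security]
disable_gravatar = true
```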

Fixes: #896

Add initial kube-prometheus setup based on example.jsonnet.

Signed-off-by: Lennart Jern <[email protected]>
We have no use for it at the moment, so let's simplify and keep things
minimal.

Signed-off-by: Lennart Jern <[email protected]>
Make all monitoring components run on infra nodes where possible.
Allow all in-cluster traffic so that prometheus can scrape node-exporter
on other nodes than where prometheus itself is running.

Signed-off-by: Lennart Jern <[email protected]>
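
Pinning components to infra nodes is usually done with a nodeSelector and matching tolerations if the nodes are tainted; a minimal sketch, assuming an illustrative `node-role.kubernetes.io/infra` label and taint (the actual names in the cluster may differ):

```yaml
# Hypothetical pod template fragment; label and taint names are illustrative.
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
        - key: node-role.kubernetes.io/infra
          operator: Exists
          effect: NoSchedule
```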
Configure prometheus to retain data for 30 days and give it persistent
volumes to do so.

Signed-off-by: Lennart Jern <[email protected]>
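
On the Prometheus custom resource this maps to the `retention` and `storage` fields; a minimal sketch (the name, storage class, and volume size here are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s           # illustrative; kube-prometheus's default name
  namespace: monitoring
spec:
  retention: 30d      # keep data for 30 days
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard  # illustrative
        resources:
          requests:
            storage: 50Gi           # illustrative
```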
This adds prometheusrules for scraping (and labeling) prowjob metrics.
It also adds the necessary labels to the allow list for
kube-state-metrics and adds a dashboard for the prowjob metrics.

Signed-off-by: Lennart Jern <[email protected]>
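
The kube-state-metrics allow list mentioned here refers to its `--metric-labels-allowlist` flag, which controls which pod labels are exported as metric labels; a hedged sketch of the container argument (the prowjob label names are illustrative):

```yaml
# Fragment of the kube-state-metrics container spec; label names are illustrative.
args:
  - --metric-labels-allowlist=pods=[created-by-prow,prow.k8s.io/job,prow.k8s.io/type]
```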
Make it easier to build the manifests by running the build script in a
container and by automatically installing the needed go packages.

Signed-off-by: Lennart Jern <[email protected]>
Make everyone viewer, disable signup and login forms.
Also disable gravatar.

Signed-off-by: Lennart Jern <[email protected]>
@metal3-io-bot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lentzi90. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Dec 13, 2024
Add short readme for the monitoring stack and link to it from the main
prow readme.

Signed-off-by: Lennart Jern <[email protected]>
@lentzi90 force-pushed the lentzi90/prow-kube-prometheus branch from 1685fae to 204a910 on December 13, 2024 at 09:20
@lentzi90 (Member, Author)

/cc @tuminoid

@tuminoid (Member)

> /cc @tuminoid

Thanks @lentzi90, I'll take a look. 99 files changed, might take a while :)

> This adds the kube-prometheus stack to our prow cluster and configures it to monitor prowjobs on top of all that it does out of the box.

> I have purposefully not squashed the commits to make it easier to follow the process but I can of course squash if that is preferred. The most interesting part is probably to compare the initial commit, that has the example.jsonnet unmodified, with the changes that comes later.

This is certainly a case where we don't want squashed commits. It is indeed very helpful to have them in the way you committed them.

> Another thing to note when reviewing: everything under manifests is rendered. I include it in the repository because it is what we apply in the cluster and because it is far easier to understand than the jsonnet code. It also shows clearly what the changes to the jsonnet code does when comparing commits.

> We should probably discuss if we want to keep the jsonnet code at all or if we just want to continue from the rendered manifests. We would then no longer be able to consume the upstream kubernetes-mixins and kube-prometheus, but we get one less tool and language to keep track of. It looks like k8s.io has chosen this route. In any case, I think we can merge it with the jsonnet first, before we decide how to handle it. It will be a good historic record at least that shows where it came from.

I agree, we need the historic portion at least. Initially I dislike completely detaching, but I'll comment further on this after reading the actual code. I can see the point of going with the rendered manifests, though.

> A couple of "controversial" or notable points:
>
>   • All in-cluster traffic is allowed on the OpenStack level. We did not need this before so it was disabled. Now prometheus needs to reach node-exporter on every node. We can limit further if necessary but I do believe this is a quite common policy.

I think we can add some NetworkPolicies to enforce this as a simple hardening?

>   • Grafana is configured with anonymous auth and everyone has the viewer role. This is the same as k8s.io. The point is to make grafana practically read-only.

This is a good approach. We're open source and our Prow is generally accessible, so what would hiding it really gain? We have the ability to configure it on the backend side anyway.

>   • Despite the above point, I have not yet exposed grafana. Should I?

Maybe we can do that in a separate PR after this merges, to make it a separate decision and self-documenting in case it needs to be removed in the future?

@lentzi90 (Member, Author)

> I think we can add some NetworkPolicies to enforce this as a simple hardening?

Yes we can definitely do that!
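
A NetworkPolicy along these lines could restrict node-exporter ingress to Prometheus only; a minimal sketch, assuming the standard kube-prometheus labels and namespace (the actual selectors would need to match what the rendered manifests use):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: node-exporter-allow-prometheus  # illustrative name
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - port: 9100   # node-exporter's default metrics port
          protocol: TCP
```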

> This is good approach. We're open source, our Prow is generally accessible, what would hiding it gain really? We have the ability to configure it on the backend side anyways.

Someone could DDoS it I guess, but they could already do that with Prow so 🤷
Loading dashboards in grafana also queries prometheus, so there is a certain load and some risk that this could be used to gain deeper access in the cluster I guess. Generally this does not seem to be a big concern though, since there are loads of public grafana instances out there.

@tuminoid (Member)

> Someone could DDoS it I guess, but they could already do that with Prow so 🤷 Loading dashboards in grafana also queries prometheus, so there is a certain load and some risk that this could be used to gain deeper access in the cluster I guess. Generally this does not seem to be a big concern though, since there are loads of public grafana instances out there.

Indeed, our tiny Prow can be DoS'd, no need to DDoS even. Grafana and Prometheus are not abandonware, so if we keep them up to date, and do some basic hardening on the config, we should be fine.

Labels
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Prow: Monitoring
3 participants