
Monitoring for ironic-prometheus-exporter #340

Closed

Conversation

iurygregory (Contributor) commented Jul 3, 2019

This PR depends on #302 and adds the changes needed to enable monitoring
in the machine-api-operator, allowing Prometheus
to collect data from the ironic-prometheus-exporter [1] that
runs in the ironic-image [2] (a sketch of these manifests follows the references below).

  • Added the monitoring label to the namespace yaml
  • Added monitoring information to the rbac yaml
  • Added a Service for the ironic-prometheus-exporter
  • Added a ServiceMonitor for the ironic-prometheus-exporter
  • Added a PrometheusRule with an alert for the baremetal_temp_celsius metric

[1] https://github.com/metal3-io/ironic-prometheus-exporter
[2] https://github.com/metal3-io/ironic-image
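
For orientation, a minimal sketch of the kind of Service described above; it is illustrative only. The name, namespace, port, and selector values are assumptions rather than values taken from the PR; only the app: ironic-exporter label also appears in the review snippets further down.

apiVersion: v1
kind: Service
metadata:
  name: ironic-prometheus-exporter   # assumed name, not necessarily the one used in the PR
  namespace: openshift-machine-api
  labels:
    app: ironic-exporter             # label a ServiceMonitor can select on
spec:
  selector:
    app: ironic-exporter             # assumed label on the pod running the exporter container
  ports:
    - name: metrics
      port: 9608                     # assumed exporter port
      targetPort: 9608

The ServiceMonitor and PrometheusRule discussed later in the thread follow the same pattern.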

openshift-ci-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Jul 3, 2019
iurygregory (Contributor, Author) commented:

@openshift/sig-monitoring

brancz commented Jul 3, 2019

I don’t actually see the Service or DaemonSet for this exporter, so it’s hard to say whether this is correct. Generally speaking it looks OK, but I’d like to avoid having ServiceMonitors that don’t actually discover/select anything.

russellb (Member) commented Jul 3, 2019

This will need to be reworked so that it's only applied when the platform is of type "baremetal". I think it would make sense to include this as part of #302, since that's the PR that would deploy the ironic prometheus exporter as well.

brancz commented Jul 4, 2019

Yes it would be easier to verify this if it included the deployment of the exporter.

hardys commented Jul 4, 2019

Agreed, this should either be rolled into #302 or this PR can be rebased onto that one when ready.

iurygregory force-pushed the ironic_prometheus_exporter branch from b8d1ced to c2aa4fc on Jul 11, 2019
iurygregory changed the title from "ServiceMonitor for ironic-prometheus-exporter" to "Monitoring for ironic-prometheus-exporter" on Jul 11, 2019
iurygregory force-pushed the ironic_prometheus_exporter branch from c2aa4fc to e25d72e on Jul 16, 2019
iurygregory (Contributor, Author) commented:

I would like some reviews to see if this is OK, even though it still depends on PR #302 =)

iurygregory force-pushed the ironic_prometheus_exporter branch 2 times, most recently from 130867c to b10f7c0, on Jul 17, 2019
iurygregory force-pushed the ironic_prometheus_exporter branch from b10f7c0 to bbd25b6 on Jul 18, 2019
hardys commented Aug 15, 2019

Since we had to switch to an older version of Ironic that doesn't contain the exporter (ref #380), this should probably be abandoned until we can consume a newer Ironic version?

sadasu (Contributor) commented Nov 6, 2019

@hardys Is the version of Ironic we are using now in deploying "metal3" appropriate to restart this work?

iurygregory (Contributor, Author) commented:

@sadasu I'm reworking this PR. The version in metal3 is fine, but for OpenShift we will first need openshift/ironic-image#27 and also an update to the BMO so it is aware of the ironic-prometheus-exporter container.

iurygregory force-pushed the ironic_prometheus_exporter branch from bbd25b6 to d56877f on Feb 20, 2020
hardys commented Feb 20, 2020

@sadasu can you take another look at this please - in particular, I'd like to confirm how we can ensure the Service/ServiceMonitor etc. only get created for the platform: baremetal case, where we expect the metal3 pod parts to be enabled?

openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign bison
You can assign the PR to them by writing /assign @bison in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

apiVersion: monitoring.coreos.com/v1
kind: Service
metadata:
  name: metal3-baremetalhost-controller
Contributor review comment:

I am not sure what the "name" here should be. But, "metal3-baremetalhost-controller" is the name of the baremetal_operator controller. See https://github.com/metal3-io/baremetal-operator/blob/master/pkg/controller/baremetalhost/baremetalhost_controller.go#L105


metadata:
  labels:
    app: ironic-exporter
  name: metal3-baremetalhost-controller

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: metal3-baremetalhost-controller

sadasu (Contributor) commented Feb 20, 2020

The .yaml files added in the /install dir will be installed by the CVO for all platforms. What would the behavior be when the ironic-prometheus-exporter container is not running for non "BareMetal" platform types?

sadasu (Contributor) commented Feb 20, 2020

You also need to add the ironic-prometheus-exporter image to https://github.com/openshift/machine-api-operator/blob/master/install/image-references and https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/fixtures/images.json.

iurygregory (Contributor, Author) commented:

The exporter doesn't need a new image AFAIK; we will be using the ironic image (https://github.com/openshift/ironic-image), we only need to add the container for the exporter.

simonpasquier commented:
AFAICT creating the service, service monitor and prometheus rule with CVO shouldn't be an issue for non-baremetal platforms as the service monitor would match 0 endpoints and the Prometheus rule would generate 0 series. That being said, a better option would be to have the machine API operator manage those resources as it knows whether or not the Ironic exporter is deployed.
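
To illustrate the "match 0 endpoints" point, here is a rough ServiceMonitor sketch; the name, namespace, interval, and port are assumptions, and only the app: ironic-exporter label comes from the review snippets above. If no Service in the selected namespace carries that label (as on non-baremetal platforms), the selector resolves to zero endpoints and Prometheus simply scrapes nothing.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ironic-prometheus-exporter   # assumed name
  namespace: openshift-machine-api
spec:
  endpoints:
    - port: metrics                  # must match a named port on the selected Service
      interval: 30s                  # assumed scrape interval
  selector:
    matchLabels:
      app: ironic-exporter           # no matching Service => zero discovered endpoints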

iurygregory force-pushed the ironic_prometheus_exporter branch 3 times, most recently from 0155292 to 5214e4b, on Mar 5, 2020
- alert: HighCPUTemperature
  annotations:
    summary: "The baremetal node {{ $labels.node_name }} CPU {{ $labels.entity_id }} is too high"
    description: "The baremetal node {{ $labels.node_name }} CPU {{ $labels.entity_id }} was too high in the past 5 minutes. Last measurement {{ $value }}"
enxebre (Member) review comment:

Wouldn't this trigger for non-baremetal envs?

enxebre (Member) review comment:

Should probably rename syncBaremetalControllers to something more generic, e.g. syncBaremetal.
Then syncBaremetal calls syncBaremetalControllers and syncBaremetalMonitoring.

iurygregory (Contributor, Author) replied:

@enxebre I don't think it would trigger in non-baremetal envs, check @simonpasquier's comment.
The sync part you are talking about would apply if we turn the yamls into Go code, right?
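
For context on the alert being discussed, here is a hypothetical complete version of the quoted rule. Only the alert name, the annotations, and the baremetal_temp_celsius metric come from the PR; the expr, for, severity, and object names are invented for illustration. Because the expression is evaluated over baremetal_temp_celsius, it returns no series, and therefore raises no alerts, on clusters where the exporter does not run.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ironic-prometheus-exporter   # assumed name
  namespace: openshift-machine-api
spec:
  groups:
    - name: ironic-exporter.rules    # assumed group name
      rules:
        - alert: HighCPUTemperature
          expr: baremetal_temp_celsius > 80   # assumed threshold; the real expr may filter by entity_id
          for: 5m                             # assumed duration
          labels:
            severity: warning                 # assumed severity
          annotations:
            summary: "The baremetal node {{ $labels.node_name }} CPU {{ $labels.entity_id }} is too high"
            description: "The baremetal node {{ $labels.node_name }} CPU {{ $labels.entity_id }} was too high in the past 5 minutes. Last measurement {{ $value }}"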

This commit adds the changes that will allow Prometheus
to collect data from the ironic-prometheus-exporter[1] that
runs in the ironic-image [2].

- Added a Service for the ironic-prometheus-exporter with the same
  run level as the openshift-machine-api.
- Added the ServiceMonitor for the ironic-prometheus-exporter
- Added a PrometheusRule with alerts for the baremetal_temp_celsius
  metric.
- Added the ironic-exporter container

Note: Using run level 90 for the ServiceMonitor and PrometheusRule
to ensure that the Service and the Prometheus Role and RoleBinding
have already been applied.

[1] https://github.com/metal3-io/ironic-prometheus-exporter
[2] https://github.com/metal3-io/ironic-image
iurygregory force-pushed the ironic_prometheus_exporter branch from 5214e4b to cdfe1f8 on Mar 11, 2020
openshift-ci-robot (Contributor) commented:

@iurygregory: The following tests failed, say /retest to rerun all failed tests:

Test name                        Commit    Rerun command
ci/jenkins/integration           bbd25b6   /test integration
ci/prow/yaml-lint                cdfe1f8   /test yaml-lint
ci/prow/unit                     cdfe1f8   /test unit
ci/prow/e2e-gcp-operator         cdfe1f8   /test e2e-gcp-operator
ci/prow/e2e-aws                  cdfe1f8   /test e2e-aws
ci/prow/e2e-azure-operator       cdfe1f8   /test e2e-azure-operator
ci/prow/e2e-aws-upgrade          cdfe1f8   /test e2e-aws-upgrade
ci/prow/e2e-azure                cdfe1f8   /test e2e-azure
ci/prow/e2e-aws-operator         cdfe1f8   /test e2e-aws-operator
ci/prow/images                   cdfe1f8   /test images
ci/prow/e2e-aws-scaleup-rhel7    cdfe1f8   /test e2e-aws-scaleup-rhel7
ci/prow/e2e-gcp                  cdfe1f8   /test e2e-gcp

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

iurygregory (Contributor, Author) commented:

Does anyone know people from the CVO team who can take a look at the PR to see if this approach (using the if in the yamls) is valid?

enxebre (Member) commented Mar 17, 2020

> Does anyone know people from the CVO team who can take a look at the PR to see if this approach (using the if in the yamls) is valid?

Where is that usingBaremetal coming from? It seems to be breaking all tests. Anything under install/ is managed by the CVO, and the only way to discriminate is profiles, which are not a good fit for this scenario. openshift/enhancements#200

Can we please move anything baremetal-specific under https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L97? Any monitoring-related resource could be managed there; you could rename syncBaremetalControllers and let it manage the monitoring resources, which you could define as yaml or golang code, whatever you feel more comfortable with, e.g. https://github.com/openshift/cluster-autoscaler-operator/blob/8e6f95038c9eee84ef7e305e2e1f4960c918b30d/pkg/controller/clusterautoscaler/monitoring.go

iurygregory (Contributor, Author) commented Mar 17, 2020

> Does anyone know people from the CVO team who can take a look at the PR to see if this approach (using the if in the yamls) is valid?

> What's that coming from? It's breaking all tests. Anything under install/ is to be managed by the CVO and the only way to discriminate are profiles which are not a good case for this scenario. openshift/enhancements#200
>
> Can we please move any baremetal specific under https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L97? Any monitoring related resource could be managed there, you can rename syncBaremetalControllers and let it manage the monitoring resources that you can define as yaml or golang code whatever you feel more comfortable with.

Hi @enxebre, I was talking with some people from kuryr and they pointed me to https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/kuryr/008-admission.secret.yaml#L1 (not sure if MAO has support for that, but the conditions would ensure that the yamls only apply when the condition is true).

Is there any other directory where I can place the yamls and then call them from syncBaremetalControllers to apply them?

enxebre (Member) commented Mar 17, 2020

> Hi @enxebre, I was talking with some people from kuryr and they pointed me to https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/kuryr/008-admission.secret.yaml#L1 (not sure if MAO has support for that, but the conditions would ensure that the yamls only apply when the condition is true).
> Is there any other directory where I can place the yamls and then call them from syncBaremetalControllers to apply them?

That's using bindata to compile and embed the yamls into static golang code that you can template at runtime as you wish. You could follow the same approach here, e.g.:

https://github.com/openshift/cluster-kube-apiserver-operator/tree/master/bindata/v4.1.0

https://github.com/openshift/cluster-kube-apiserver-operator/tree/master/pkg/operator/v410_00_assets

https://github.com/openshift/cluster-kube-apiserver-operator/blob/3161546a248f20eb67231017d0ae43c245bfa4cb/pkg/recovery/apiserver.go#L134-L177

Or, alternatively, you can define your resources as API golang types, e.g. https://github.com/openshift/cluster-autoscaler-operator/blob/8e6f95038c9eee84ef7e305e2e1f4960c918b30d/pkg/controller/clusterautoscaler/monitoring.go
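
To make the bindata suggestion concrete, here is a hypothetical sketch of a templated manifest in the kuryr style referenced above; the IsBaremetalPlatform variable, the file layout, and the resource values are assumptions, not something MAO necessarily supports today. The operator would render the template at runtime and only apply the resource when the condition is true.

# hypothetical templated manifest; the resource is rendered only when the
# (assumed) IsBaremetalPlatform template variable is true
{{ if .IsBaremetalPlatform }}
apiVersion: v1
kind: Service
metadata:
  name: ironic-prometheus-exporter   # assumed name
  namespace: openshift-machine-api
  labels:
    app: ironic-exporter
spec:
  selector:
    app: ironic-exporter
  ports:
    - name: metrics
      port: 9608                     # assumed exporter port
{{ end }}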

openshift-ci-robot removed the request for review from spangenberg on Jul 24, 2020
openshift-bot (Contributor) commented:

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Oct 30, 2020
openshift-merge-robot (Contributor) commented:

@iurygregory: The following test failed, say /retest to rerun all failed tests:

Test name          Commit    Rerun command
ci/prow/generate   cdfe1f8   /test generate

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot (Contributor) commented:

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Dec 6, 2020
openshift-bot (Contributor) commented:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot (Contributor) commented:

@openshift-bot: Closed this PR.

In response to this:

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting /reopen.
> Mark the issue as fresh by commenting /remove-lifecycle rotten.
> Exclude this issue from closing again by commenting /lifecycle frozen.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels: lifecycle/rotten, size/M