Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable KEDA as a single source of truth on scaler metrics #1281

Open
3 tasks
Tracked by #3698
hacst opened this issue Oct 22, 2020 · 29 comments
Open
3 tasks
Tracked by #3698

Enable KEDA as a single source of truth on scaler metrics #1281

hacst opened this issue Oct 22, 2020 · 29 comments
Assignees
Labels
feature All issues for new features that have been committed to needs-discussion opentelemetry stale-bot-ignore All issues that should not be automatically closed by our stale bot

Comments

@hacst
Copy link

hacst commented Oct 22, 2020

Enable KEDA to always output the scaler metrics it determines through a prometheus metrics endpoint for ScaledObjects and ScaledJobs. This allows it to becomes the single source of truth on metrics on scaling.

Use-Case

A very common requirement in a job queuing/event processing use-case is having metrics available for monitoring and alerting. The metrics allow alerting on work backing up and observing progress of work. Even though KEDA has to know this information for its own work it currently does not make it available in a way that allows it to be consistently used for these purposes.

KEDA 2 provides keda_metrics_adapter_scaler_metrics_value but that is only available for ScaledObjects with active deployments as it is produced by the metrics adapter. This means the metric is not available when no work is pending and more importantly never when ScaledJobs are used.

Currently the only way around this is to have another always running service/exporter which replicates the work the KEDA scalers performed to determine its information and use it to generate the metric. This nullifies a lot of the benefit of using KEDA scalers to begin with as instead of making use of the different scalers you could instead directly attach to this metric ensuring what your monitoring sees is consistent with what KEDA uses for scaling.

KEDA is in the ideal position to produce the metric as the single source of truth. It already needs to have the information and is always running. A user adding a ScaledJob/ScaledObject would naturally make the metric for it appear. This enables self-serving of the metric for developers solely based on KEDA CRDs. Also this way dashboards and alerting are based on the same information that was actually used for scaling.

Specification

  • Output scaler metrics values also for ScaledJobs
  • Make scaler metrics presentation consistent for ScaledObjects and ScaledJobs (tags can be used to differentiate)
  • Provide scaler metrics continuously even if no work is pending (e.g. through KEDA Operator)
@hacst hacst added feature-request All issues for new features that have not been committed to needs-discussion labels Oct 22, 2020
@tbickford
Copy link
Contributor

I think having a centralized way of pulling metrics (as opposed to having the metrics adapter and operator pull them separately) would be great but that would require some serious rework I believe.

As an interim step maybe it would make sense to look at what it would take to add Prometheus metrics collecting and exporting to the KEDA Operator first?

@hacst
Copy link
Author

hacst commented Oct 22, 2020

I have basically no knowledge about keda's internals so I'm definitely not a good candidate to judge implementation complexity.

From what I understand adding a metrics endpoint to the operator would be able to provide ScaledJobs metrics and do so continuously? That would be a valuable thing on its own I think. We could use it.

Does the operator also have continuous knowledge of the ScaledObject metrics? Or does it stop polling those once the deployment becomes active (I kinda assumed it would)? If it polls continuously then the metrics in the operator if provided to the outside could already feel pretty SSOT to a user.

Making keda itself also not query redundantly and guarantee it is consistent internally as a benefit definitely could be a longer term thing. I didn't even consider that aspect when writing this FR.

@zroubalik
Copy link
Member

Does the operator also have continuous knowledge of the ScaledObject metrics? Or does it stop polling those once the deployment becomes active (I kinda assumed it would)? If it polls continuously then the metrics in the operator if provided to the outside could already feel pretty SSOT to a user.

It polls continuously to find out whether the scaler is still active or not, so that should be doable.

Btw, KEDA is built on operator-sdk framework, which is based on Kubebuilder. There should be an API to expose metrics, see the docs: https://sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#metrics

@hacst
Copy link
Author

hacst commented Oct 23, 2020

Sounds good. If you think this could be a good first issue for someone and no one else already plans to tackle this short term I could try taking a stab at an implementation.

@tomkerkhove
Copy link
Member

Added it to our roadmap to consider and our standup to discuss.

I like the idea, but we need to think this through if we won't be trying to solve two problems badly. But I'm not opposed to it.

@tomkerkhove
Copy link
Member

tomkerkhove commented Oct 26, 2020

The reason why I'm saying that is that I also maintain https://github.com/tomkerkhove/promitor and it starts with just adding Prometheus support; until somebody asks for another metric system to support, etc etc.

And then we're not even talking about the performance impact if we start having Prometheus et al poke us very often.

So I'm just trying to avoid that KEDA become a metric scraper rather than focussing on app scaling.

(To be clear - I'm definitely not opposed given I have a solution for scraping Azure Monitor metrics)

@tomkerkhove
Copy link
Member

It polls continuously to find out whether the scaler is still active or not, so that should be doable.

Btw, KEDA is built on operator-sdk framework, which is based on Kubebuilder. There should be an API to expose metrics, see the docs: sdk.operatorframework.io/docs/building-operators/golang/advanced-topics/#metrics

If we just expose the metrics there, I don't see much of a problem to be honest.

@tomkerkhove
Copy link
Member

Bottom line - We need a design doc on this I think :)

@tomkerkhove
Copy link
Member

We've discussed this on our standup and decide to provide metrics for all systems we scrape for metrics so they can be re-used in other systems.

@smalgin
Copy link

smalgin commented Mar 4, 2021

Bumping this.

@stale
Copy link

stale bot commented Oct 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Oct 14, 2021
@hacst
Copy link
Author

hacst commented Oct 14, 2021

Any update on this? I still think it would be a very useful feature.

@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Oct 14, 2021
@tomkerkhove tomkerkhove added feature All issues for new features that have been committed to and removed feature-request All issues for new features that have not been committed to labels Oct 14, 2021
@tomkerkhove
Copy link
Member

Not yet but making sure we keep tracking and don't close.

@smalgin
Copy link

smalgin commented Oct 14, 2021

Bump!

@tomkerkhove tomkerkhove moved this to Backlog in Roadmap - KEDA Core Feb 10, 2022
@tomkerkhove tomkerkhove moved this from To Do to Proposed in Roadmap - KEDA Core Feb 11, 2022
@tomkerkhove tomkerkhove added the stale-bot-ignore All issues that should not be automatically closed by our stale bot label Apr 19, 2022
@tomkerkhove
Copy link
Member

So from how I see it we should provide an extension point so that end-users can consume and/or store the metrics that we use. This would also help us get data to eventually start doing predictive activation with (see #197).

My current proposal would be to start emitting all the metrics that we expose through the metric server towards an OpenTelemetry Collector. Optionally, we can add Prometheus as well but I'd standardize on OpenTelemetry going forward instead.

@tomkerkhove
Copy link
Member

Any thoughts on this @kedacore/keda-maintainers?

@JorTurFer
Copy link
Member

I'm not 100% sure about this, if we expose the value, anyone in the cluster could go and get the value from there 🤔
Apart from that, I'd add it for the moment to prometheus server due to it's the metrics server that we support right now

@tomkerkhove
Copy link
Member

I'm not 100% sure about this, if we expose the value, anyone in the cluster could go and get the value from there 🤔

Certainly, but it's up to the team owning the OpenTelemetry Collector to secure it how it needs to be secured. We just push the metrics to them and that's where our responsibilities end.

Apart from that, I'd add it for the moment to prometheus server due to it's the metrics server that we support right now

Prometheus is just one metric platform, pushing to an OpenTelemetry Collector allows end-users to use the metrics wherever they want.

We can definitely add them in Prometheus as well, but I believe OpenTelemetry Collector is more open and should thus be our standard.

@zroubalik
Copy link
Member

TBH I would love to see both Prom and OpeTelemetry implemented. Ideally one common function that will handle the metrics exposure based on some config.

@tomkerkhove
Copy link
Member

Both are fine for me, but personally, I think OpenTelemetry is the required piece given it also allows end-users to use Prometheus.

So if OpenTelemetry Collector is supported, then I'm happy!

@smalgin
Copy link

smalgin commented May 30, 2022

Yes, the opentele is a 'must have' piece. I agree that the security of the Collector is the responsibility of the Collector impl. Also, it will all run on a secure cluster subnet, so that's something.

@hacst
Copy link
Author

hacst commented May 30, 2022

Nice to see movement on this issue. Personally I would find native prometheus support quite useful. Having open telemetry collector (especially for metrics) is still quite rare. Native prometheus will stay the de facto standard for quite a while and it's also what was supported in keda up to now. I agree that OTEL is great going forward though.

@tomkerkhove
Copy link
Member

Let's do both then and re-evaluate when we go to KEDA v3.x (if we ever have to, no plans for now)

@tomkerkhove
Copy link
Member

Anyone who is eager to contribute this?

@ilaleksin
Copy link

@tomkerkhove may I contribute this feature?

@tomkerkhove
Copy link
Member

Most definately yes, thanks a ton! Feel free to post design proposal here or a draft PR.

Maybe a "design proposal" on the metric names/approach would be ideal.

@tomkerkhove tomkerkhove moved this from Proposed to In Progress in Roadmap - KEDA Core Jun 2, 2022
@JorTurFer
Copy link
Member

@ilyalexin
We aim to do a release next week, do you have any update about this topic? No rush at all, we can just postpone this to next release

@ilaleksin
Copy link

@JorTurFer
I am working on this feature but it will not be completed next week.
Please postpone it to the next release

@JorTurFer
Copy link
Member

No problem at all! Thanks for the update 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature All issues for new features that have been committed to needs-discussion opentelemetry stale-bot-ignore All issues that should not be automatically closed by our stale bot
Projects
Status: In Progress
Development

No branches or pull requests

7 participants