Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram unbounded memory even with run_upkeep #467

Closed
gauravkumar37 opened this issue Mar 26, 2024 · 13 comments · Fixed by #537
Closed

Histogram unbounded memory even with run_upkeep #467

gauravkumar37 opened this issue Mar 26, 2024 · 13 comments · Fixed by #537
Labels
C-exporter Component: exporters such as Prometheus, TCP, etc. E-simple Effort: simple. T-ergonomics Type: ergonomics. T-request Type: request.

Comments

@gauravkumar37
Copy link

Thanks for a wonderful library. After the recent merge of #460 I tested the upkeep and idle timeout configs but still see unbounded memory usage if /metrics endpoint is not being called.
Is this still expected even after upkeep?

Testing methodology:

  • Using axum with axum-prometheus
  • Using /metrics
  • Using wrk to benchmark an API with a simple health endpoint
  • Memory keep on increasing unbounded
  • Code has idle_timeout of 60s and upkeep of 5s
  • If I now open /metrics endpoint, memory eventually reduces and the more I open it periodically, the more stable the memory is

So, is my observed behavior still expected even after the introduction of upkeep?

@tobz
Copy link
Member

tobz commented Mar 27, 2024

Hmm, the latest (0.6.1) version of axum-prometheus is still on 0.13.x of metrics-exporter-prometheus. Are you using a development version or patching metrics-exporter-prometheus to 0.14.0?

@gauravkumar37
Copy link
Author

Correct, hence I patched the version of metrics-exporter-prometheus in axum-prometheus to 0.14.0 to test, also able to verify that adding upkeep_timeout compiles correctly, also verified the 0.14.0 in Cargo.lock.
Patch location: https://github.com/gauravkumar37/axum-prometheus/tree/patch-1

@tobz
Copy link
Member

tobz commented Mar 27, 2024

Looking at how axum-prometheus uses metrics-exporter-prometheus, this makes sense now: PrometheusBuilder doesn't create the upkeep task unless PrometheusBuilder::build or PrometheusBuilder::install are called, whereas right now, it calls PrometheusBuilder::install_recorder.

Not really sure there's a ton we can do here, or to be frank, that I'm willing to do here. The number of ways to build/install the recorder is already too high for comfort because it's trying to be all things to all people. The simplest thing would be if axum-prometheus switched to using PrometheusBuilder::build and discarded the exporter future if it's not in push gateway mode. It would still have to install the recorder, but it would allow it the chance to grab a recorder handle, and it would allow the upkeep task to be created.

@tobz tobz added C-exporter Component: exporters such as Prometheus, TCP, etc. E-simple Effort: simple. T-ergonomics Type: ergonomics. T-request Type: request. labels Mar 27, 2024
@gauravkumar37
Copy link
Author

This is what I am using inside axum framework:

    PrometheusMetricLayerBuilder::new()
        .with_metrics_from_fn(|| {
            PrometheusBuilder::new()
                .idle_timeout(MetricKindMask::HISTOGRAM, Some(Duration::from_secs(300)))
                .upkeep_timeout(Duration::from_secs(60))
                .set_buckets(&[0.1, 0.2, 0.3])
                .unwrap()
                .install_recorder()
                .unwrap()
        })
        .build_pair()

This does create the PrometheusBuilder and installs it. Is there something more to do here?

@tobz
Copy link
Member

tobz commented Mar 27, 2024

Right, you're using install_recorder, when you need to use build.

@gauravkumar37
Copy link
Author

gauravkumar37 commented Mar 29, 2024

Ok, I got it to work correctly but I must say, that was an unexpected nuance there :-)

So, something like this works and there is no more unbounded memory growth:

    PrometheusMetricLayerBuilder::new()
        .with_metrics_from_fn(|| {
            let (recorder, _) = PrometheusBuilder::new()
                // if metrics are not updated within this time, they will be removed
                .idle_timeout(MetricKindMask::HISTOGRAM, Some(Duration::from_secs(5 * 60)))
                // prevents unbounded memory growth by draining histogram data periodically
                .upkeep_timeout(Duration::from_secs(5))
                .set_buckets(buckets_in_secs.as_slice())
                .expect_or_log(&format!("Failed to set buckets: '{buckets_in_secs:?}'"))
                // instead of install_recorder, use build because the former doesn't start the upkeep tasks
                .build()
                .expect_or_log("Failed to build metrics recorder");
            let handle = recorder.handle();
            metrics::set_global_recorder(recorder).expect_or_log("Failed to set global recorder");
            handle
        })
        .build_pair()

++ @Ptrskay3 for changes in axum-prometheus

@Ptrskay3
Copy link

Thanks for the ping. I'm open to changing axum_prometheus, if that makes it safer for most of the people. Not sure when I'll find the time to do it myself however. Created Ptrskay3/axum-prometheus#51.

@tobz
Copy link
Member

tobz commented Mar 31, 2024

@Ptrskay3 Thanks for opening that. 👍🏻

Like I said upthread, I'm also not a fan of how thorny the crate (this crate, not yours) has become in terms of the API... but it does seem like taking the approach shown by @gauravkumar37 would be the simplest one.

I'll also make a note for myself to potentially contribute that as a PR to axum_prometheus.

@Ptrskay3
Copy link

Just released 0.7 of axum-prometheus, that should fix this issue if people using the default implementation.

@anastasia-tarasenko
Copy link

@tobz Do you consider to add upkeep_timeout task to the install_recorder() (or build_recorder) method? If somebody needs a recorder and creates it using install_recorder method then there is a still bug with the memory usage. It seems that this task should be presented there.

@tobz
Copy link
Member

tobz commented Oct 29, 2024

When users opt to use PrometheusBuilder::build_recorder or PrometheusBuilder::install_recorder, they're specifically opting into a lower level of control where they're handling how the recorder is installed and/or how the resulting rendered metrics are emitted and consumed. They're also opting into handling upkeep manually as well.

All of that said, the documentation on this isn't good and doesn't explain this for PrometheusBuilder::build_recorder or PrometheusBuilder::install_recorder, so I'm going to open a PR to make those changes.

@anastasia-tarasenko
Copy link

@tobz Thank you for explanations. I have one more question. What was the main reason why so complicated logic was used for the histogram data and why it can not be calculated immediately in the time when it received?

@tobz
Copy link
Member

tobz commented Oct 31, 2024

It mostly boils down to concurrency.

When I originally wrote the Prometheus exporter, I needed something that could support histograms or summaries, which meant needing to keep track of the raw samples to accurately generate the summary data. Since we need the raw samples, this forces us to be able to concurrently accept the incoming histogram samples (since there can be multiple threads updating the same metric at the same time) which is why I wrote AtomicBucket<T>.

If we didn't need to care about aggregated summaries, we could do something better like an array of atomic integers -- one integer per histogram bucket -- and just update them directly, as you allude to. It might be worth doing that overall, and just sort of declaring "we don't support aggregated summaries because they're crap and make things harder, sorry" but I don't have a strong enough opinion/need so it's never really been on my list of things to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-exporter Component: exporters such as Prometheus, TCP, etc. E-simple Effort: simple. T-ergonomics Type: ergonomics. T-request Type: request.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants