Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reporting the previous state on the current scrape cycle #130

Open
pgeorgie1 opened this issue Jul 16, 2024 · 4 comments
Open

Reporting the previous state on the current scrape cycle #130

pgeorgie1 opened this issue Jul 16, 2024 · 4 comments

Comments

@pgeorgie1
Copy link

Summary: There is a possible race condition defect related with reporting the previous state during the current scrape cycle

Found in: sql_exporter, version 0.4.5

Actual state:
We have a scrape job which is performing database query one each scrape cycle. The SQL query, executed at each scrape cycle, is adding current_timestamp field to each SQL statement result. As result, each scrape data point shall be uniques, since the scrape cycles are performed at each 4 minute interval, the current_timestamp fields on consequential samples will be roughly 4 minutes offsetted.
The problematic observation is that when we inspect the data points, persisted in the Prometheus, we found that there are data point with 8 minutes interval and with the current setup, where the current_timestamp is rendering each data point to be unique, this behavior is incorrect.
We shall highlighted that for Prometheus, merging two sequential data point in a combined data point with totalinterval equals to the intervals of the merged data points is only possible if the sequential data points are completely equal (all their fields are equals).
In our case this merged data point behavior shall not be possible since each data point has unique current_timestamp field.
This improper behavior is presented at the attached screenshots, it might be observed that the referred data point has length of 8 minutes, instead of displaying two data point with 4 minutes length. One might observe that the screenshot is rendering just one broken data point, and all other data points have 4 minutes length.

a sample for broken data point:

a broken data point starting time:
start-time

a broken data point ending time:
end-time

Our root-cause hypothesis is that sometimes the 4 minute scrape cycle interval is not enough to perform the SQL query for the current scrap cycle and as result sql-exporter is reporting the previous state (the previous already persisted data point state).
The exposed root-cause hypothesis is supporting the observation that single data point are reported two consequential times and merged into one data point with double length (8 minutes, instead of expected 4) since all filed for the properties are equals, including the current sample timestamp.

Desired state:
A data state shall not be reported more than once, possibly when a data point is reported successfully the its state shall be removed from the registry of the sql-exported service. Possible an warning message shall be raised, if the current SQL queries is not completed in the scope of current scraping cycle interval.

@dewey
Copy link
Member

dewey commented Jul 17, 2024

Thanks for reporting, I won't have time to work on this, but if someone wants to tackle that and submit a PR I can take a look. In general I'd try to minimize the runtime of queries that the sql_exporter is executing as the main use case is really for simple health check queries that are lightweight and fast.

@pgeorgie1
Copy link
Author

Hello Philipp,

Thank you for the fast response. In our situation optimizing a particular SQL query will not be a sufficient solution, since we have a number of heavy queries and soon or later we are going to face the same issue with another scrape query.
In my opinion the future of the sql-exporter service will be scraping such heavy SQL query aggregations, similarly to our setup, and this issue will be faced by a lot of sql-exporter users. I believe that it will be very beneficial for sql-exporter project to fix that issue now, before more and more users get affected by this misbehavior.
Even worse, in our case we have a very convenient way to detect and root-cause the nature of the issue, there might be more complicated situations with other sql-exporter users and the might not be able to even understand what the problem is.

@dewey
Copy link
Member

dewey commented Jul 17, 2024

In my opinion the future of the sql-exporter service will be scraping such heavy SQL query aggregations, similarly to our setup, and this issue will be faced by a lot of sql-exporter users.

You are the first to report this, and we also haven't seen that in our own usage over the past 7 years. The project is provided "as is", but contributions and bug fixes are very welcome.

Thanks!

@marevers
Copy link
Contributor

marevers commented Jul 18, 2024

@pgeorgie1 my metrics with this exporter do not have that label, but there are other ways to solve this:

  1. use an aggregation that removes this label e.g. sum without (sample_timestamp)
  2. drop the label in your Prometheus scrape config

Our root-cause hypothesis is that sometimes the 4 minute scrape cycle interval is not enough to perform the SQL query for the current scrap cycle and as result sql-exporter is reporting the previous state

In that case I would argue you need to change your configuration in order to make sure the DB scrapes don't overlap with each other. If you have queries that are expected to take longer than the configured interval, then you should increase the interval to make sure this will not happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants