Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Elasticsearch types in Cockroachdb module #17736

Closed
wants to merge 9 commits into from

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Apr 15, 2020

What does this PR do?

Use native Elasticsearch types for CockroachDB data so histograms are stored much more efficiently.

Adapt dashboard to use these types and some other fixes (see section about this below).

Why is it important?

Align CockroachDB module with latest Prometheus changes to leverage the use of new histogram type.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Verify that the fix in histograms makes sense. (Moved to Fix prometheus histogram rate overflows #17753)
  • Create a follow-up issue to investigate responses with duplicated histograms.
  • Find a way to maintain previous visualizations for raft latency.
    • Can nanoseconds be formated?
  • Add some additional check to system tests.

How to test this PR locally

Run the CockroachDB module, check that dashboard works.

Related issues

  • Relates to #14064

Dashboard

There are some changes in dashboards:

  • Calculation of ranges is more accurate now.
  • Histogram-based visualizations have been replaced to show more representative data, and use 99th percentile instead of average.
    Captura de pantalla de 2020-04-17 18-22-33

@@ -731,46 +731,11 @@
"metrics": [
{
"agg_with": "avg",
"field": "prometheus.metrics.raft_process_logcommit_latency_count",
"field": "prometheus.raft_process_logcommit_latency.histogram",
Copy link
Member Author

@jsoriano jsoriano Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current dashboard is using sum and count to calculate the average of this value. I think it can make sense now to calculate percentiles, but I haven't managed to use histograms in TSVB yet. @exekias do you know if they are already supported?

Copy link
Member Author

@jsoriano jsoriano Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works with other visualizations, I will go on with line graphs by now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, currently only Visualize supports this type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have replaced the graphs that were using sum and count to calculate averages and they are using 99th percentile now (as the CockroachDB admin UI does). It is quite ok now but the timings are in nanoseconds and I haven't found a way to format them.

@jsoriano jsoriano self-assigned this Apr 15, 2020
@jsoriano jsoriano added [zube]: In Progress enhancement in progress Pull request is currently in progress. module Team:Platforms Label for the Integrations - Platforms team labels Apr 15, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@jsoriano
Copy link
Member Author

I have moved changes for fields validation to #17759

jsoriano added a commit that referenced this pull request Apr 17, 2020
Fix some overflows on Prometheus histogram rate calculations.
They could be caused by:
* New buckets added to existing histograms on runtime, this happens at
  least with CockroachDB (see #17736).
* Buckets with bigger upper limits have lower counters. This is wrong and
  has been only reproduced this on tests, but handling it just in case to avoid
  losing other data if this happens with some service.

Rate calculation methods return now also a boolean to be able to differenciate
if a zero value is caused because it was the first call, or because it the rate is
actually zero.
jsoriano added a commit to jsoriano/beats that referenced this pull request Apr 17, 2020
Fix some overflows on Prometheus histogram rate calculations.
They could be caused by:
* New buckets added to existing histograms on runtime, this happens at
  least with CockroachDB (see elastic#17736).
* Buckets with bigger upper limits have lower counters. This is wrong and
  has been only reproduced this on tests, but handling it just in case to avoid
  losing other data if this happens with some service.

Rate calculation methods return now also a boolean to be able to differenciate
if a zero value is caused because it was the first call, or because it the rate is
actually zero.

(cherry picked from commit 0afffa8)
jsoriano added a commit to jsoriano/beats that referenced this pull request Apr 17, 2020
Fix some overflows on Prometheus histogram rate calculations.
They could be caused by:
* New buckets added to existing histograms on runtime, this happens at
  least with CockroachDB (see elastic#17736).
* Buckets with bigger upper limits have lower counters. This is wrong and
  has been only reproduced this on tests, but handling it just in case to avoid
  losing other data if this happens with some service.

Rate calculation methods return now also a boolean to be able to differenciate
if a zero value is caused because it was the first call, or because it the rate is
actually zero.

(cherry picked from commit 0afffa8)
@jsoriano jsoriano force-pushed the cockroachdb-use-types branch from f34de4f to 55e9891 Compare April 17, 2020 14:32
jsoriano added a commit that referenced this pull request Apr 20, 2020
…17783)

Fix some overflows on Prometheus histogram rate calculations.

They could be caused by:
* New buckets added to existing histograms on runtime, this happens at
  least with CockroachDB (see #17736).
* Buckets with bigger upper limits have lower counters. This is wrong and
  has been only reproduced this on tests, but handling it just in case to avoid
  losing other data if this happens with some service.

Rate calculation methods return now also a boolean to be able to differenciate
if a zero value is caused because it was the first call, or because it the rate is
actually zero.

(cherry picked from commit 0afffa8)
jsoriano added a commit that referenced this pull request Apr 20, 2020
…17784)

Fix some overflows on Prometheus histogram rate calculations.

They could be caused by:
* New buckets added to existing histograms on runtime, this happens at
  least with CockroachDB (see #17736).
* Buckets with bigger upper limits have lower counters. This is wrong and
  has been only reproduced this on tests, but handling it just in case to avoid
  losing other data if this happens with some service.

Rate calculation methods return now also a boolean to be able to differenciate
if a zero value is caused because it was the first call, or because it the rate is
actually zero.

(cherry picked from commit 0afffa8)
@@ -4,3 +4,16 @@ input:
metricset: collector
defaults:
metrics_path: /_status/vars
use_types: true
processors:
- drop_fields:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this could make use of metrics_filters instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, as all sql_mem_sql... metrics are duplicated in cockroachdb. But only histograms are problematic, the rest are gauges, and in principle Metricbeat doesn't have any problem with duplicated gauges.
Also, removing these metrics at this point ensures that they are never collected, even if the user sets its own metrics_filters, or adds its own processors.

@mergify
Copy link
Contributor

mergify bot commented Sep 22, 2021

This pull request does not have a backport label. Could you fix it @jsoriano? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Sep 22, 2021
@mergify
Copy link
Contributor

mergify bot commented Nov 10, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b cockroachdb-use-types upstream/cockroachdb-use-types
git merge upstream/master
git push upstream cockroachdb-use-types

@jlind23
Copy link
Collaborator

jlind23 commented Mar 31, 2022

@jsoriano - Closing this one as there were no activity for a while

@jlind23 jlind23 closed this Mar 31, 2022
@zube zube bot removed the [zube]: Done label Jun 30, 2022
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…lows (elastic#17784)

Fix some overflows on Prometheus histogram rate calculations.

They could be caused by:
* New buckets added to existing histograms on runtime, this happens at
  least with CockroachDB (see elastic#17736).
* Buckets with bigger upper limits have lower counters. This is wrong and
  has been only reproduced this on tests, but handling it just in case to avoid
  losing other data if this happens with some service.

Rate calculation methods return now also a boolean to be able to differenciate
if a zero value is caused because it was the first call, or because it the rate is
actually zero.

(cherry picked from commit 2d6e0ca)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify enhancement in progress Pull request is currently in progress. module Team:Platforms Label for the Integrations - Platforms team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants