Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitoring telemetry collection interval #34609

Merged
merged 21 commits into from
Apr 27, 2019

Conversation

Bamieh
Copy link
Member

@Bamieh Bamieh commented Apr 5, 2019

Work on @chrisronline draft PR #33705


Relates to #32519

This is a straight copy and paste from this comment. This PR implements this idea.

In order for this PR to work, you'll need to update the .monitoring-kibana index template in ES.

PUT _template/.monitoring-kibana

Copy and paste the json from here -> https://gist.github.com/chrisronline/fdf4f549d4e96cdda06e7b20cf88ce2a

Okay I think I have a plan for this and want to run it by everyone. This solution will ensure that kibana monitoring collection will only poll for usage collector data at the same interval as the normal telemetry service.

  • Add a flag (isUsageCollector) to each usage collector, as an actual property similar to ignoreForInternalUploader so monitoring can know it's a usage collector (see why we have to do it this way)
  • Modify the kibana monitoring collection code to ONLY collect from usage collectors at the same rate as default telemetry (This will be kept in sync by exposing the interval through the xpack_main plugin). However, the monitoring code will pull from stats collectors at the current interval (default to 10s) to ensure the monitoring service and UIs continue to function properly. This will mean that some .monitoring-kibana-* documents will contain a usage field, and some will not.
  • Update the .monitoring-kibana index template to include mappings for the usage field (it currently does not) - Doing this will require a version dance as we want to ensure that folks don't run into a situation where Kibana is assuming this mapping exists but it doesn't yet in ES. This shouldn't happen as we explicitly recommend the monitoring cluster to be at least the same version as all monitored clusters
  • Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data (which will now occur at the same interval as default telemetry)

@Bamieh Bamieh added Team:Monitoring Stack Monitoring team Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Telemetry labels Apr 5, 2019
@elasticmachine
Copy link
Contributor

Pinging @elastic/stack-monitoring

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-platform

@elasticmachine
Copy link
Contributor

💔 Build Failed

@Bamieh
Copy link
Member Author

Bamieh commented Apr 6, 2019

retest

@elasticmachine
Copy link
Contributor

💔 Build Failed

@peterschretlen
Copy link
Contributor

peterschretlen commented Apr 8, 2019

@Bamieh I think the test failure here might be legitimate? When we call the telemetry stats API /api/telemetry/v1/clusters/_stats makes a query filtering on {"exists":{"field":"kibana_stats.usage.index"}}. For me this is always returning no results, so the telemetry data does not have usage data.

(I have updated the monitoring-kibana template)

If I remove that filter clause, I can see the result set has documents with kibana_stats.usage.index

@peterschretlen
Copy link
Contributor

This query returns results:

GET .monitoring-kibana-7-2019.04.08/_search?filter_path=hits.hits._source.timestamp,hits.hits._source.kibana_stats.usage.index
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "type": "kibana_stats"
                }
              }
            ]
          }
        },
        {
          "exists": {
            "field": "kibana_stats"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": "desc"
    }
  ]
}

it will return the following (can see the mix of usage data and no usage data)

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "timestamp" : "2019-04-08T18:16:06.754Z"
        }
      },
      {
        "_source" : {
          "timestamp" : "2019-04-08T18:15:56.750Z"
        }
      },
      {
        "_source" : {
          "timestamp" : "2019-04-08T18:15:46.786Z",
          "kibana_stats" : {
            "usage" : {
              "index" : ".kibana"
            }
          }
        }
      }
   ]
 }
}

But updating the filter to kibana_stats.usage.index gives back empty result set:

GET .monitoring-kibana-7-2019.04.08/_search?filter_path=hits.hits._source.timestamp,hits.hits._source.kibana_stats.usage.index
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "type": "kibana_stats"
                }
              }
            ]
          }
        },
        {
          "exists": {
            "field": "kibana_stats.usage.index"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": "desc"
    }
  ]
}

@peterschretlen
Copy link
Contributor

Never mind ... after deleting the index and recreating it, the queries work fine. It must have been created before I updated the template.

@peterschretlen peterschretlen self-requested a review April 8, 2019 18:42
Copy link
Contributor

@peterschretlen peterschretlen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I tested both the collection and the telemetry locally, and working as expected. I think the integration test is failing for the same reason my local testing was failing for a while, indices weren't created with the new template. Once the ES changes show up in snapshots this should be good to go.

@peterschretlen
Copy link
Contributor

retest

@elasticmachine
Copy link
Contributor

💔 Build Failed

@peterschretlen
Copy link
Contributor

looks like the snapshot already has the new template. This is captured from the same ES snapshot running locally ( https://snapshots.elastic.co/8.0.0-55bab4fb/downloads/elasticsearch/elasticsearch-8.0.0-SNAPSHOT-darwin-x86_64.tar.gz )

$ curl -s --user elastic:changeme localhost:9220/_template/.monitoring-kibana | jq '.[".monitoring-kibana"].mappings.properties.kibana_stats.properties.usage'
{
  "properties": {
    "index": {
      "type": "keyword"
    }
  }
}

The failures are reproducible locally by running the api tests, so it is something other than the template.

@ycombinator
Copy link
Contributor

Hi, AFAICT the changes in this PR (and the original draft PR proposed by @chrisronline) will only affect internal collection, that is when Kibana is collecting usage metrics internally and bulk uploads them to the Elasticsearch POST _monitoring/bulk API endpoint.

However, the long term plan for Monitoring has been for Metricbeat to externally collect monitoring metrics from Kibana. Metricbeat does this by calling the GET api/stats?extended=true Kibana API and then sending the collected metrics to the Elasticsearch POST _bulk API endpoint.

If this PR is merged, the internal collection will work as we want it to. That is, usage will only be collected every telemetryCollectionInterval and bulk uploaded to the Elasticsearch POST _monitoring/bulk API endpoint. However, if the user sets up Metricbeat collection, I think usage will be collected every time Metricbeat calls the GET api/stats?extended=true Kibana API (default: every 10s), which then would defeat the goal of this PR.

@chrisronline Please correct me if I got any of the above wrong.

@chrisronline
Copy link
Contributor

@ycombinator That is correct. This is a general and reoccurring issue between MB and internal collection. We opened a ticket to discuss how to make this better: #34940

@peterschretlen
Copy link
Contributor

@Bamieh I was able to fix the functional test failure with these changes:
5dc9aa6

Copy link
Member

@pickypg pickypg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending CI passing.

Copy link
Contributor

@peterschretlen peterschretlen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@elasticmachine
Copy link
Contributor

💔 Build Failed

@elasticmachine
Copy link
Contributor

💔 Build Failed

@elasticmachine
Copy link
Contributor

💚 Build Succeeded

@Bamieh Bamieh merged commit 27de17a into elastic:master Apr 27, 2019
Bamieh added a commit that referenced this pull request Apr 27, 2019
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <[email protected]>

* update mappings
Bamieh added a commit that referenced this pull request Apr 27, 2019
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <[email protected]>

* update mappings
Bamieh added a commit that referenced this pull request Apr 27, 2019
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <[email protected]>

* update mappings
peterschretlen pushed a commit that referenced this pull request Apr 27, 2019
* Only collect usage data once a day from kibana monitoring
peterschretlen pushed a commit that referenced this pull request Apr 27, 2019
* Monitoring telemetry collection interval (#34609)

* Only collect usage data once a day from kibana monitoring
Bamieh added a commit that referenced this pull request Apr 27, 2019
* Only collect usage data once a day from kibana monitoring

* isUsageCollector

* bulk uploader tests

* linting

* condition filter

* checkout yarn.lock from master

* update mappings for functional tests

* debugging for refactoring

* support legacy mappings

* self code review

* isUsageCollector

* fix mock

* live coding session with pickypg

* self code review

* Update x-pack/plugins/xpack_main/server/lib/telemetry/monitoring/get_high_level_stats.js

Co-Authored-By: Bamieh <[email protected]>

* update mappings
@ycombinator
Copy link
Contributor

I noticed that this PR has been labeled with v7.0.1. The 7.0.0 branch backport of this PR was merged a day ago, however the latest BC for 7.0.1 was made 2 days ago. Are we expecting a respin of 7.0.1 or should this PR be relabeled with v7.0.2 instead of v7.0.1?

@pickypg
Copy link
Member

pickypg commented Apr 28, 2019

Assuming we get a response respin (autocorrect!) of 6.7.2, we’ll get one for 7.0.1 as well.

@Bamieh
Copy link
Member Author

Bamieh commented Apr 29, 2019

Notes for QA

  1. Monitoring was collecting kibana usage from telemetry usage stats every 10 seconds, now it collects it every 1 day and saves it into the monitoring index every 1 day.
  2. Monitoring still saves other stats every 10 seconds. Which means you should see a new monitoring document every 10seconds but without the kibana usage stats. you should only see a doc containing kibana usage stats once per day
  3. when telemetry wants to report data while monitoring is enabled, it hits the monitoring index and filters for the document that contains the usage stats, and uses that to report the usage data.

Some cases for when monitoring will put the kibana_stats.usage along with the rest of the monitoring data.

  1. server off + monitoring enabled: when it turns on, it puts kibana_stats.usage into a doc, next time after 24 hours
  2. server on + monitoring disabled: when monitoring is turned on, it puts kibana_stats.usage into the first doc, 24 hours
  3. server restart + monitoring enabled: puts a doc with kibana_stats.usage regardless of last insert, next time after 24 hours
  4. server on + monitoring enabled + 24 hours passed: monitoring will put a new doc with kibana_stats.usage and restart the clock for the next 24 hours.
  5. From the browser telemetry will report kibana_stats.usage from last doc in monitoring index that contains it (you can check by going to the advanced settings > example of what we collect instead of waiting 24 hours).
  6. The token for browser tracking last time reported to telemetry is different from the monitoring token for tracking last time monitoring reported usage data to monitoring index (time is not synced between the two as they not related)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker Feature:Telemetry release_note:fix Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Monitoring Stack Monitoring team v6.7.2 v7.0.1 v7.2.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants