
Cherry-pick #17719 to 7.x: [Metricbeat] Add aggregation aligner as a config param for stackdriver metricset in GCP #17979

Merged 1 commit on Apr 25, 2020

kaiyan-sheng (Contributor) commented:
Cherry-pick of PR #17719 to 7.x branch. Original message:

What does this PR do?

  • Adds an `aligner` config parameter for the stackdriver metricset under `metrics`.
  • Adds a suffix to metric names to show which aligner is used. For example: `cpu.utilization.avg` for ALIGN_MEAN, `cpu.utilization.sum` for ALIGN_SUM, and `cpu.utilization.value` for ALIGN_NONE.
  • Takes into account that GCP has an ingest delay before monitoring metrics show up in Stackdriver.
  • Allows users to specify a collection period as low as 1 minute instead of 5 minutes.
  • Fixes `metricbeat/docs/fields.asciidoc` to include mappings from the googlecloud module.
  • Removes `data-*.json` files that are not generated by the `TestData` integration test for each metricset.
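The aligner-to-suffix naming can be sketched as follows. This is an illustration only: the function name and the prefix-stripping rule are assumptions, not the module's actual code.

```python
# Map a Stackdriver aggregation aligner to the suffix appended to metric
# names (cpu.utilization.avg, cpu.utilization.sum, cpu.utilization.value).
ALIGNER_SUFFIXES = {
    "ALIGN_NONE": "value",
    "ALIGN_MEAN": "avg",
    "ALIGN_SUM": "sum",
}

def metric_name_with_suffix(metric_type: str, aligner: str) -> str:
    """Build an event field name from a GCP metric type and an aligner.

    e.g. "compute.googleapis.com/instance/cpu/utilization" with ALIGN_MEAN
    becomes "cpu.utilization.avg". Dropping the first two path components
    (domain and resource) is a guess for illustration.
    """
    path = metric_type.split("/")[2:]  # e.g. ["cpu", "utilization"]
    return ".".join(path + [ALIGNER_SUFFIXES[aligner]])
```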

Why is it important?

The ListMetricDescriptors API is called once at module start to get the sample period and ingest delay for each metric type. If the sample period is smaller than the collection period, aggregation is applied in the ListTimeSeries API call. By default, the aligner is ALIGN_NONE, which means that if the user sets the Metricbeat collection period to 5m and the metric type's sample period is 60s, Metricbeat returns 5 raw data points (one per minute) in a single ListTimeSeries call. This significantly reduces cost if the user does not mind the extra delay. To return only one aggregated value per collection period, an aligner such as ALIGN_MEAN or ALIGN_SUM can be specified.
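The trade-off above can be sketched with a small helper (the name `plan_request` and the returned dict shape are hypothetical, a simplification of the module's logic):

```python
def plan_request(sample_period_s: int, collection_period_s: int,
                 aligner: str = "ALIGN_NONE") -> dict:
    """Decide whether a ListTimeSeries call should aggregate.

    With the default ALIGN_NONE, one call over a 5m collection period for a
    metric sampled every 60s returns 5 raw points (one per minute). With an
    aligner such as ALIGN_MEAN, the same window collapses to 1 point.
    """
    if aligner == "ALIGN_NONE":
        # No aggregation: expect one raw point per sample period.
        points = collection_period_s // sample_period_s
        return {"aggregate": False, "expected_points": points}
    # Aggregation: one aligned point per collection period.
    return {"aggregate": True, "per_series_aligner": aligner,
            "alignment_period_s": collection_period_s, "expected_points": 1}
```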

Monitoring collects one measurement each minute (the sampling rate), but it can take up to 4 minutes before the data can be retrieved (latency). To make sure collection succeeds, we shift both startTime and endTime back by the ingest delay on every request. Instead of hardcoding the ingest delay to 4 minutes, it is obtained from the ListMetricDescriptors API for each metric type.

Assume ingest delay = 4 minutes, sample period = 1 minute, and collection period = 1 minute. When querying the GCP timeSeries.list API, the time interval becomes:

| current timestamp | startTime | endTime |
|-------------------|-----------|---------|
| 01:00             | 00:55     | 00:56   |
| 01:01             | 00:56     | 00:57   |
| 01:02             | 00:57     | 00:58   |
| 01:03             | 00:58     | 00:59   |
| 01:04             | 00:59     | 01:00   |

Therefore, data collection always has a delay. This is consistent with the monitoring graphs in the GCP portal.

Assume ingest delay = 4 minutes, sample period = 5 minutes, aggregation aligner = ALIGN_MEAN, and collection period = 5 minutes. When querying the GCP timeSeries.list API, the time interval becomes:

| current timestamp | startTime | endTime |
|-------------------|-----------|---------|
| 01:00             | 00:51     | 00:56   |
| 01:05             | 00:56     | 01:01   |
| 01:10             | 01:01     | 01:06   |
| 01:15             | 01:06     | 01:11   |
| 01:20             | 01:11     | 01:16   |
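Both tables follow the same rule: endTime = current timestamp − ingest delay, and startTime = endTime − collection period. A minimal sketch (the function name is hypothetical):

```python
from datetime import datetime, timedelta

def query_window(now: datetime, ingest_delay: timedelta,
                 collection_period: timedelta) -> tuple:
    """Compute the (startTime, endTime) interval for timeSeries.list.

    Both endpoints are shifted back by the ingest delay so the requested
    window only covers data GCP has already ingested.
    """
    end = now - ingest_delay
    start = end - collection_period
    return start, end

# First table:  delay=4m, period=1m; at 01:00 -> (00:55, 00:56)
# Second table: delay=4m, period=5m; at 01:05 -> (00:56, 01:01)
```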

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Two test cases here:

Use the config below and you should see 5 metrics every 5 minutes:

```yaml
- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 300s
  stackdriver:
    service: compute
    metrics:
      - aligner: ALIGN_MEAN
        metric_types:
          - "compute.googleapis.com/instance/cpu/usage_time"
          - "compute.googleapis.com/instance/cpu/utilization"
      - aligner: ALIGN_SUM
        metric_types:
          - "compute.googleapis.com/instance/uptime"
```

[Screenshot: Screen Shot 2020-04-16 at 4 24 29 PM]

and the output event looks like this:

```json
{
  "_source": {
    "@timestamp": "2020-04-21T22:18:48.000Z",
    "googlecloud": {
      "stackdriver": {
        "instance": {
          "uptime": {
            "sum": 300
          },
          "cpu": {
            "usage_time": {
              "avg": 136.33796208361164
            },
            "utilization": {
              "avg": 0.5680748420150485
            }
          }
        }
      }
    }
  }
}
```

Use the config below and you should see 1 metric every 1 minute:

```yaml
- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 60s
  stackdriver:
    service: compute
    metrics:
      - metric_types:
          - "compute.googleapis.com/instance/cpu/usage_time"
          - "compute.googleapis.com/instance/cpu/utilization"
          - "compute.googleapis.com/instance/uptime"
```

[Screenshot: Screen Shot 2020-04-16 at 4 36 14 PM]

and the output event looks like this:

```json
{
  "_source": {
    "@timestamp": "2020-04-21T22:31:00.000Z",
    "cloud.availability_zone": "europe-west1-c",
    "googlecloud": {
      "stackdriver": {
        "instance": {
          "uptime": {
            "raw": 60
          },
          "cpu": {
            "usage_time": {
              "raw": 148.0562956505455
            },
            "utilization": {
              "raw": 0.616901231877273
            }
          }
        }
      }
    }
  }
}
```

Related issues

TODOs

This PR is getting too big, so I will list the things that need to be done in separate PRs:

```yaml
- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 60s
  metrics:
    - service: compute
      metric_types:
        - "compute.googleapis.com/instance/cpu/usage_time"
        - "compute.googleapis.com/instance/cpu/utilization"
        - "compute.googleapis.com/instance/uptime"
```
  • Improve TestData to generate data.json files for different metric types inside each metricset.
  • Pagination for ListTimeSeries API results.

…r metricset in GCP (#17719)

* Add metricDescriptor to get sample period and ingest delay time
* add aggregation for ListTimeSeriesRequest
* Add aligner into metric name suffix (eg: .avg, .sum)

(cherry picked from commit 98f02e1)
@kaiyan-sheng kaiyan-sheng self-assigned this Apr 24, 2020
@kaiyan-sheng kaiyan-sheng added the Team:Platforms Label for the Integrations - Platforms team label Apr 24, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-platforms (Team:Platforms)

@kaiyan-sheng kaiyan-sheng merged commit ec858bd into elastic:7.x Apr 25, 2020
@kaiyan-sheng kaiyan-sheng deleted the backport_17719_7.x branch April 25, 2020 23:38
Labels: backport, review, Team:Platforms