
Cherry-pick #17719 to 7.x: [Metricbeat] Add aggregation aligner as a config param for stackdriver metricset in GCP #17979

Merged 1 commit on Apr 25, 2020

kaiyan-sheng (Contributor) commented:
Cherry-pick of PR #17719 to 7.x branch. Original message:

What does this PR do?

  • Adds an `aligner` config parameter for the stackdriver metricset under `metrics`.
  • Adds a suffix to metric names to show which aligner is used. For example: `cpu.utilization.avg` for ALIGN_MEAN, `cpu.utilization.sum` for ALIGN_SUM, and `cpu.utilization.value` for ALIGN_NONE.
  • Takes into account that GCP has an ingest delay before monitoring metrics show up in Stackdriver.
  • Allows users to specify a collection period as low as 1 minute instead of 5 minutes.
  • Fixes `metricbeat/docs/fields.asciidoc` to include mappings from the googlecloud module.
  • Removes `data-*.json` files that are not generated by the `TestData` integration test for each metricset.
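The aligner-to-suffix naming can be sketched as follows. This is an illustration only: the function name and the prefix-stripping rule are assumptions, not the module's actual code.

```python
# Map a Stackdriver aggregation aligner to the suffix appended to metric
# names (cpu.utilization.avg, cpu.utilization.sum, cpu.utilization.value).
ALIGNER_SUFFIXES = {
    "ALIGN_NONE": "value",
    "ALIGN_MEAN": "avg",
    "ALIGN_SUM": "sum",
}

def metric_name_with_suffix(metric_type: str, aligner: str) -> str:
    """Build an event field name from a GCP metric type and an aligner.

    e.g. "compute.googleapis.com/instance/cpu/utilization" with ALIGN_MEAN
    becomes "cpu.utilization.avg". Dropping the first two path components
    (domain and resource) is a guess for illustration.
    """
    path = metric_type.split("/")[2:]  # e.g. ["cpu", "utilization"]
    return ".".join(path + [ALIGNER_SUFFIXES[aligner]])
```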

Why is it important?

The ListMetricDescriptors API is called once at module start to get the sample period and ingest delay for each metric type. If the sample period is smaller than the collection period, aggregation is applied in the ListTimeSeries API call. By default, the aligner is ALIGN_NONE, which means that if the user sets the Metricbeat collection period to 5m and the metric type's sample period is 60s, Metricbeat returns 5 raw data points (one per minute) in a single ListTimeSeries call. This significantly reduces cost if the user does not mind the extra delay. To return only one aggregated value per collection period, an aligner such as ALIGN_MEAN or ALIGN_SUM can be specified.
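The trade-off above can be sketched with a small helper (the name `plan_request` and the returned dict shape are hypothetical, a simplification of the module's logic):

```python
def plan_request(sample_period_s: int, collection_period_s: int,
                 aligner: str = "ALIGN_NONE") -> dict:
    """Decide whether a ListTimeSeries call should aggregate.

    With the default ALIGN_NONE, one call over a 5m collection period for a
    metric sampled every 60s returns 5 raw points (one per minute). With an
    aligner such as ALIGN_MEAN, the same window collapses to 1 point.
    """
    if aligner == "ALIGN_NONE":
        # No aggregation: expect one raw point per sample period.
        points = collection_period_s // sample_period_s
        return {"aggregate": False, "expected_points": points}
    # Aggregation: one aligned point per collection period.
    return {"aggregate": True, "per_series_aligner": aligner,
            "alignment_period_s": collection_period_s, "expected_points": 1}
```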

Monitoring collects one measurement each minute (the sampling rate), but it can take up to 4 minutes before the data can be retrieved (latency). To make sure collection succeeds, we shift both startTime and endTime back by the ingest delay on every request. Instead of hardcoding the ingest delay to 4 minutes, it is obtained from the ListMetricDescriptors API for each metric type.

Assume ingest delay = 4 minutes, sample period = 1 minute, and collection period = 1 minute. When querying the GCP timeSeries.list API, the time interval becomes:

| current timestamp | startTime | endTime |
|-------------------|-----------|---------|
| 01:00             | 00:55     | 00:56   |
| 01:01             | 00:56     | 00:57   |
| 01:02             | 00:57     | 00:58   |
| 01:03             | 00:58     | 00:59   |
| 01:04             | 00:59     | 01:00   |

Therefore, data collection always has a delay. This is consistent with the monitoring graphs in the GCP portal.

Assume ingest delay = 4 minutes, sample period = 5 minutes, aggregation aligner = ALIGN_MEAN, and collection period = 5 minutes. When querying the GCP timeSeries.list API, the time interval becomes:

| current timestamp | startTime | endTime |
|-------------------|-----------|---------|
| 01:00             | 00:51     | 00:56   |
| 01:05             | 00:56     | 01:01   |
| 01:10             | 01:01     | 01:06   |
| 01:15             | 01:06     | 01:11   |
| 01:20             | 01:11     | 01:16   |
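Both tables follow the same rule: endTime = current timestamp − ingest delay, and startTime = endTime − collection period. A minimal sketch (the function name is hypothetical):

```python
from datetime import datetime, timedelta

def query_window(now: datetime, ingest_delay: timedelta,
                 collection_period: timedelta) -> tuple:
    """Compute the (startTime, endTime) interval for timeSeries.list.

    Both endpoints are shifted back by the ingest delay so the requested
    window only covers data GCP has already ingested.
    """
    end = now - ingest_delay
    start = end - collection_period
    return start, end

# First table:  delay=4m, period=1m; at 01:00 -> (00:55, 00:56)
# Second table: delay=4m, period=5m; at 01:05 -> (00:56, 01:01)
```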

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Two test cases here:

Use the config below and you should see 5 metrics every 5 minutes:

```yaml
- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 300s
  stackdriver:
    service: compute
    metrics:
      - aligner: ALIGN_MEAN
        metric_types:
          - "compute.googleapis.com/instance/cpu/usage_time"
          - "compute.googleapis.com/instance/cpu/utilization"
      - aligner: ALIGN_SUM
        metric_types:
          - "compute.googleapis.com/instance/uptime"
```

[Screenshot: Screen Shot 2020-04-16 at 4 24 29 PM]

and the output event looks like this:

```json
{
  "_source": {
    "@timestamp": "2020-04-21T22:18:48.000Z",
    "googlecloud": {
      "stackdriver": {
        "instance": {
          "uptime": {
            "sum": 300
          },
          "cpu": {
            "usage_time": {
              "avg": 136.33796208361164
            },
            "utilization": {
              "avg": 0.5680748420150485
            }
          }
        }
      }
    }
  }
}
```

Use the config below and you should see 1 metric every 1 minute:

```yaml
- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 60s
  stackdriver:
    service: compute
    metrics:
      - metric_types:
          - "compute.googleapis.com/instance/cpu/usage_time"
          - "compute.googleapis.com/instance/cpu/utilization"
          - "compute.googleapis.com/instance/uptime"
```

[Screenshot: Screen Shot 2020-04-16 at 4 36 14 PM]

and the output event looks like this:

```json
{
  "_source": {
    "@timestamp": "2020-04-21T22:31:00.000Z",
    "cloud.availability_zone": "europe-west1-c",
    "googlecloud": {
      "stackdriver": {
        "instance": {
          "uptime": {
            "raw": 60
          },
          "cpu": {
            "usage_time": {
              "raw": 148.0562956505455
            },
            "utilization": {
              "raw": 0.616901231877273
            }
          }
        }
      }
    }
  }
}
```

Related issues

TODOs

This PR is getting too big, so I will list the things that need to be done in separate PRs:

```yaml
- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 60s
  metrics:
    - service: compute
      metric_types:
        - "compute.googleapis.com/instance/cpu/usage_time"
        - "compute.googleapis.com/instance/cpu/utilization"
        - "compute.googleapis.com/instance/uptime"
```
  • Improve TestData to generate data.json files for different metric types inside each metricset.
  • Pagination for ListTimeSeries API results.

…r metricset in GCP (#17719)

* Add metricDescriptor to get sample period and ingest delay time
* add aggregation for ListTimeSeriesRequest
* Add aligner into metric name suffix (eg: .avg, .sum)

(cherry picked from commit 98f02e1)
@kaiyan-sheng kaiyan-sheng self-assigned this Apr 24, 2020
@kaiyan-sheng kaiyan-sheng added the Team:Platforms Label for the Integrations - Platforms team label Apr 24, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-platforms (Team:Platforms)

@kaiyan-sheng kaiyan-sheng merged commit ec858bd into elastic:7.x Apr 25, 2020
@kaiyan-sheng kaiyan-sheng deleted the backport_17719_7.x branch April 25, 2020 23:38
Labels: backport, review, Team:Platforms