Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metricbeat] Add aggregation aligner as a config param for stackdriver metricset in GCP #17719

Merged
merged 19 commits into from
Apr 24, 2020
Merged

[Metricbeat] Add aggregation aligner as a config param for stackdriver metricset in GCP #17719

merged 19 commits into from
Apr 24, 2020

Conversation

kaiyan-sheng
Copy link
Contributor

@kaiyan-sheng kaiyan-sheng commented Apr 15, 2020

What does this PR do?

  • This PR is to add aligner config parameter for stackdriver metriset under metrics.
  • Add suffix to metric names to show what aligner is used. For example: cpu.utilization.avg for ALIGN_MEAN, cpu.utilization.sum for ALIGN_SUM and cpu.utilization.value for ALIGN_NONE.
  • Take into account GCP has a ingest delay for monitoring metrics to show up in StackDriver.
  • Also user can specify collection period to be 1 min as the minimum instead of 5 min.
  • Fixed metricbeat/docs/fields.asciidoc to include mappings from googlecloud module.
  • Removed data-*.json files that are not generated by integration test TestData for each metricset.

Why is it important?

ListMetricDescriptors API is used to get metadata sample period and ingest delay for each metric type once at the start of this module. If sample period is smaller than the collection period, aggregation will be used in ListTimeSeries API. By default, aligner is ALIGN_NONE. This means if user specify this Metricbeat collection period to be 5m and the metric type sample period is 60s, then Metricbeat will return 5 raw data points (1 for each minute) in one ListTimeSeries API call. This will save cost significantly if user does not mind the extra delay. If user wants to only return one aggregated metric per collection period, aligner can be specified, such as ALIGN_MEAN, ALIGN_SUM and etc.

Monitoring collects one measurement each minute (the sampling rate), but it can take up to 4 minutes before you can retrieve the data (latency). In order to make sure the collection is successful, we delay collection startTime and endTime for a number of minutes defined by ingest delay every time. Instead of hardcoding ingest delay to 4 minute, this is obtained from ListMetricDescriptors API for each metric type.

Assume ingest delay = 4-minute, sample period = 1-minute and collection period = 1-minute, when querying GCP API timeSeries.list, the time interval changed to:

current timestamp startTime endTime
01:00 00:55 00:56
01:01 00:56 00:57
01:02 00:57 00:58
01:03 00:58 00:59
01:04 00:59 01:00

Therefore, data collection will always have a delay. This is consistent with monitoring in GCP portal.

Assume ingest delay = 4-minute, sample period = 5-minute, aggregation aligner is ALIGN_MEAN and collection period = 5-minute, when querying GCP API timeSeries.list, the time interval changed to:

current timestamp startTime endTime
01:00 00:51 00:56
01:05 00:56 01:01
01:10 01:01 01:06
01:15 01:06 01:11
01:20 01:11 01:16

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Two test cases here:

Use config below and you should see 5 metrics every 5 minutes:

- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 300s
  stackdriver:
    service: compute
    metrics:
      - aligner: ALIGN_MEAN
        metric_types:
          - "compute.googleapis.com/instance/cpu/usage_time"
          - "compute.googleapis.com/instance/cpu/utilization"
      - aligner: ALIGN_SUM
        metric_types:
          - "compute.googleapis.com/instance/uptime"

Screen Shot 2020-04-16 at 4 24 29 PM

and output event looks like this:

{
  "_source": {
    "@timestamp": "2020-04-21T22:18:48.000Z",
    "googlecloud": {
      "stackdriver": {
        "instance": {
          "uptime": {
            "sum": 300
          },
          "cpu": {
            "usage_time": {
              "avg": 136.33796208361164
            },
            "utilization": {
              "avg": 0.5680748420150485
            }
          }
        }
      }
    }
  }
}

Use config below and you should see 1 metric every 1 minute:

- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 60s
  stackdriver:
    service: compute
    metrics:
      - metric_types:
          - "compute.googleapis.com/instance/cpu/usage_time"
          - "compute.googleapis.com/instance/cpu/utilization"
          - "compute.googleapis.com/instance/uptime"

Screen Shot 2020-04-16 at 4 36 14 PM

and output event looks like this:

{
  "_source": {
    "@timestamp": "2020-04-21T22:31:00.000Z",
    "cloud.availability_zone": "europe-west1-c",
    "googlecloud": {
      "stackdriver": {
        "instance": {
          "uptime": {
            "raw": 60
          },
          "cpu": {
            "usage_time": {
              "raw": 148.0562956505455
            },
            "utilization": {
              "raw": 0.616901231877273
            }
          }
        }
      }
    }
  }
}

Related issues

TODOs

This PR is getting too big so I will list things need to be done in separate PRs:

- module: googlecloud
  metricsets:
    - stackdriver
  zone: "europe-west1-c"
  project_id: elastic-observability
  credentials_file_path: "/Users/kaiyansheng/Downloads/elastic-observability.json"
  exclude_labels: false
  period: 60s
  metrics:
    - service: compute
      metric_types:
        - "compute.googleapis.com/instance/cpu/usage_time"
        - "compute.googleapis.com/instance/cpu/utilization"
        - "compute.googleapis.com/instance/uptime"
  • Improve TestData to generate data.json files for different metric types inside each metricset.
  • Pagination for ListTimeSeries API results.

@kaiyan-sheng kaiyan-sheng self-assigned this Apr 15, 2020
@kaiyan-sheng kaiyan-sheng added Metricbeat Metricbeat needs_backport PR is waiting to be backported to other branches. Team:Platforms Label for the Integrations - Platforms team test-plan Add this PR to be manual test plan labels Apr 15, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@kaiyan-sheng kaiyan-sheng requested review from sayden and exekias April 15, 2020 15:45
@kaiyan-sheng kaiyan-sheng added the in progress Pull request is currently in progress. label Apr 15, 2020
@exekias
Copy link
Contributor

exekias commented Apr 16, 2020

I like the new approach! is it working as expected? The PR description will need to be updated

@kaiyan-sheng kaiyan-sheng changed the title [Metricbeat] Collect one metric per collection period for GCP [Metricbeat] Add aggregation aligner as a config param for stackdriver metricset in GCP Apr 21, 2020
@kaiyan-sheng kaiyan-sheng removed the in progress Pull request is currently in progress. label Apr 21, 2020

//stackDriverConfig holds a configuration specific for stackdriver metricset.
type stackDriverConfig struct {
MetricTypes []string `config:"metric_types" validate:"required"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are not actually types but metric names right? how about using names here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm to be consistent with GCP, it is metric.type. metric.type is a part of metric selector, which includes its DNS name prefix. Here is an exmaple:

metric.type = "compute.googleapis.com/instance/cpu/usage_time" AND
    metric.labels.instance_name = "my-instance-name"

@kaiyan-sheng
Copy link
Contributor Author

CI failures are not seem to be related.

@kaiyan-sheng kaiyan-sheng merged commit 98f02e1 into elastic:master Apr 24, 2020
@kaiyan-sheng kaiyan-sheng deleted the gcp_collection_period branch April 24, 2020 14:11
@kaiyan-sheng kaiyan-sheng added v7.8.0 and removed needs_backport PR is waiting to be backported to other branches. labels Apr 24, 2020
kaiyan-sheng added a commit that referenced this pull request Apr 25, 2020
…r metricset in GCP (#17719) (#17979)

* Add metricDescriptor to get sample period and ingest delay time
* add aggregation for ListTimeSeriesRequest
* Add aligner into metric name suffix (eg: .avg, .sum)

(cherry picked from commit 98f02e1)
@andresrc andresrc added the test-plan-added This PR has been added to the test plan label May 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Metricbeat Metricbeat review Team:Platforms Label for the Integrations - Platforms team test-plan Add this PR to be manual test plan test-plan-added This PR has been added to the test plan v7.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Metricbeat] Investigate collection period=1m for googlecloud module
4 participants