Two (supposedly identical) sets of Cloud Monitoring queries (with MQF and MQL) return (vastly) different results #289

lvaylet · 2022-11-02T19:30:31Z

slo-generator 2.3.2
python 3.10.7

.env file:

export GAE_PROJECT_ID=slo-generator-ci-a2b4
export GAE_MODULE_ID=default 
export STACKDRIVER_HOST_PROJECT_ID=slo-generator-ci-a2b4
export DATADOG_API_KEY=fake
export DATADOG_APP_KEY=fake
export DYNATRACE_API_URL=fake
export DYNATRACE_API_TOKEN=fake
export ELASTICSEARCH_URL=fake
export PROMETHEUS_URL=fake
export PROMETHEUS_PUSHGATEWAY_URL=fake
export PUBSUB_PROJECT_ID=fake
export PUBSUB_TOPIC_NAME=fake

The MQF queries return:

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$  slo-generator compute --slo-config samples/cloud_monitoring/slo_gae_app_availability.yaml --config samples/config.yaml -t 1666995015.5144777
INFO - gae-app-availability             | 1 hour   | SLI: 73.5294 % | SLO: 95.0 % | Gap: -21.47% | BR: 5.3 / 9.0 | Alert: 0 | Good: 50       | Bad: 18      
INFO - gae-app-availability             | 12 hours | SLI: 71.6648 % | SLO: 95.0 % | Gap: -23.34% | BR: 5.7 / 3.0 | Alert: 1 | Good: 650      | Bad: 257     
INFO - gae-app-availability             | 7 days   | SLI: 11.407  % | SLO: 95.0 % | Gap: -83.59% | BR: 17.7 / 1.5 | Alert: 1 | Good: 1418     | Bad: 11013   
INFO - gae-app-availability             | 28 days  | SLI: 2.7635  % | SLO: 95.0 % | Gap: -92.24% | BR: 19.4 / 1.0 | Alert: 1 | Good: 1418     | Bad: 49893   
INFO - Run finished successfully in 8.1s.
INFO - Run summary | SLO Configs: 1 | Duration: 8.1s

The MQL queries (with the same timestamp) return:

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$ slo-generator compute -f samples/cloud_monitoring_mql/slo_gae_app_availability.yaml -c samples/config.yaml -t 1666995015.5144777
INFO - gae-app-availability             | 1 hour   | SLI: 67.5676 % | SLO: 95.0 % | Gap: -27.43% | BR: 6.5 / 9.0 | Alert: 0 | Good: 50       | Bad: 24      
INFO - gae-app-availability             | 12 hours | SLI: 68.0    % | SLO: 95.0 % | Gap: -27.0 % | BR: 6.4 / 3.0 | Alert: 1 | Good: 629      | Bad: 296     
INFO - gae-app-availability             | 7 days   | SLI: 59.3401 % | SLO: 95.0 % | Gap: -35.66% | BR: 8.1 / 1.5 | Alert: 1 | Good: 7554     | Bad: 5176    
INFO - gae-app-availability             | 28 days  | SLI: 14.6705 % | SLO: 95.0 % | Gap: -80.33% | BR: 17.1 / 1.0 | Alert: 1 | Good: 7554     | Bad: 43937   
INFO - Run finished successfully in 8.9s.
INFO - Run summary | SLO Configs: 1 | Duration: 8.9s

Window    Backend  Good events  Bad events
1 hour    MQF      50           18
1 hour    MQL      50           24
12 hours  MQF      650          257
12 hours  MQL      629          296
7 days    MQF      1418         11013
7 days    MQL      7554         5176
28 days   MQF      1418         49893
28 days   MQL      7554         43937
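
The SLI and burn-rate figures in the log lines can be re-derived from the good/bad counts. The sketch below assumes SLI = good / (good + bad) and burn rate = (1 - SLI) / (1 - goal); these formulas are an assumption that reproduces the numbers printed above, not code taken from slo-generator. The function names are illustrative.

```python
# Re-derive SLI and burn rate from the good/bad event counts in the table.
# Assumed formulas (they match the log output above):
#   SLI       = good / (good + bad)
#   burn rate = (1 - SLI) / (1 - goal)

GOAL = 0.95  # "goal" from the SLO config

def sli(good: int, bad: int) -> float:
    """Fraction of valid events that were good."""
    return good / (good + bad)

def burn_rate(good: int, bad: int, goal: float = GOAL) -> float:
    """Observed error rate divided by the error budget (1 - goal)."""
    return (1 - sli(good, bad)) / (1 - goal)

# (window, backend) -> (good, bad), copied from the table above.
counts = {
    ("1 hour", "MQF"): (50, 18),
    ("1 hour", "MQL"): (50, 24),
    ("12 hours", "MQF"): (650, 257),
    ("12 hours", "MQL"): (629, 296),
    ("7 days", "MQF"): (1418, 11013),
    ("7 days", "MQL"): (7554, 5176),
    ("28 days", "MQF"): (1418, 49893),
    ("28 days", "MQL"): (7554, 43937),
}

for (window, backend), (good, bad) in counts.items():
    print(
        f"{window:<8} | {backend} | SLI: {sli(good, bad) * 100:.4f} % "
        f"| BR: {burn_rate(good, bad):.1f}"
    )
```

Since the SLI follows directly from the counts, the divergence between the two backends comes from the event counts returned by the queries, not from the SLI computation.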

The SLO configs look identical though:

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: gae-app-availability
  labels:
    service_name: gae
    feature_name: app
    slo_name: availability
spec:
  description: Availability of App Engine app
  backend: cloud_monitoring
  method: good_bad_ratio
  exporters:
  - cloud_monitoring
  service_level_indicator:
    filter_good: >
      project=${GAE_PROJECT_ID}
      metric.type="appengine.googleapis.com/http/server/response_count"
      resource.type="gae_app"
      ( metric.labels.response_code = 429 OR
        metric.labels.response_code = 200 OR
        metric.labels.response_code = 201 OR
        metric.labels.response_code = 202 OR
        metric.labels.response_code = 203 OR
        metric.labels.response_code = 204 OR
        metric.labels.response_code = 205 OR
        metric.labels.response_code = 206 OR
        metric.labels.response_code = 207 OR
        metric.labels.response_code = 208 OR
        metric.labels.response_code = 226 OR
        metric.labels.response_code = 304 )
    filter_valid: >
      project=${GAE_PROJECT_ID}
      metric.type="appengine.googleapis.com/http/server/response_count"
  goal: 0.95

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: gae-app-availability
  labels:
    service_name: gae
    feature_name: app
    slo_name: availability
spec:
  description: Availability of App Engine app
  backend: cloud_monitoring_mql
  method: good_bad_ratio
  exporters:
  - cloud_monitoring
  service_level_indicator:
    filter_good: >
      fetch gae_app
      | metric 'appengine.googleapis.com/http/server/response_count'
      | filter resource.project_id == '${GAE_PROJECT_ID}'
      | filter
          metric.response_code == 429
          || metric.response_code == 200
          || metric.response_code == 201
          || metric.response_code == 202
          || metric.response_code == 203
          || metric.response_code == 204
          || metric.response_code == 205
          || metric.response_code == 206
          || metric.response_code == 207
          || metric.response_code == 208
          || metric.response_code == 226
          || metric.response_code == 304
    filter_valid: >
      fetch gae_app
      | metric 'appengine.googleapis.com/http/server/response_count'
      | filter resource.project_id == '${GAE_PROJECT_ID}'
  goal: 0.95

The same discrepancies can be observed on a different project (slo-generator-demo for example) and/or a different GAE service.

lvaylet · 2022-11-02T19:47:02Z

A standalone test script saved as test_mqf_vs_mql.py confirms the observed behavior:

"""Make sure requests to Cloud Monitoring with MQF and MQL return the same results."""
from datetime import datetime, timezone

from slo_generator.backends.cloud_monitoring import CloudMonitoringBackend
from slo_generator.backends.cloud_monitoring_mql import CloudMonitoringMqlBackend

PROJECT_ID: str = "slo-generator-demo"

mqf_backend = CloudMonitoringBackend(PROJECT_ID)
mql_backend = CloudMonitoringMqlBackend(PROJECT_ID)

# Use a specific, fixed timestamp for both queries and the same window.
# Python represents time as a floating point number of seconds since the epoch (UTC);
# the fractional part carries the sub-second precision.
end_time: float = 1666995015.5144777
# MQL expects dates formatted like "%Y/%m/%d %H:%M:%S" or "%Y/%m/%d-%H:%M:%S".
# Reference: https://cloud.google.com/monitoring/mql/reference#lexical-elements
# Convert in UTC explicitly so the assertion below holds in any local time zone.
end_time_str: str = datetime.fromtimestamp(end_time, tz=timezone.utc).strftime(
    "%Y/%m/%d %H:%M:%S"
)
assert end_time_str == "2022/10/28 22:10:15"  # this format drops the sub-second part

mqf_query: str = """resource.type="gae_app"
metric.type="appengine.googleapis.com/http/server/response_count"
resource.labels.project_id="slo-generator-demo"
resource.labels.module_id="ratingservice"
( metric.labels.response_code = 429 OR
  metric.labels.response_code = 200 OR
  metric.labels.response_code = 201 OR
  metric.labels.response_code = 202 OR
  metric.labels.response_code = 203 OR
  metric.labels.response_code = 204 OR
  metric.labels.response_code = 205 OR
  metric.labels.response_code = 206 OR
  metric.labels.response_code = 207 OR
  metric.labels.response_code = 208 OR
  metric.labels.response_code = 226 OR
  metric.labels.response_code = 304 )
"""
mql_query: str = """fetch gae_app
| metric 'appengine.googleapis.com/http/server/response_count'
| filter resource.project_id == 'slo-generator-demo'
| filter resource.module_id == 'ratingservice'
| filter
    metric.response_code == 429
    || metric.response_code == 200
    || metric.response_code == 201
    || metric.response_code == 202
    || metric.response_code == 203
    || metric.response_code == 204
    || metric.response_code == 205
    || metric.response_code == 206
    || metric.response_code == 207
    || metric.response_code == 208
    || metric.response_code == 226
    || metric.response_code == 304
"""

for window in [3600, 24 * 3600, 7 * 24 * 3600, 28 * 24 * 3600, 6 * 28 * 24 * 3600]:
    mqf_count = mqf_backend.count(list(mqf_backend.query(end_time, window, mqf_query)))
    mql_count = mql_backend.count(list(mql_backend.query(mql_query, window)))
    print(f"window = {window}s | MQF: {mqf_count} | MQL: {mql_count}")

The output is:

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$ python test_mqf_vs_mql.py
window = 3600s | MQF: 1124 | MQL: 1139
window = 86400s | MQF: 27371 | MQL: 27398
window = 604800s | MQF: 191674 | MQL: 191654
window = 2419200s | MQF: 767616 | MQL: 767165
window = 14515200s | MQF: 1401063 | MQL: 1535099
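
To put these gaps in proportion, the relative difference per window can be computed from the output above. This is a small sketch; the helper name `relative_diff_pct` is illustrative, and the MQF count is arbitrarily taken as the reference.

```python
# Relative difference between the MQF and MQL counts printed above,
# using the MQF count as the reference value.

def relative_diff_pct(mqf: int, mql: int) -> float:
    """Absolute difference between the two counts, as a percentage of the MQF count."""
    return abs(mql - mqf) / mqf * 100

# window (seconds) -> (MQF count, MQL count), copied from the script output.
results = {
    3600: (1124, 1139),
    86400: (27371, 27398),
    604800: (191674, 191654),
    2419200: (767616, 767165),
    14515200: (1401063, 1535099),
}

for window, (mqf, mql) in results.items():
    print(f"window = {window}s | diff: {mql - mqf:+d} | {relative_diff_pct(mqf, mql):.3f} %")
```

The sign of the difference is not consistent across windows (MQL returns more events for some windows and fewer for others), and the gap is by far the largest for the longest window.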

lvaylet · 2022-11-02T20:35:10Z

The changes from #290 seem to fix the issue. The small discrepancies that remain for the 7-day and 6-month windows (1 event and 5 events, respectively) amount to less than 0.001% of the events counted in each window.

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$ python test_mqf_vs_mql.py
window = 3600s | MQF: 1124 | MQL: 1124
window = 86400s | MQF: 27371 | MQL: 27371
window = 604800s | MQF: 191674 | MQL: 191675
window = 2419200s | MQF: 767616 | MQL: 767616
window = 14515200s | MQF: 1401063 | MQL: 1401068
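
For illustration, one way to pin an MQL query to the same time horizon as the MQF request is to append an explicit `within` clause carrying both the window and the end time. This is not the code from #290, which this thread does not show. The helper name `with_time_horizon` is hypothetical, and the sketch assumes that MQL's `within` accepts a duration followed by a `d'...'` date literal and that a date literal without an offset is read as UTC.

```python
from datetime import datetime, timezone

def with_time_horizon(query: str, end_time: float, window: int) -> str:
    """Append a `within` clause so the query covers [end_time - window, end_time].

    Hypothetical helper. Assumes MQL accepts `within <duration>, d'<date>'`
    and interprets a date literal without an offset as UTC.
    """
    # Date format taken from the MQL lexical-elements reference cited above;
    # it has one-second resolution, so the fractional part of end_time is dropped.
    end_str = datetime.fromtimestamp(end_time, tz=timezone.utc).strftime(
        "%Y/%m/%d %H:%M:%S"
    )
    return f"{query.rstrip()}\n| within {window}s, d'{end_str}'"

query = """fetch gae_app
| metric 'appengine.googleapis.com/http/server/response_count'
| filter resource.project_id == 'slo-generator-demo'
"""
print(with_time_horizon(query, 1666995015.5144777, 3600))
```

Because the date literal only has one-second resolution, a sub-second offset between the two backends would remain, which would be consistent with discrepancies of a handful of events on large windows.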

lvaylet self-assigned this Nov 2, 2022

lvaylet added bug Something isn't working p/cloud_monitoring Cloud Monitoring provider issue backend p/cloud_monitoring_mql labels Nov 2, 2022

lvaylet linked a pull request Nov 2, 2022 that will close this issue

fix: compute the time horizon of MQL requests more accurately so they return the same results as MQF requests #290

Merged

lvaylet closed this as completed in #290 Nov 2, 2022
