Two (supposedly identical) sets of Cloud Monitoring queries (with MQF and MQL) return (vastly) different results #289

Closed
lvaylet opened this issue Nov 2, 2022 · 2 comments · Fixed by #290
Labels
backend bug Something isn't working p/cloud_monitoring_mql p/cloud_monitoring Cloud Monitoring provider issue

lvaylet commented Nov 2, 2022

slo-generator 2.3.2
python 3.10.7

.env file:

export GAE_PROJECT_ID=slo-generator-ci-a2b4
export GAE_MODULE_ID=default 
export STACKDRIVER_HOST_PROJECT_ID=slo-generator-ci-a2b4
export DATADOG_API_KEY=fake
export DATADOG_APP_KEY=fake
export DYNATRACE_API_URL=fake
export DYNATRACE_API_TOKEN=fake
export ELASTICSEARCH_URL=fake
export PROMETHEUS_URL=fake
export PROMETHEUS_PUSHGATEWAY_URL=fake
export PUBSUB_PROJECT_ID=fake
export PUBSUB_TOPIC_NAME=fake

The MQF queries return:

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$  slo-generator compute --slo-config samples/cloud_monitoring/slo_gae_app_availability.yaml --config samples/config.yaml -t 1666995015.5144777
INFO - gae-app-availability             | 1 hour   | SLI: 73.5294 % | SLO: 95.0 % | Gap: -21.47% | BR: 5.3 / 9.0 | Alert: 0 | Good: 50       | Bad: 18      
INFO - gae-app-availability             | 12 hours | SLI: 71.6648 % | SLO: 95.0 % | Gap: -23.34% | BR: 5.7 / 3.0 | Alert: 1 | Good: 650      | Bad: 257     
INFO - gae-app-availability             | 7 days   | SLI: 11.407  % | SLO: 95.0 % | Gap: -83.59% | BR: 17.7 / 1.5 | Alert: 1 | Good: 1418     | Bad: 11013   
INFO - gae-app-availability             | 28 days  | SLI: 2.7635  % | SLO: 95.0 % | Gap: -92.24% | BR: 19.4 / 1.0 | Alert: 1 | Good: 1418     | Bad: 49893   
INFO - Run finished successfully in 8.1s.
INFO - Run summary | SLO Configs: 1 | Duration: 8.1s

The MQL queries (with the same timestamp) return:

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$ slo-generator compute -f samples/cloud_monitoring_mql/slo_gae_app_availability.yaml -c samples/config.yaml -t 1666995015.5144777
INFO - gae-app-availability             | 1 hour   | SLI: 67.5676 % | SLO: 95.0 % | Gap: -27.43% | BR: 6.5 / 9.0 | Alert: 0 | Good: 50       | Bad: 24      
INFO - gae-app-availability             | 12 hours | SLI: 68.0    % | SLO: 95.0 % | Gap: -27.0 % | BR: 6.4 / 3.0 | Alert: 1 | Good: 629      | Bad: 296     
INFO - gae-app-availability             | 7 days   | SLI: 59.3401 % | SLO: 95.0 % | Gap: -35.66% | BR: 8.1 / 1.5 | Alert: 1 | Good: 7554     | Bad: 5176    
INFO - gae-app-availability             | 28 days  | SLI: 14.6705 % | SLO: 95.0 % | Gap: -80.33% | BR: 17.1 / 1.0 | Alert: 1 | Good: 7554     | Bad: 43937   
INFO - Run finished successfully in 8.9s.
INFO - Run summary | SLO Configs: 1 | Duration: 8.9s
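
| Window | Backend | Good Events | Bad Events |
| --- | --- | --- | --- |
| 1 hour | MQF | 50 | 18 |
| 1 hour | MQL | 50 | 24 |
| 12 hours | MQF | 650 | 257 |
| 12 hours | MQL | 629 | 296 |
| 7 days | MQF | 1418 | 11013 |
| 7 days | MQL | 7554 | 5176 |
| 28 days | MQF | 1418 | 49893 |
| 28 days | MQL | 7554 | 43937 |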
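
For reference, the SLI, gap, and burn-rate figures in the logs above can be reproduced from the good and bad event counts alone. The sketch below assumes the usual good/bad ratio definitions (SLI = good / (good + bad), burn rate = (1 - SLI) / (1 - SLO)); these formulas are inferred from the logged numbers, not copied from the slo-generator source, and the helper names are hypothetical.

```python
"""Recompute SLI, gap, and burn rate from good/bad event counts (illustrative only)."""

SLO = 0.95  # goal from the SLO configs in this issue


def sli(good: int, bad: int) -> float:
    """Fraction of valid events that were good."""
    return good / (good + bad)


def gap(sli_value: float, slo: float) -> float:
    """Distance between the measured SLI and the target (negative means below target)."""
    return sli_value - slo


def burn_rate(sli_value: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows (assumed formula)."""
    return (1 - sli_value) / (1 - slo)


# (window, backend, good, bad) taken from the two runs reported in this issue.
rows = [
    ("1 hour", "MQF", 50, 18),
    ("1 hour", "MQL", 50, 24),
    ("12 hours", "MQF", 650, 257),
    ("12 hours", "MQL", 629, 296),
    ("7 days", "MQF", 1418, 11013),
    ("7 days", "MQL", 7554, 5176),
    ("28 days", "MQF", 1418, 49893),
    ("28 days", "MQL", 7554, 43937),
]

for window, backend, good, bad in rows:
    value = sli(good, bad)
    print(
        f"{window:<8} | {backend} | SLI: {value * 100:.4f} % "
        f"| Gap: {gap(value, SLO) * 100:.2f} % | BR: {burn_rate(value, SLO):.1f}"
    )
```

Since every derived figure follows from the two counts, the SLI differences between backends come entirely from the event counts each query returns.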

The SLO configs look identical though:

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: gae-app-availability
  labels:
    service_name: gae
    feature_name: app
    slo_name: availability
spec:
  description: Availability of App Engine app
  backend: cloud_monitoring
  method: good_bad_ratio
  exporters:
  - cloud_monitoring
  service_level_indicator:
    filter_good: >
      project=${GAE_PROJECT_ID}
      metric.type="appengine.googleapis.com/http/server/response_count"
      resource.type="gae_app"
      ( metric.labels.response_code = 429 OR
        metric.labels.response_code = 200 OR
        metric.labels.response_code = 201 OR
        metric.labels.response_code = 202 OR
        metric.labels.response_code = 203 OR
        metric.labels.response_code = 204 OR
        metric.labels.response_code = 205 OR
        metric.labels.response_code = 206 OR
        metric.labels.response_code = 207 OR
        metric.labels.response_code = 208 OR
        metric.labels.response_code = 226 OR
        metric.labels.response_code = 304 )
    filter_valid: >
      project=${GAE_PROJECT_ID}
      metric.type="appengine.googleapis.com/http/server/response_count"
  goal: 0.95

and for the MQL backend (samples/cloud_monitoring_mql/slo_gae_app_availability.yaml):

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: gae-app-availability
  labels:
    service_name: gae
    feature_name: app
    slo_name: availability
spec:
  description: Availability of App Engine app
  backend: cloud_monitoring_mql
  method: good_bad_ratio
  exporters:
  - cloud_monitoring
  service_level_indicator:
    filter_good: >
      fetch gae_app
      | metric 'appengine.googleapis.com/http/server/response_count'
      | filter resource.project_id == '${GAE_PROJECT_ID}'
      | filter
          metric.response_code == 429
          || metric.response_code == 200
          || metric.response_code == 201
          || metric.response_code == 202
          || metric.response_code == 203
          || metric.response_code == 204
          || metric.response_code == 205
          || metric.response_code == 206
          || metric.response_code == 207
          || metric.response_code == 208
          || metric.response_code == 226
          || metric.response_code == 304
    filter_valid: >
      fetch gae_app
      | metric 'appengine.googleapis.com/http/server/response_count'
      | filter resource.project_id == '${GAE_PROJECT_ID}'
  goal: 0.95

The same discrepancies can be observed on a different project (slo-generator-demo for example) and/or a different GAE service.

@lvaylet lvaylet self-assigned this Nov 2, 2022
@lvaylet lvaylet added bug Something isn't working p/cloud_monitoring Cloud Monitoring provider issue backend p/cloud_monitoring_mql labels Nov 2, 2022

lvaylet commented Nov 2, 2022

A standalone test script saved as test_mqf_vs_mql.py confirms the observed behavior:

"""Make sure requests to Cloud Monitoring with MQF and MQL return the same results."""
from datetime import datetime, timezone

from slo_generator.backends.cloud_monitoring import CloudMonitoringBackend
from slo_generator.backends.cloud_monitoring_mql import CloudMonitoringMqlBackend

PROJECT_ID: str = "slo-generator-demo"

mqf_backend = CloudMonitoringBackend(PROJECT_ID)
mql_backend = CloudMonitoringMqlBackend(PROJECT_ID)

# Use a specific, fixed timestamp for both queries and the same window.
# Python represents time as a floating point number of seconds since the epoch, in
# UTC, with the fractional part carrying sub-second precision.
end_time: float = 1666995015.5144777
# MQL expects dates formatted like "%Y/%m/%d %H:%M:%S" or "%Y/%m/%d-%H:%M:%S".
# Reference: https://cloud.google.com/monitoring/mql/reference#lexical-elements
# Convert in UTC explicitly so the assertion holds regardless of the local time zone.
end_time_str: str = datetime.fromtimestamp(end_time, tz=timezone.utc).strftime("%Y/%m/%d %H:%M:%S")
assert end_time_str == "2022/10/28 22:10:15"  # the format drops fractional seconds

mqf_query: str = """resource.type="gae_app"
metric.type="appengine.googleapis.com/http/server/response_count"
resource.labels.project_id="slo-generator-demo"
resource.labels.module_id="ratingservice"
( metric.labels.response_code = 429 OR
  metric.labels.response_code = 200 OR
  metric.labels.response_code = 201 OR
  metric.labels.response_code = 202 OR
  metric.labels.response_code = 203 OR
  metric.labels.response_code = 204 OR
  metric.labels.response_code = 205 OR
  metric.labels.response_code = 206 OR
  metric.labels.response_code = 207 OR
  metric.labels.response_code = 208 OR
  metric.labels.response_code = 226 OR
  metric.labels.response_code = 304 )
"""
mql_query: str = """fetch gae_app
| metric 'appengine.googleapis.com/http/server/response_count'
| filter resource.project_id == 'slo-generator-demo'
| filter resource.module_id == 'ratingservice'
| filter
    metric.response_code == 429
    || metric.response_code == 200
    || metric.response_code == 201
    || metric.response_code == 202
    || metric.response_code == 203
    || metric.response_code == 204
    || metric.response_code == 205
    || metric.response_code == 206
    || metric.response_code == 207
    || metric.response_code == 208
    || metric.response_code == 226
    || metric.response_code == 304
"""

for window in [3600, 24 * 3600, 7 * 24 * 3600, 28 * 24 * 3600, 6 * 28 * 24 * 3600]:
    mqf_count = mqf_backend.count(list(mqf_backend.query(end_time, window, mqf_query)))
    mql_count = mql_backend.count(list(mql_backend.query(mql_query, window)))
    print(f"window = {window}s | MQF: {mqf_count} | MQL: {mql_count}")

The output is:

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$ python test_mqf_vs_mql.py
window = 3600s | MQF: 1124 | MQL: 1139
window = 86400s | MQF: 27371 | MQL: 27398
window = 604800s | MQF: 191674 | MQL: 191654
window = 2419200s | MQF: 767616 | MQL: 767165
window = 14515200s | MQF: 1401063 | MQL: 1535099
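
Note that the script computes `end_time_str` but calls `mql_backend.query(mql_query, window)` without it, so the MQL window may not end at the same instant as the MQF one. A minimal sketch of pinning an MQL query to a fixed end time follows, assuming MQL's `within` operation accepts a duration followed by a `d'...'` date literal. The helper name `pin_mql_window` is hypothetical, this is not necessarily what #290 changed, and how MQL interprets the time zone of the literal should be checked against the MQL reference.

```python
"""Append an explicit time window to an MQL query (illustrative sketch)."""
from datetime import datetime, timezone


def pin_mql_window(query: str, end_time: float, window_s: int) -> str:
    """Return `query` with a `within` clause ending at `end_time` (epoch seconds).

    Assumption: `| within <duration>, d'<date>'` selects the window of the given
    duration that ends at the given date.
    """
    end_str = datetime.fromtimestamp(end_time, tz=timezone.utc).strftime(
        "%Y/%m/%d %H:%M:%S"
    )
    return f"{query.rstrip()}\n| within {window_s}s, d'{end_str}'"


if __name__ == "__main__":
    base_query = (
        "fetch gae_app\n"
        "| metric 'appengine.googleapis.com/http/server/response_count'\n"
    )
    print(pin_mql_window(base_query, 1666995015.5144777, 3600))
```

With the end time pinned this way, both backends would be asked about the same interval, which is a precondition for comparing their counts.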


lvaylet commented Nov 2, 2022

The changes from #290 seem to fix the issue. The small discrepancies observed with 7-day and 6-month windows (1 event and 5 events, respectively) represent a negligible percentage of the actual number of events.

(venv) user@workstation-l9pyhi6x:~/workspace/github/google/slo-generator$ python test_mqf_vs_mql.py
window = 3600s | MQF: 1124 | MQL: 1124
window = 86400s | MQF: 27371 | MQL: 27371
window = 604800s | MQF: 191674 | MQL: 191675
window = 2419200s | MQF: 767616 | MQL: 767616
window = 14515200s | MQF: 1401063 | MQL: 1401068
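
To put "negligible" in numbers, the relative gap between the two backends can be computed from the two script outputs reported in this issue. This is only a sketch: the counts are copied from the outputs above, and `relative_gap` is a hypothetical helper that uses the MQF count as the reference.

```python
"""Relative MQF/MQL discrepancy per window, before and after the changes from #290."""

# window in seconds -> (MQF count, MQL count), copied from the outputs in this issue.
before = {
    3600: (1124, 1139),
    86400: (27371, 27398),
    604800: (191674, 191654),
    2419200: (767616, 767165),
    14515200: (1401063, 1535099),
}
after = {
    3600: (1124, 1124),
    86400: (27371, 27371),
    604800: (191674, 191675),
    2419200: (767616, 767616),
    14515200: (1401063, 1401068),
}


def relative_gap(mqf: int, mql: int) -> float:
    """Absolute difference between the counts, relative to the MQF count."""
    return abs(mqf - mql) / mqf


for window in before:
    gap_before = relative_gap(*before[window])
    gap_after = relative_gap(*after[window])
    print(f"window = {window}s | before: {gap_before:.4%} | after: {gap_after:.4%}")
```

The largest remaining gap after the fix is 1 event out of 191674 on the 7-day window, whereas the widest window differed by 134036 events before it.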
