Service Bus queue metrics are not reported correctly for entities in same namespace #1050

Closed · ChristianEder opened this issue May 25, 2020 · 12 comments · Fixed by #1060
Labels: bug (Something isn't working)

@ChristianEder (Contributor)

I am trying to scrape active messages and dead-lettered messages per queue for multiple queues in an Azure Service Bus namespace. I tried 3 different approaches, which led to 2 different results, both wrong:

  1. Use the ServiceBusQueue resource type
- name: active_messages_via_ServiceBusQueue
  description: "Count of active messages per queue / topic"
  resourceType: ServiceBusQueue
  azureMetricConfiguration:
    aggregation:
      type: Average
    metricName: ActiveMessages
  resources:
  - namespace: 'my-namespace'
    queueName: queue1  
  - namespace: 'my-namespace'
    queueName: queue2
  - namespace: 'my-namespace'
    queueName: queue3
  2. Use the Generic resource type
- name: active_messages_via_Generic
  description: "Count of active messages per queue / topic"
  resourceType: Generic
  azureMetricConfiguration:
    metricName: ActiveMessages
    aggregation:
      type: Average
  resources:
  - resourceUri: Microsoft.ServiceBus/namespaces/my-namespace
    filter: EntityName eq 'queue1'
  - resourceUri: Microsoft.ServiceBus/namespaces/my-namespace
    filter: EntityName eq 'queue2'
  - resourceUri: Microsoft.ServiceBus/namespaces/my-namespace
    filter: EntityName eq 'queue3'
  3. Use the Generic resource type without filters

Same as 2), but with only a single resource defined (the namespace itself) and no filter; a sketch of what that looks like follows.
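
Roughly like this (a sketch reconstructed from approach 2, with the metric name chosen for illustration):

- name: active_messages_via_Generic_no_filter
  description: "Count of active messages per queue / topic"
  resourceType: Generic
  azureMetricConfiguration:
    metricName: ActiveMessages
    aggregation:
      type: Average
  resources:
  # a single resource for the whole namespace, with no EntityName filter
  - resourceUri: Microsoft.ServiceBus/namespaces/my-namespace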

Expected Behavior

Per queue, I get a metric reported with labels matching the queue's name.

Actual Behavior

Approach 1) (via the ServiceBusQueue resource type) reports only the metrics for the queue that is last in the list of defined resources.

Approaches 2) and 3) (via the Generic resource type) report only a single metric, containing the sum across all queues in the namespace.

Steps to Reproduce the Problem

  1. Create a service bus namespace with multiple queues
  2. Send some messages into the queues
  3. Run a Promitor instance (I was using latest, so currently 1.6.0) configured to scrape the queue metrics using one of the approaches described above
  4. Open http://localhost:8999/metrics and compare the reported metrics with those shown in the Azure portal

My instance returns the following metrics (I configured all 3 approaches described above in the same instance, using 3 different metric names):

# HELP active_messages_via_Generic Count of active messages per queue / topic
# TYPE active_messages_via_Generic gauge
active_messages_via_Generic{resource_group="my-rg",subscription_id="my-subscription-id",resource_uri="subscriptions/my-subscription-id/resourceGroups/my-rg/providers/Microsoft.ServiceBus/namespaces/my-namespace"} 23
# HELP active_messages_via_Generic_total Count of active messages per queue / topic
# TYPE active_messages_via_Generic_total gauge
active_messages_via_Generic_total{resource_group="my-rg",subscription_id="my-subscription-id",resource_uri="subscriptions/my-subscription-id/resourceGroups/my-rg/providers/Microsoft.ServiceBus/namespaces/my-namespace"} 23
# HELP active_messages_via_ServiceBusQueue Count of active messages per queue / topic
# TYPE active_messages_via_ServiceBusQueue gauge
active_messages_via_ServiceBusQueue{resource_group="my-rg",subscription_id="my-subscription-id",resource_uri="subscriptions/my-subscription-id/resourceGroups/my-rg/providers/Microsoft.ServiceBus/namespaces/my-namespace",instance_name="queue3",entity_name="queue3"} 0
# HELP promitor_ratelimit_arm Indication how many calls are still available before Azure Resource Manager is going to throttle us.
# TYPE promitor_ratelimit_arm gauge
promitor_ratelimit_arm{tenant_id="bab8ae91-c774-40ac-9d72-282d0f55f03a",subscription_id="my-subscription-id",app_id="my-app-id"} 11998

A look into the Azure portal confirms that the count of messages is indeed 0 (as reported) for 'queue3', but the values for the other 2 queues only show up as their sum (23); in the portal I can see that one queue has 21 and the other has 2.

Portal screenshot showing that the sum is actually 23:
[screenshot]

Portal screenshot showing that one of the queues has 21:
[screenshot]

Specifications

  • Version: 1.6.0
  • Platform: Windows 10 host, Linux docker containers
@ChristianEder (Contributor, Author)

I just tested: with version 1.5.0, approach 1) using the ServiceBusQueue resource type works fine.

@tomkerkhove (Owner)

Huh that's interesting 🤔

@tomkerkhove added this to the v1.6.1 milestone on May 27, 2020
@tomkerkhove (Owner)

This seems to be working on my end, but I will dig deeper later.

@ChristianEder (Contributor, Author)

I also tried with tomkerkhove/promitor-agent-scraper-ci:pr1054, but got the same result: only the metrics for the last defined queue get reported, while with tomkerkhove/promitor-agent-scraper:1.5.0 all 3 get reported. The only difference in my configuration (besides the image tag) is how I define the metric sinks (legacy vs. new metric sinks); see the sketch below.
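
For reference, the two sink styles differ roughly like this (a sketch: the metricSinks block mirrors the runtime configuration shown later in this thread, while the legacy block follows the old pre-1.6 schema, so field names are approximate):

# legacy (pre-1.6) Prometheus configuration, schema approximate
prometheus:
  metricUnavailableValue: NaN
  scrapeEndpoint:
    baseUriPath: /metrics

# new metric sinks configuration (1.6 / pr1054 images)
metricSinks:
  prometheusScrapingEndpoint:
    metricUnavailableValue: NaN
    baseUriPath: /metrics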

Another thing that gets reported differently between the two:

The ci-pr1054 version reports promitor_ratelimit_arm AND ratelimit_arm (both have the same value), while 1.5.0 reports only promitor_ratelimit_arm

@tomkerkhove (Owner)

Thanks for the pointers, this is helpful!

@tomkerkhove (Owner) commented May 28, 2020

> The ci-pr1054 version reports promitor_ratelimit_arm AND ratelimit_arm (both have the same value), while 1.5.0 reports only promitor_ratelimit_arm

Created #1056 for that

@tomkerkhove (Owner)

When using this config:

version: v1
azureMetadata:
  tenantId: c8819874-9e56-4e3f-b1a8-1c0325138f27
  subscriptionId: 0f9d7fea-99e8-4768-8672-06a28514f77e
  resourceGroupName: promitor
metricDefaults:
  aggregation:
    interval: 00:05:00
  scraping:
    # Every minute
    schedule: "0 * * ? * *"
metrics:
  - name: promitor_demo_servicebusqueue_queue_size
    description: "Amount of active messages of the 'orders' queue (determined with ServiceBusQueue provider)"
    resourceType: ServiceBusQueue
    azureMetricConfiguration:
      metricName: ActiveMessages
      aggregation:
        type: Average
        # Optionally override the default aggregation interval (metricDefaults.aggregation.interval)
        interval: 00:15:00
    resources:
    - namespace: promitor-messaging
      queueName: orders
    - namespace: promitor-messaging-other-subscription
      queueName: orders
      subscriptionId: 0329dd2a-59dc-4493-aa54-cb01cb027dc2
      resourceGroupName: promitor-sources

With this server configuration:

server:
  httpPort: 88
metricSinks:
  prometheusScrapingEndpoint:
    metricUnavailableValue: NaN
    enableMetricTimestamps: true # true by default
    baseUriPath: /metrics
  statsd:
    host: graphite
    port: 8125
    metricPrefix: promitor.
metricsConfiguration:
  absolutePath: /config/metrics-declaration.yaml
telemetry:
  applicationInsights:
    instrumentationKey: ABC
    isEnabled: true
    verbosity: warning
  containerLogs:
    isEnabled: true
    verbosity: trace
  defaultVerbosity: trace
azureMonitor:
  logging:
    informationLevel: Headers
    isEnabled: true

Everything gets reported nicely:

# HELP promitor_demo_servicebusqueue_queue_size Amount of active messages of the 'orders' queue (determined with ServiceBusQueue provider)
# TYPE promitor_demo_servicebusqueue_queue_size gauge
promitor_demo_servicebusqueue_queue_size{resource_group="promitor-sources",subscription_id="0329dd2a-59dc-4493-aa54-cb01cb027dc2",resource_uri="subscriptions/0329dd2a-59dc-4493-aa54-cb01cb027dc2/resourceGroups/promitor-sources/providers/Microsoft.ServiceBus/namespaces/promitor-messaging-other-subscription",instance_name="orders",entity_name="orders"} 12.133333333333333
promitor_demo_servicebusqueue_queue_size{resource_group="promitor",subscription_id="0f9d7fea-99e8-4768-8672-06a28514f77e",resource_uri="subscriptions/0f9d7fea-99e8-4768-8672-06a28514f77e/resourceGroups/promitor/providers/Microsoft.ServiceBus/namespaces/promitor-messaging",instance_name="orders",entity_name="orders"} 540.4

I used the same config against tomkerkhove/promitor-agent-scraper-ci:pr1054 and it resulted in:

# HELP promitor_demo_servicebusqueue_queue_size Amount of active messages of the 'orders' queue (determined with ServiceBusQueue provider)
# TYPE promitor_demo_servicebusqueue_queue_size gauge
promitor_demo_servicebusqueue_queue_size{resource_group="promitor-sources",subscription_id="0329dd2a-59dc-4493-aa54-cb01cb027dc2",resource_uri="subscriptions/0329dd2a-59dc-4493-aa54-cb01cb027dc2/resourceGroups/promitor-sources/providers/Microsoft.ServiceBus/namespaces/promitor-messaging-other-subscription",instance_name="orders",entity_name="orders"} 13
promitor_demo_servicebusqueue_queue_size{resource_group="promitor",subscription_id="0f9d7fea-99e8-4768-8672-06a28514f77e",resource_uri="subscriptions/0f9d7fea-99e8-4768-8672-06a28514f77e/resourceGroups/promitor/providers/Microsoft.ServiceBus/namespaces/promitor-messaging",instance_name="orders",entity_name="orders"} 579

(Command used: docker run -d -p 8999:88 --name promitor-agent-scraper-repro --env-file C:/Promitor/az-mon-auth.creds --volume C:/Promitor/metrics-declaration-repro.yaml:/config/metrics-declaration.yaml --volume C:/Promitor/runtime-repro.yaml:/config/runtime.yaml tomkerkhove/promitor-agent-scraper-ci:pr1054)

Can you share the logs of your instance, please? I'll need some more info since I cannot reproduce it :/

@ChristianEder (Contributor, Author) commented May 29, 2020

Here you go. These are the logs, with just the sensitive information replaced, using tomkerkhove/promitor-agent-scraper-ci:pr1054:

██████╗ ██████╗  ██████╗ ███╗   ███╗██╗████████╗ ██████╗ ██████╗ 
██╔══██╗██╔══██╗██╔═══██╗████╗ ████║██║╚══██╔══╝██╔═══██╗██╔══██╗
██████╔╝██████╔╝██║   ██║██╔████╔██║██║   ██║   ██║   ██║██████╔╝
██╔═══╝ ██╔══██╗██║   ██║██║╚██╔╝██║██║   ██║   ██║   ██║██╔══██╗
██║     ██║  ██║╚██████╔╝██║ ╚═╝ ██║██║   ██║   ╚██████╔╝██║  ██║
╚═╝     ╚═╝  ╚═╝ ╚═════╝ ╚═╝     ╚═╝╚═╝   ╚═╝    ╚═════╝ ╚═╝  ╚═╝
[05:49:21 INF] Starting validation of Promitor setup
[05:49:21 INF] Start Validation step 1/6: Metrics Declaration Path
[05:49:21 INF] Scrape configuration found at '/config/metrics-declaration.yaml'
[05:49:21 INF] Validation step 1/6 succeeded
[05:49:21 INF] Start Validation step 2/6: Azure Authentication
[05:49:21 INF] Validation step 2/6 succeeded
[05:49:21 INF] Start Validation step 3/6: Metrics Declaration
[05:49:21 INF] Metrics declaration is using spec version v1
[05:49:21 WRN] The following problems were found with the metric configuration:
Warning 11:3: Unknown field 'labels'.
[05:49:22 INF] Following metrics configuration was configured:
azureMetadata:
  resourceGroupName: my-resource-group
  subscriptionId: my-subscription
  tenantId: my-tenant
  cloud:
    name: AzureGlobalCloud
    authenticationEndpoint: https://login.microsoftonline.com/
    resourceManagerEndpoint: https://management.azure.com/
    graphEndpoint: https://graph.windows.net/
    managementEndpoint: https://management.core.windows.net/
    storageEndpointSuffix: core.windows.net
    keyVaultSuffix: vault.azure.net
metricDefaults:
  aggregation:
    interval: 00:05:00
  scraping: &o0
    schedule: 0 */2 * ? * *
metrics:
- azureMetricConfiguration:
    metricName: ActiveMessages
    aggregation:
      type: Average
      interval: 00:05:00
  prometheusMetricDefinition:
    name: azure_service_bus_active_messages
    description: Count of active messages per queue / topic
    labels: {}
  scraping: *o0
  resourceType: ServiceBusQueue
  resources:
  - namespace: my-namespace
    queueName: queue1
    resourceType: ServiceBusQueue
  - namespace: my-namespace
    queueName: queue2
    resourceType: ServiceBusQueue
  - namespace: my-namespace
    queueName: queue3
    resourceType: ServiceBusQueue

[05:49:22 INF] Validation step 3/6 succeeded
[05:49:22 INF] Start Validation step 4/6: Resource Discovery
[05:49:22 INF] Validation step 4/6 succeeded
[05:49:22 INF] Start Validation step 5/6: StatsD Metric Sink
[05:49:22 INF] Validation step 5/6 succeeded
[05:49:22 INF] Start Validation step 6/6: Prometheus Scraping Endpoint Metric Sink
[05:49:22 INF] Validation step 6/6 succeeded
[05:49:22 INF] Promitor configuration is valid, we are good to go.
[05:49:22 INF] Metrics declaration is using spec version v1
[05:49:22 INF] [Job][my-subscription-azure_service_bus_active_messages] is running under time zone: Etc/UTC
[05:49:22 INF] [Job][my-subscription-azure_service_bus_active_messages] is running under time zone: Etc/UTC
[05:49:22 INF] [Job][my-subscription-azure_service_bus_active_messages] is running under time zone: Etc/UTC
[05:49:22 INF] Scraping Azure Monitor - 05/29/2020 05:49:22 +00:00
[05:49:22 INF] Scraping azure_service_bus_active_messages for resource type ServiceBusQueue
[05:49:22 DBG] Failed to locate the development https certificate at 'null'.
[05:49:22 INF] Now listening on: http://[::]:80
[05:49:22 INF] Application started. Press Ctrl+C to shut down.
[05:49:22 INF] Hosting environment: Production
[05:49:22 INF] Content root path: /app
[05:49:27 INF] Found value 0 for metric azure_service_bus_active_messages with aggregation interval 00:05:00
[05:49:27 INF] Metrics declaration is using spec version v1
[05:49:32 INF] Request starting HTTP/1.1 GET http://localhost:8999/metrics  
[05:49:32 DBG] Wildcard detected, all requests with hosts will be allowed.
[05:49:32 VRB] All hosts are allowed.
[05:49:32 DBG] No candidates found for the request path '/metrics'
[05:49:32 DBG] Request did not match any endpoints
[05:49:32 VRB] Setting correlation response header 'X-Transaction-ID' to '683f9178-d84e-49fc-9897-f1a44147b0fb'
[05:49:32 VRB] Setting correlation response header 'RequestId' to '0HM03IG7RAJ77:00000001'
[05:49:32 INF] HTTP Request GET http://localhost:8999//metrics completed with 200 in 00:00:00.0452332 at 05/29/2020 05:49:32 +00:00 - (Context: {"Connection": ["keep-alive"], "Accept": ["text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"], "Accept-Encoding": ["gzip, deflate, br"], "Accept-Language": ["en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7,fr;q=0.6"], "Host": ["localhost:8999"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"], "Upgrade-Insecure-Requests": ["1"], "Sec-Fetch-Site": ["none"], "Sec-Fetch-Mode": ["navigate"], "Sec-Fetch-User": ["?1"], "Sec-Fetch-Dest": ["document"]})
[05:49:32 INF] Request finished in 72.2303ms 200 text/plain; version=0.0.4
[05:49:32 INF] Request starting HTTP/1.1 GET http://localhost:8999/favicon.ico
[05:49:32 VRB] All hosts are allowed.
[05:49:32 DBG] No candidates found for the request path '/favicon.ico'
[05:49:32 DBG] Request did not match any endpoints
[05:49:32 DBG] The request path  does not match the path filter
[05:49:32 DBG] The request path  does not match the path filter
[05:49:32 INF] HTTP Request GET http://localhost:8999//favicon.ico completed with 404 in 00:00:00.0264407 at 05/29/2020 05:49:32 +00:00 - (Context: {"Connection": ["keep-alive"], "Accept": ["image/webp,image/apng,image/*,*/*;q=0.8"], "Accept-Encoding": ["gzip, deflate, br"], "Accept-Language": ["en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7,fr;q=0.6"], "Host": ["localhost:8999"], "Referer": ["http://localhost:8999/metrics"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"], "Sec-Fetch-Site": ["same-origin"], "Sec-Fetch-Mode": ["no-cors"], "Sec-Fetch-Dest": ["image"]})        
[05:49:32 VRB] Setting correlation response header 'X-Transaction-ID' to '5b5211a8-2676-4edb-a77b-03f27f78c86b'
[05:49:32 VRB] Setting correlation response header 'RequestId' to '0HM03IG7RAJ77:00000002'
[05:49:32 INF] Request finished in 27.4889ms 404
[05:49:37 INF] Request starting HTTP/1.1 GET http://localhost:8999/metrics  
[05:49:37 VRB] All hosts are allowed.
[05:49:37 DBG] No candidates found for the request path '/metrics'
[05:49:37 DBG] Request did not match any endpoints
[05:49:37 VRB] Setting correlation response header 'X-Transaction-ID' to 'ded8d559-41e3-4705-b92b-b08618af852d'
[05:49:37 VRB] Setting correlation response header 'RequestId' to '0HM03IG7RAJ77:00000003'
[05:49:37 INF] HTTP Request GET http://localhost:8999//metrics completed with 200 in 00:00:00.0029253 at 05/29/2020 05:49:37 +00:00 - (Context: {"Cache-Control": ["max-age=0"], "Connection": ["keep-alive"], "Accept": ["text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"], "Accept-Encoding": ["gzip, deflate, br"], "Accept-Language": ["en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7,fr;q=0.6"], "Host": ["localhost:8999"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"], "Upgrade-Insecure-Requests": ["1"], 
"Sec-Fetch-Site": ["none"], "Sec-Fetch-Mode": ["navigate"], "Sec-Fetch-User": ["?1"], "Sec-Fetch-Dest": ["document"]})
[05:49:37 INF] Request finished in 6.933ms 200 text/plain; version=0.0.4
[05:49:38 INF] Request starting HTTP/1.1 GET http://localhost:8999/metrics
[05:49:38 VRB] All hosts are allowed.
[05:49:38 DBG] No candidates found for the request path '/metrics'
[05:49:38 DBG] Request did not match any endpoints
[05:49:38 VRB] Setting correlation response header 'X-Transaction-ID' to '59c9fbab-52ee-44dc-9218-4562a61d161a'
[05:49:38 VRB] Setting correlation response header 'RequestId' to '0HM03IG7RAJ77:00000004'
[05:49:38 INF] HTTP Request GET http://localhost:8999//metrics completed with 200 in 00:00:00.0011642 at 05/29/2020 05:49:38 +00:00 - (Context: {"Cache-Control": ["max-age=0"], "Connection": ["keep-alive"], "Accept": ["text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"], "Accept-Encoding": ["gzip, deflate, br"], "Accept-Language": ["en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7,fr;q=0.6"], "Host": ["localhost:8999"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"], "Upgrade-Insecure-Requests": ["1"], 
"Sec-Fetch-Site": ["none"], "Sec-Fetch-Mode": ["navigate"], "Sec-Fetch-User": ["?1"], "Sec-Fetch-Dest": ["document"]})
[05:49:38 INF] Request finished in 3.2168ms 200 text/plain; version=0.0.4

I can see in the logs that 3 scraping jobs for the same metric name get instantiated, which looks like it's expected, right? But then only 1 of them also logs that it scraped a value.

I will also try with your config and see what the difference is.

@ChristianEder (Contributor, Author)

And now I tried your configuration, with just the credentials, namespace and queue names replaced, with the same result: still only 1 queue gets reported. The only difference between our configurations that I can spot is that you are using queues from different namespaces, while my queues are all in the same namespace.

@ChristianEder (Contributor, Author)

Was there any change to either the ServiceBusQueue resource or some base classes between 1.5.0 and now? Because with 1.5.0, even with my initial configuration, I get all the queues reported just fine.

@tomkerkhove added the support label and removed the bug label on May 29, 2020
@tomkerkhove (Owner)

There was some refactoring which might have caused this, so I will try with multiple queues in the same namespace to hopefully reproduce this!

@tomkerkhove (Owner)

That seems to be the issue. When I use this:

metrics:
  - name: promitor_demo_servicebusqueue_queue_size
    description: "Amount of active messages of the 'orders' queue (determined with ServiceBusQueue provider)"
    resourceType: ServiceBusQueue
    azureMetricConfiguration:
      metricName: ActiveMessages
      aggregation:
        type: Average
        # Optionally override the default aggregation interval (metricDefaults.aggregation.interval)
        interval: 00:15:00
    resources:
    - namespace: promitor-messaging
      queueName: orders
    - namespace: promitor-messaging
      queueName: sales
    - namespace: promitor-messaging-other-subscription
      queueName: orders
      subscriptionId: 0329dd2a-59dc-4493-aa54-cb01cb027dc2
      resourceGroupName: promitor-sources

It only returns this:

# HELP promitor_demo_servicebusqueue_queue_size Amount of active messages of the 'orders' queue (determined with ServiceBusQueue provider)
# TYPE promitor_demo_servicebusqueue_queue_size gauge
promitor_demo_servicebusqueue_queue_size{resource_group="promitor",subscription_id="0f9d7fea-99e8-4768-8672-06a28514f77e",resource_uri="subscriptions/0f9d7fea-99e8-4768-8672-06a28514f77e/resourceGroups/promitor/providers/Microsoft.ServiceBus/namespaces/promitor-messaging",instance_name="sales",entity_name="sales"} 1
promitor_demo_servicebusqueue_queue_size{resource_group="promitor-sources",subscription_id="0329dd2a-59dc-4493-aa54-cb01cb027dc2",resource_uri="subscriptions/0329dd2a-59dc-4493-aa54-cb01cb027dc2/resourceGroups/promitor-sources/providers/Microsoft.ServiceBus/namespaces/promitor-messaging-other-subscription",instance_name="orders",entity_name="orders"} 13
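
In the meantime, one possible workaround (untested; this assumes the collapse only happens between resources that share both a metric definition and a namespace) could be to declare a separate metric per same-namespace queue:

metrics:
  # hypothetical workaround, untested: one metric definition per queue,
  # so entities in the same namespace never share a definition
  - name: promitor_demo_servicebusqueue_orders_queue_size
    description: "Amount of active messages of the 'orders' queue"
    resourceType: ServiceBusQueue
    azureMetricConfiguration:
      metricName: ActiveMessages
      aggregation:
        type: Average
    resources:
    - namespace: promitor-messaging
      queueName: orders
  - name: promitor_demo_servicebusqueue_sales_queue_size
    description: "Amount of active messages of the 'sales' queue"
    resourceType: ServiceBusQueue
    azureMetricConfiguration:
      metricName: ActiveMessages
      aggregation:
        type: Average
    resources:
    - namespace: promitor-messaging
      queueName: sales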

@tomkerkhove changed the title from "Service Bus Queue Metrics are not reported correctly" to "Service Bus queue metrics are not reported correctly for entities in same namespace" on May 29, 2020
@tomkerkhove added the bug label and removed the support label on May 29, 2020