Allocation Metrics no longer emitted #24339

dosera opened this issue Oct 31, 2024 · 4 comments
dosera commented Oct 31, 2024

Nomad version

Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f0c1035401e9df6c64d6ffb8bffc555a5e

Operating system and Environment details

AlmaLinux release 9.4 (Seafoam Ocelot)

/etc/nomad/base.hcl:

telemetry {
..
  prometheus_metrics = "true"
  publish_allocation_metrics = "true"
  publish_node_metrics = "true"

..
}

Issue

After upgrading Nomad from 1.8.4 to 1.9.1, allocation metrics such as nomad_client_allocs_memory_usage no longer appear to be emitted properly, or only for a few seconds at a time (I am using Prometheus to scrape them).

Reproduction steps

  1. Enable telemetry
  2. Start any allocation
  3. Collect metrics

Expected Result

Allocation metrics are available.

Actual Result

Allocation metrics are available only sporadically, at best. See the attached screenshot; Nomad was upgraded on Oct 24th.

This is how it looks in Prometheus since then:
[screenshot: Prometheus graph showing no allocation metrics since the upgrade]
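One way to quantify the gaps (a sketch; the 1-minute scrape interval is an assumption) is to count how many samples Prometheus actually stored per window:

count_over_time(nomad_client_allocs_memory_usage[10m])

With a 1-minute scrape interval this should sit near 10 for every series of a running allocation; windows where it drops or returns no data correspond to scrapes in which the agent did not return the gauge.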


Himura2la commented Nov 1, 2024

We have the same issue with allocation metrics after updating from 1.8.2 to 1.9.1:
[screenshot: Prometheus graph of the query below, showing gaps and zero values]
query: avg_over_time(nomad_client_allocs_memory_usage[10m]) / avg_over_time(nomad_client_allocs_memory_allocated[10m])

They are returned irregularly, and the values are often zero.
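To rule out failed scrapes (a sketch; it assumes the scrape job is named nomad, as in the config below), compare that against how often the target itself was scraped successfully:

count_over_time(up{job="nomad"}[10m])

If up is recorded on every scrape while nomad_client_allocs_memory_usage has gaps, the agent is intermittently serving /v1/metrics without the allocation gauges rather than the scrape itself failing.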

Our configs. Nomad telemetry:

telemetry {
  collection_interval = "30s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

Prometheus scrape config:

global:
  scrape_interval:     1m
  scrape_timeout:     20s
  evaluation_interval: 1m

scrape_configs:
  - job_name: nomad
    static_configs:
      - targets:
        # ...
    scheme: https
    metrics_path: /v1/metrics
    params:
      format: [prometheus]
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      # ...

Operating system: Debian 12, amd64


jrasell commented Nov 4, 2024

Hi @Himura2la and @dosera, and thanks for raising this issue. I have been unable to reproduce this on macOS or Ubuntu 24.04 using the steps below. Could you please provide a minimal job specification that reproduces this, along with any other relevant information?

Example agent config:

telemetry {
  prometheus_metrics         = "true"
  publish_allocation_metrics = "true"
  publish_node_metrics       = "true"
}

Example job:

job "example" {

  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Example test command:

while true; do
  curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("nomad.client.allocs.memory.usage")) | .Value'
  curl -s 'localhost:4646/v1/metrics?format=prometheus' | grep 'nomad.client.allocs.memory.usage{' | awk '{print $2}'
  sleep 2
done
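Note that the Prometheus-format output names the gauge nomad_client_allocs_memory_usage (underscores), while the JSON output uses dots; the grep above still matches because the dots in the pattern act as regex wildcards.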


Himura2la commented Nov 4, 2024

I'm afraid your validation method won't surface the issue, because it is best visible over a long-term run.
I think you can simply follow this tutorial, https://developer.hashicorp.com/nomad/tutorials/manage-clusters/prometheus-metrics, and see how it goes with 5-10 jobs that consume something for several hours.
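For a longer run, a timestamped variant of the loop above might help (a sketch; it assumes the default HTTP address, no ACL token required for /v1/metrics, and an arbitrary log file name):

while true; do
  # log a UTC timestamp plus the number of allocation memory series the agent currently exposes
  echo "$(date -u +%FT%TZ) series=$(curl -s 'localhost:4646/v1/metrics?format=prometheus' | grep -c '^nomad_client_allocs_memory_usage{')" >> alloc_metric_presence.log
  sleep 60
done

Stretches of several minutes with series=0 while allocations are running would point at the agent, not the scrape path, dropping the gauges.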


jedd commented Nov 5, 2024

I'm seeing the same thing on four boxes I upgraded from 1.9.0 to 1.9.1:
2024-10-18 - upgraded from 1.8.4 to 1.9.0 - no issues observed at this point
2024-10-25 - upgraded from 1.9.0 to 1.9.1 - from this point on I am seeing a lot of gaps in the data (screenshot attached below)

I'm using Debian testing on all four machines.

My Prometheus also scrapes Telegraf (in Prometheus export mode) on the same machines, and I'm seeing no drops in that data.

Scrape interval for both is set to 1m. Boxes had no other packages upgraded at that time.

More importantly, I'm not seeing CPU or memory data in the Nomad GUI in either the Task or the Allocation view. I will occasionally get a red line indicating what I presume is the 'current' X MiB of total memory used, but no history at all.
The CPU graph consistently shows 0 MHz, and of course no history either.

[screenshot: Nomad UI resource graphs showing no CPU or memory history]
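As far as I know, the UI's live resource graphs are driven by the client allocation stats endpoint, so checking it directly should show whether stats collection itself has stopped (a sketch; <alloc-id> is a placeholder for a running allocation's ID, and the default address with no ACL is assumed):

curl -s "localhost:4646/v1/client/allocation/<alloc-id>/stats" | jq '.ResourceUsage'

If this returns zeros or errors, the problem would be in the client's stats collection rather than only in the telemetry sink.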
