Allocation Metrics no longer emitted #24339

dosera opened this issue Oct 31, 2024 · 4 comments
dosera commented Oct 31, 2024

Nomad version

Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f0c1035401e9df6c64d6ffb8bffc555a5e

Operating system and Environment details

AlmaLinux release 9.4 (Seafoam Ocelot)

/etc/nomad/base.hcl:

telemetry {
..
  prometheus_metrics = "true"
  publish_allocation_metrics = "true"
  publish_node_metrics = "true"

..
}

Issue

After upgrading Nomad from 1.8.4 to 1.9.1, allocation metrics such as nomad_client_allocs_memory_usage no longer appear to be emitted properly, or only for a few seconds at a time (I am using Prometheus to scrape them).

Reproduction steps

  1. Enable telemetry
  2. Start any allocation
  3. Collect metrics

Expected Result

Allocation metrics are available.

Actual Result

Allocation metrics are available only sporadically, at best. See the attached screenshot; Nomad was upgraded on Oct 24th.

This is how it looks in Prometheus since then:
[screenshot: Prometheus graph showing no allocation metrics since the upgrade]
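One way to quantify the gaps (a sketch; the 1-minute scrape interval is an assumption) is to count how many samples Prometheus actually stored per window:

count_over_time(nomad_client_allocs_memory_usage[10m])

With a 1-minute scrape interval this should sit near 10 for every series of a running allocation; windows where it drops or returns no data correspond to scrapes in which the agent did not return the gauge.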


Himura2la commented Nov 1, 2024

We have the same issue with allocation metrics after updating from 1.8.2 to 1.9.1:
[screenshot: Prometheus graph of the query below, showing gaps and zero values]
query: avg_over_time(nomad_client_allocs_memory_usage[10m]) / avg_over_time(nomad_client_allocs_memory_allocated[10m])

They are returned irregularly, and the values are often zero.
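To rule out failed scrapes (a sketch; it assumes the scrape job is named nomad, as in the config below), compare that against how often the target itself was scraped successfully:

count_over_time(up{job="nomad"}[10m])

If up is recorded on every scrape while nomad_client_allocs_memory_usage has gaps, the agent is intermittently serving /v1/metrics without the allocation gauges rather than the scrape itself failing.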

Our configs. Nomad telemetry:

telemetry {
  collection_interval = "30s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

Prometheus scrape config:

global:
  scrape_interval:     1m
  scrape_timeout:     20s
  evaluation_interval: 1m

scrape_configs:
  - job_name: nomad
    static_configs:
      - targets:
        # ...
    scheme: https
    metrics_path: /v1/metrics
    params:
      format: [prometheus]
    tls_config:
      insecure_skip_verify: true
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      # ...

Operating system: Debian 12, amd64


jrasell commented Nov 4, 2024

Hi @Himura2la and @dosera, and thanks for raising this issue. I have been unable to reproduce this on macOS or Ubuntu 24.04 using the steps below. Could you please provide a minimal job specification that reproduces this, along with any other relevant information?

Example agent config:

telemetry {
  prometheus_metrics         = "true"
  publish_allocation_metrics = "true"
  publish_node_metrics       = "true"
}

Example job:

job "example" {

  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Example test command:

while true; do
  curl -s localhost:4646/v1/metrics | jq '.Gauges | .[] | select(.Name | contains("nomad.client.allocs.memory.usage")) | .Value'
  curl -s 'localhost:4646/v1/metrics?format=prometheus' | grep 'nomad.client.allocs.memory.usage{' | awk '{print $2}'
  sleep 2
done
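Note that the Prometheus-format output names the gauge nomad_client_allocs_memory_usage (underscores), while the JSON output uses dots; the grep above still matches because the dots in the pattern act as regex wildcards.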


Himura2la commented Nov 4, 2024

I'm afraid your validation method won't surface the issue, because it is best visible over a long-term run.
I think you can simply follow this tutorial, https://developer.hashicorp.com/nomad/tutorials/manage-clusters/prometheus-metrics, and see how it goes with 5-10 jobs that consume something for several hours.
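For a longer run, a timestamped variant of the loop above might help (a sketch; it assumes the default HTTP address, no ACL token required for /v1/metrics, and an arbitrary log file name):

while true; do
  # log a UTC timestamp plus the number of allocation memory series the agent currently exposes
  echo "$(date -u +%FT%TZ) series=$(curl -s 'localhost:4646/v1/metrics?format=prometheus' | grep -c '^nomad_client_allocs_memory_usage{')" >> alloc_metric_presence.log
  sleep 60
done

Stretches of several minutes with series=0 while allocations are running would point at the agent, not the scrape path, dropping the gauges.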


jedd commented Nov 5, 2024

I'm seeing the same thing on four boxes I upgraded from 1.9.0 to 1.9.1:
2024-10-18 - upgraded from 1.8.4 to 1.9.0 - no issues observed at this point
2024-10-25 - upgraded from 1.9.0 to 1.9.1 - from this point on I am seeing a lot of gaps in the data (screenshot attached below)

I'm using Debian testing on all four machines.

My Prometheus also scrapes Telegraf (in Prometheus export mode) on the same machines, and I'm seeing no drops in that data.

Scrape interval for both is set to 1m. Boxes had no other packages upgraded at that time.

More importantly, I'm not seeing CPU or memory data in the Nomad GUI in either the Task or the Allocation view. I will occasionally get a red line indicating what I presume is the 'current' X MiB of total memory used, but no history at all.
The CPU graph consistently shows 0 MHz, and of course no history either.

[screenshot: Nomad UI resource graphs showing no CPU or memory history]
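As far as I know, the UI's live resource graphs are driven by the client allocation stats endpoint, so checking it directly should show whether stats collection itself has stopped (a sketch; <alloc-id> is a placeholder for a running allocation's ID, and the default address with no ACL is assumed):

curl -s "localhost:4646/v1/client/allocation/<alloc-id>/stats" | jq '.ResourceUsage'

If this returns zeros or errors, the problem would be in the client's stats collection rather than only in the telemetry sink.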
