prometheus: reliable dynamic host discovery #136789

Open
srosenberg opened this issue Dec 5, 2024 · 1 comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team

Comments


srosenberg commented Dec 5, 2024

While roachprod supported only single-tenant clusters, we could rely on gce_sd_configs to dynamically discover VMs and configure scrape targets on default ports, e.g., 26258 for CRDB metrics. However, as described in [1], support for multi-tenant clusters introduced a new problem: discovering the custom ports assigned when (sql) tenants are provisioned. That is, 26258 remains the system tenant's port, but other tenants are assigned random ports. (roachprod.DiscoverService maps tenants to their ports via DNS SRV records.)

The chosen solution (see the wiki linked in [1]) ended up using file_sd_configs backed by a simple REST API, implemented in [2]. In roachprod.Start, we invoke UpdateTargets, which instructs prom-helper-service to create a .yml file on the prometheus host. E.g.,

head /opt/prom/prometheus/instance-configs/teamcity-18024845-1733294775-151-n6cpu4-geo.yml

- targets:
  - 10.142.1.38:26258
  labels:
    cluster: teamcity-18024845-1733294775-151-n6cpu4-geo
    host_ip: 10.142.1.38
    instance: teamcity-18024845-1733294775-151-n6cpu4-geo-0003
    job: cockroachdb
    node: "3"
    project: cockroach-ephemeral
    region: us-east
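For illustration, the sketch below builds an equivalent target group for one node and prints it as JSON, which file_sd_configs accepts alongside YAML. This is not the prom-helper-service implementation: the type targetGroup and the function nodeTargetGroup are hypothetical, the instance naming is inferred from the example above, and the project and region labels are omitted for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup mirrors one entry of a Prometheus file_sd file.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// nodeTargetGroup builds the static labels for a single node.
// Hypothetical helper; label names follow the example in this issue.
func nodeTargetGroup(cluster string, node int, ip string, port int) targetGroup {
	return targetGroup{
		Targets: []string{fmt.Sprintf("%s:%d", ip, port)},
		Labels: map[string]string{
			"cluster":  cluster,
			"host_ip":  ip,
			"instance": fmt.Sprintf("%s-%04d", cluster, node),
			"job":      "cockroachdb",
			"node":     fmt.Sprint(node),
		},
	}
}

func main() {
	g := nodeTargetGroup("teamcity-18024845-1733294775-151-n6cpu4-geo", 3, "10.142.1.38", 26258)
	out, err := json.MarshalIndent([]targetGroup{g}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Note that every label here is derived from the arguments at write time, which is exactly why the file keeps describing the old cluster if it is never removed.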

In roachprod.DestroyCluster, we invoke DeleteClusterConfig, which instructs prom-helper-service to remove the corresponding .yml file. This simple mechanism seems to work, assuming the invocations of UpdateTargets and DeleteClusterConfig succeed.

Since roachprod.Start can be invoked multiple times for a given cluster, e.g., starting a subset of the nodes at a time, UpdateTargets must be able to succeed each time; otherwise, it may fail to discover some of the tenants. Note that the current implementation doesn't support tenants at all; it uses system instead.
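One way to picture the requirement is as a merge: each call to roachprod.Start contributes the targets for the nodes it started, and the result must be the union across calls. The sketch below is only an assumption about how such a merge could look; targetKey, mergeTargets, the tenant name "app", and the addresses are all hypothetical and do not describe the current UpdateTargets.

```go
package main

import (
	"fmt"
	"sort"
)

// targetKey identifies one scrape target: a node and the tenant it serves.
type targetKey struct {
	Node   int
	Tenant string
}

// mergeTargets returns the union of previously known targets and those
// reported by the latest start; the newer address wins on conflict.
func mergeTargets(known, started map[targetKey]string) map[targetKey]string {
	merged := make(map[targetKey]string, len(known)+len(started))
	for k, addr := range known {
		merged[k] = addr
	}
	for k, addr := range started {
		merged[k] = addr
	}
	return merged
}

func main() {
	// First start: nodes 1 and 2, system tenant only.
	known := map[targetKey]string{
		{Node: 1, Tenant: "system"}: "10.0.0.1:26258",
		{Node: 2, Tenant: "system"}: "10.0.0.2:26258",
	}
	// Second start: node 3, with an additional tenant on a custom port.
	started := map[targetKey]string{
		{Node: 3, Tenant: "system"}: "10.0.0.3:26258",
		{Node: 3, Tenant: "app"}:    "10.0.0.3:29001",
	}
	merged := mergeTargets(known, started)
	addrs := make([]string, 0, len(merged))
	for _, a := range merged {
		addrs = append(addrs, a)
	}
	sort.Strings(addrs)
	fmt.Println(addrs)
}
```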

Failing to execute DeleteClusterConfig results in a stale scrape config. Because the labels are static, this has an undesirable side effect when the same private IP is reused by an entirely different cluster. E.g., consider the following failure,

[w11] 2024/12/04 11:11:44 cluster_cloud.go:424: Failed to delete the cluster config with cluster as secure: DeleteClusterConfig: failed on url: https://grafana.testeng.crdb.io/promhelpers/v1/instance-configs/teamcity-18024999-1733294269-218-n6cpu4: Delete "https://grafana.testeng.crdb.io/promhelpers/v1/instance-configs/teamcity-18024999-1733294269-218-n6cpu4": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

At this point, the corresponding scrape config will remain on the filesystem indefinitely. At a later time, a new cluster is going to reuse the same IP(s). The stale scrape config then becomes active again, except this time it ingests timeseries that are bogus (they carry the old cluster's labels) and duplicated.
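The collision can be illustrated with a small sketch that scans per-cluster target lists and reports any target claimed by more than one cluster config. It only demonstrates the failure mode and is not a proposed fix; conflictingTargets and the cluster names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// conflictingTargets maps each target claimed by more than one cluster
// config to the sorted names of the clusters claiming it.
func conflictingTargets(configs map[string][]string) map[string][]string {
	owners := make(map[string]map[string]bool)
	for cluster, targets := range configs {
		for _, t := range targets {
			if owners[t] == nil {
				owners[t] = make(map[string]bool)
			}
			owners[t][cluster] = true
		}
	}
	conflicts := make(map[string][]string)
	for t, set := range owners {
		if len(set) < 2 {
			continue
		}
		names := make([]string, 0, len(set))
		for c := range set {
			names = append(names, c)
		}
		sort.Strings(names)
		conflicts[t] = names
	}
	return conflicts
}

func main() {
	// Assumed scenario: the config of a destroyed cluster was never
	// deleted, and a new cluster was later assigned the same private IP.
	configs := map[string][]string{
		"stale-cluster": {"10.142.1.38:26258"},
		"new-cluster":   {"10.142.1.38:26258", "10.142.1.39:26258"},
	}
	for target, clusters := range conflictingTargets(configs) {
		fmt.Println(target, "is claimed by", clusters)
	}
}
```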

While adding a GC service to prom-helper-service may seem like a solution for removing stale configs, it doesn't address the IP reuse. The labels should be dynamically discovered from a VM instead of statically assigned to an IP; it doesn't appear that file_sd_configs supports this.

[1] #117125
[2] https://github.com/cockroachlabs/prom-helper-service

Jira issue: CRDB-45248

@srosenberg srosenberg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-testeng TestEng Team labels Dec 5, 2024

blathers-crl bot commented Dec 5, 2024

cc @cockroachdb/test-eng
