prometheus: reliable dynamic host discovery #136789
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-testeng (TestEng Team)
While `roachprod` supported only single-tenant clusters, we could rely on `gce_sd_configs` to dynamically discover VMs and configure scrape targets on default ports, e.g., `26258` for CRDB metrics. However, as described in [1], the support for multi-tenant clusters added a new problem, namely the discovery of custom ports assigned during the provisioning of (sql) tenants; i.e., `26258` remains the port of the `system` tenant, but other tenants are assigned random ports. (`roachprod.DiscoverService` maps tenants to their ports via DNS SRV records.)

The chosen solution (see the wiki linked in [1]) ended up using `file_sd_configs` backed by a simple REST API, implemented in [2]. In `roachprod.Start`, we invoke `UpdateTargets`, which instructs `prom-helper-service` to create a `.yml` file on the prometheus host. In `roachprod.DestroyCluster`, we invoke `DeleteClusterConfig`, which instructs `prom-helper-service` to remove the corresponding `.yml` file. This simple mechanism seems to work, assuming the invocations of `UpdateTargets` and `DeleteClusterConfig` succeed.

Since `roachprod.Start` can be invoked multiple times for a given cluster, e.g., starting a subset of the nodes at a time, `UpdateTargets` must be able to succeed each time; otherwise, it may fail to discover some of the tenants. Note, the current implementation doesn't even support tenants; it uses `system` instead.

Failing to execute `DeleteClusterConfig` results in a stale scrape config. Because the labels are static, this can yield a rather undesirable side effect when the same private ip is reused by an entirely different cluster. E.g., consider a failed `DeleteClusterConfig` invocation. At this point, the corresponding scrape config will remain on the filesystem indefinitely. At a later time, a new cluster is going to reuse the same ip(s). Thus, the stale scrape config is active again, except this time it's ingesting timeseries which are bogus and duplicated.
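To make the ip-reuse failure above concrete, here is a minimal Go sketch. It models a `file_sd_configs` target file as a list of `host:port` targets plus static labels, renders one in YAML, and then flags any target address that is claimed by more than one cluster's file. The file names, the label names (`cluster`, `tenant`), the ip, and the helper names are all hypothetical; this is not the format or code used by `prom-helper-service`.

```go
package main

import (
	"fmt"
	"sort"
)

// TargetGroup models one entry of a file_sd target file:
// "host:port" targets plus static labels.
type TargetGroup struct {
	Targets []string
	Labels  map[string]string
}

// renderYAML renders a target group as a YAML list entry with "targets"
// and "labels" keys. Label names are supplied by the caller (hypothetical here).
func renderYAML(g TargetGroup) string {
	out := "- targets:\n"
	for _, t := range g.Targets {
		out += fmt.Sprintf("  - %q\n", t)
	}
	out += "  labels:\n"
	keys := make([]string, 0, len(g.Labels))
	for k := range g.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output
	for _, k := range keys {
		out += fmt.Sprintf("    %s: %q\n", k, g.Labels[k])
	}
	return out
}

// conflictingTargets returns every target address that appears in more than
// one target file, mapped to the sorted names of the files claiming it.
func conflictingTargets(files map[string]TargetGroup) map[string][]string {
	owners := map[string][]string{}
	for name, g := range files {
		for _, t := range g.Targets {
			owners[t] = append(owners[t], name)
		}
	}
	conflicts := map[string][]string{}
	for t, names := range owners {
		if len(names) > 1 {
			sort.Strings(names)
			conflicts[t] = names
		}
	}
	return conflicts
}

func main() {
	files := map[string]TargetGroup{
		// Stale file left behind because the delete call failed.
		"old-cluster.yml": {
			Targets: []string{"10.0.0.5:26258"},
			Labels:  map[string]string{"cluster": "old-cluster", "tenant": "system"},
		},
		// A new, unrelated cluster that was handed the same private ip.
		"new-cluster.yml": {
			Targets: []string{"10.0.0.5:26258"},
			Labels:  map[string]string{"cluster": "new-cluster", "tenant": "system"},
		},
	}
	fmt.Print(renderYAML(files["old-cluster.yml"]))
	for t, names := range conflictingTargets(files) {
		fmt.Printf("%s is claimed by %v\n", t, names)
	}
}
```

Because the labels live in the file rather than on the VM, nothing in the stale file changes when the ip moves to another cluster; a check like `conflictingTargets` could at best detect the overlap after the fact.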
While adding a GC service to `prom-helper-service` may seem like a solution for removing stale configs, it doesn't address the ip reuse. The labels should be dynamically discovered from a VM, instead of statically assigned to an ip; it doesn't appear that `file_sd_configs` supports it.

[1] #117125
[2] https://github.com/cockroachlabs/prom-helper-service
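
For reference, the GC idea mentioned above could look roughly like the following Go sketch, which treats a target file as stale once it has not been rewritten for longer than a TTL. Using the file's modification time as the staleness signal, the 24h TTL, and all names here (`ConfigInfo`, `expiredConfigs`, `scanDir`) are assumptions for illustration; none of this exists in `prom-helper-service`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"time"
)

// ConfigInfo is hypothetical bookkeeping for one target file.
type ConfigInfo struct {
	Path string
	Age  time.Duration // time since the file was last written
}

// expiredConfigs returns the paths whose age exceeds ttl, sorted.
func expiredConfigs(infos []ConfigInfo, ttl time.Duration) []string {
	var out []string
	for _, c := range infos {
		if c.Age > ttl {
			out = append(out, c.Path)
		}
	}
	sort.Strings(out)
	return out
}

// scanDir lists the .yml files in dir with their age relative to now.
func scanDir(dir string, now time.Time) ([]ConfigInfo, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var infos []ConfigInfo
	for _, e := range entries {
		if e.IsDir() || filepath.Ext(e.Name()) != ".yml" {
			continue
		}
		fi, err := e.Info()
		if err != nil {
			return nil, err
		}
		infos = append(infos, ConfigInfo{
			Path: filepath.Join(dir, e.Name()),
			Age:  now.Sub(fi.ModTime()),
		})
	}
	return infos, nil
}

func main() {
	dir, err := os.MkdirTemp("", "targets")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	now := time.Now()
	stale := filepath.Join(dir, "old-cluster.yml")
	fresh := filepath.Join(dir, "new-cluster.yml")
	for _, p := range []string{stale, fresh} {
		if err := os.WriteFile(p, []byte("- targets: []\n"), 0o644); err != nil {
			panic(err)
		}
	}
	// Backdate the stale file by 48h to simulate a missed delete.
	old := now.Add(-48 * time.Hour)
	if err := os.Chtimes(stale, old, old); err != nil {
		panic(err)
	}

	infos, err := scanDir(dir, now)
	if err != nil {
		panic(err)
	}
	for _, p := range expiredConfigs(infos, 24*time.Hour) {
		fmt.Println("removing", filepath.Base(p))
		if err := os.Remove(p); err != nil {
			panic(err)
		}
	}
}
```

Even under these assumptions, the sketch only bounds how long a stale file survives: if the ip is reused before the TTL expires, the stale labels are still applied to the new cluster's metrics, which is the limitation described above.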
Jira issue: CRDB-45248