Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix openmetrics v2 service check: there shouldn’t be any host #13146

Merged
merged 8 commits into from
Oct 27, 2022

Conversation

L3n41c
Copy link
Member

@L3n41c L3n41c commented Oct 17, 2022

What does this PR do?

Remove host from openmetrics v2 metrics when executed as a service check.

Motivation

Service checks should never attach a host to the metrics they emit. Or at least, not the host of the cluster check runner that has executed the check.

Additional Notes

Describe how to test/QA your changes

Add AD annotations with openmetrics_endpoint and not prometheus_url on a k8s Service to schedule an openmetrics V2 check.
Validate that no host is attached to the metrics.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached
  • If the PR doesn't need to be tested during QA, please add a qa/skip-qa label.

@codecov
Copy link

codecov bot commented Oct 17, 2022

Codecov Report

Merging #13146 (e5db34f) into master (a821606) will decrease coverage by 0.36%.
The diff coverage is 100.00%.

Flag Coverage Δ
active_directory 100.00% <ø> (+17.64%) ⬆️
airflow 90.00% <ø> (ø)
ambari 85.75% <ø> (ø)
apache 95.08% <ø> (ø)
arangodb 98.21% <ø> (ø)
aspdotnet 100.00% <ø> (+26.19%) ⬆️
avi_vantage 92.50% <ø> (ø)
azure_iot_edge 82.00% <ø> (ø)
boundary 100.00% <ø> (ø)
btrfs 82.91% <ø> (ø)
cacti 87.90% <ø> (ø)
calico 83.33% <ø> (ø)
cilium 75.34% <ø> (+0.93%) ⬆️
cisco_aci 95.27% <ø> (ø)
citrix_hypervisor 87.50% <ø> (ø)
cloud_foundry_api 95.98% <ø> (+0.12%) ⬆️
cockroachdb 97.36% <ø> (ø)
consul 91.64% <ø> (ø)
coredns 94.54% <ø> (ø)
crio 89.79% <ø> (ø)
datadog_checks_base 89.79% <100.00%> (+0.36%) ⬆️
datadog_checks_downloader 70.61% <ø> (-7.26%) ⬇️
ddev 98.56% <ø> (+0.46%) ⬆️
disk 89.42% <ø> (-2.19%) ⬇️
dns_check 93.90% <ø> (ø)
dotnetclr 94.93% <ø> (+15.18%) ⬆️
druid 97.70% <ø> (ø)
eks_fargate 94.05% <ø> (ø)
envoy 94.00% <ø> (-0.23%) ⬇️
etcd 93.96% <ø> (ø)
exchange_server 96.85% <ø> (+11.81%) ⬆️
external_dns 89.09% <ø> (ø)
fluentd 94.77% <ø> (ø)
foundationdb 83.88% <ø> (ø)
gearmand 78.26% <ø> (+1.24%) ⬆️
harbor 80.04% <ø> (ø)
hazelcast 92.39% <ø> (ø)
hdfs_datanode 89.74% <ø> (ø)
hdfs_namenode 86.72% <ø> (ø)
http_check 95.38% <ø> (+2.08%) ⬆️
ibm_ace 91.79% <ø> (ø)
ibm_i 81.95% <ø> (-0.33%) ⬇️
ibm_mq 50.99% <ø> (-39.65%) ⬇️
impala 97.97% <ø> (ø)
istio 77.65% <ø> (+0.55%) ⬆️
kong 88.06% <ø> (ø)
kube_apiserver_metrics 97.67% <ø> (ø)
kube_controller_manager 96.85% <ø> (ø)
kube_dns 98.85% <ø> (ø)
kube_metrics_server 100.00% <ø> (ø)
kube_proxy 100.00% <ø> (ø)
kube_scheduler 97.14% <ø> (ø)
kubelet 90.86% <ø> (+<0.01%) ⬆️
kubernetes_state 89.35% <ø> (ø)
kyototycoon 85.96% <ø> (ø)
lighttpd 83.64% <ø> (ø)
linkerd 85.14% <ø> (+1.14%) ⬆️
linux_proc_extras 96.22% <ø> (ø)
mapr 82.70% <ø> (ø)
mapreduce 81.77% <ø> (+0.46%) ⬆️
marathon 83.12% <ø> (ø)
marklogic 96.03% <ø> (ø)
mcache 93.26% <ø> (ø)
mesos_master 89.75% <ø> (ø)
mesos_slave 93.63% <ø> (ø)
nagios 89.01% <ø> (ø)
network 87.76% <ø> (+2.68%) ⬆️
nfsstat 95.20% <ø> (ø)
nginx 95.24% <ø> (+0.54%) ⬆️
nginx_ingress_controller 98.36% <ø> (ø)
openldap 96.33% <ø> (ø)
openmetrics 97.90% <ø> (ø)
openstack 51.45% <ø> (ø)
openstack_controller 90.94% <ø> (ø)
pgbouncer 91.33% <ø> (ø)
postfix 88.04% <ø> (ø)
powerdns_recursor 96.65% <ø> (ø)
proxysql 98.97% <ø> (ø)
pulsar 100.00% <ø> (ø)
redisdb 87.50% <ø> (ø)
rethinkdb 97.93% <ø> (ø)
riak 99.22% <ø> (ø)
riakcs 93.61% <ø> (ø)
singlestore 90.81% <ø> (ø)
snowflake 96.47% <ø> (ø)
squid 100.00% <ø> (ø)
ssh_check 91.58% <ø> (ø)
statsd 87.36% <ø> (+1.05%) ⬆️
system_core 92.85% <ø> (ø)
system_swap 98.30% <ø> (ø)
tcp_check 91.58% <ø> (ø)
teamcity 80.00% <ø> (ø)
teradata 94.24% <ø> (ø)
tls 91.85% <ø> (+0.84%) ⬆️
tokumx 58.40% <ø> (?)
traffic_server 96.13% <ø> (ø)
twemproxy 79.45% <ø> (ø)
win32_event_log 86.68% <ø> (+0.55%) ⬆️
wmi_check 92.91% <ø> (ø)
yarn 89.14% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@L3n41c L3n41c force-pushed the lenaic/fix_openmetrics_v2_service_check branch from 2b00612 to 0b658e1 Compare October 17, 2022 12:52
clamoriniere
clamoriniere previously approved these changes Oct 17, 2022
```
style run-test: commands[3] | mypy --config-file=../mypy.ini --py2 --check-untyped-defs --follow-imports silent datadog_checks/base/checks/base.py datadog_checks/base/checks/win/wmi/__init__.py datadog_checks/base/checks/win/winpdh_base.py
datadog_checks/base/checks/base.py:213:13: error: Attribute "hostname" already defined on line 211
Found 1 error in 1 file (checked 3 source files)
ERROR: InvocationError for command /home/vsts/work/1/s/datadog_checks_base/.tox/style/bin/mypy --config-file=../mypy.ini --py2 --check-untyped-defs --follow-imports silent datadog_checks/base/checks/base.py datadog_checks/base/checks/win/wmi/__init__.py datadog_checks/base/checks/win/winpdh_base.py (exited with code 1)
```
This reverts commit e9af726.

This rewrite was useless since it was still generating the following error:

```
style run-test: commands[3] | mypy --config-file=../mypy.ini --py2 --check-untyped-defs --follow-imports silent datadog_checks/base/checks/base.py datadog_checks/base/checks/win/wmi/__init__.py datadog_checks/base/checks/win/winpdh_base.py
datadog_checks/base/checks/base.py:212:13: error: Attribute "hostname" already defined on line 210
Found 1 error in 1 file (checked 3 source files)
ERROR: InvocationError for command /home/vsts/work/1/s/datadog_checks_base/.tox/style/bin/mypy --config-file=../mypy.ini --py2 --check-untyped-defs --follow-imports silent datadog_checks/base/checks/base.py datadog_checks/base/checks/win/wmi/__init__.py datadog_checks/base/checks/win/winpdh_base.py (exited with code 1)
```
@L3n41c
Copy link
Member Author

L3n41c commented Oct 18, 2022

I’ve checked some of the CI tests failures and all the ones I looked at failed because of an error like:

E   Error: while running the jmx check: Error while starting api server, exiting: Unable to create the api server: listen tcp 127.0.0.1:41947: bind: address already in use

@L3n41c L3n41c marked this pull request as ready for review October 18, 2022 08:24
@L3n41c L3n41c requested a review from a team as a code owner October 18, 2022 08:24
Copy link
Contributor

@alopezz alopezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have enough context to understand what this is trying to do, but if the change targets OpenMetrics V2, it doesn't sound like we should be touching the base AgentCheck class that affects every integration.

@L3n41c
Copy link
Member Author

L3n41c commented Oct 18, 2022

On Kubernetes, a Service is something that doesn’t run on a specific node or on a specific host. It’s just a virtual IP reachable from anywhere.
As a consequence, service checks can be executed by any arbitrary cluster check runner running on any arbitrary host.
In this case, no host must be attached to the emitted metrics, otherwise, all the host tags of the node running the cluster check runner will be attached to the service check metrics and this is irrelevant.

It’s possible for a specific metric to specify an explicit host thanks to a hostname_label parameter for ex. but in no case the host should default to the one of the cluster check runner.

Because everything described above is generic and applies to all checks, the implementation of this logic is done in the GO code of datadog-agent.

Here is where the empty_default_hostname parameter is systematically added by the cluster agent to all the checks it dispatches to the cluster runners: https://github.com/DataDog/datadog-agent/blob/7.39.2/pkg/clusteragent/clusterchecks/dispatcher_configs.go#L152

This parameter is then used when initializing the senders used by the checks: https://github.com/DataDog/datadog-agent/blob/7.39.2/pkg/collector/python/check.go#L242-L250

The flag set is then used to not default the hostname when not set on a metric: https://github.com/DataDog/datadog-agent/blob/7.39.2/pkg/aggregator/sender.go#L253-L255

Everything described above applies to all checks.

In the light of what the datadog-agent does in the GO code, the python check should set a host only on metrics emitted for a particular host. It should never set the host to the host of the agent because defaulting to the host of the agent is anyhow already done, when needed, in the GO code of datadog-agent (https://github.com/DataDog/datadog-agent/blob/7.39.2/pkg/aggregator/sender.go#L253-L255).

This rule was respected by the V1 version of openmetrics.
But the V2 version of the python openmetrics check was explicitly setting the host to the host on which the cluster check runner is running here: https://github.com/DataDog/integrations-core/blob/7.39.0/datadog_checks_base/datadog_checks/base/checks/openmetrics/v2/scraper.py#L327

Although the bug has been initially reported on an openmetrics service check, the fact that defaulting the host of data to the host of the agent should be delegated to the GO code and not done in the python code applies to all checks.
That’s why I implemented a fix in the base class.

@alopezz
Copy link
Contributor

alopezz commented Oct 18, 2022

To keep context within the PR, as there was some offline discussion:

Checks based on the base AgentCheck already set the hostname to an empty string unless explicitly set when submitting a metric:

if hostname is None:
hostname = ''

The OpenMetricsV2 class is the one that's passing self.hostname along to the agent, which is inconsistent with what AgentCheck does, and, as @L3n41c pointed out, plays against what the agent expects.

Therefore, it seems better to simply make OpenMetricsV2 always set the hostname to an empty string by default to be consistent with AgentCheck and to let the agent handle this responsibility.

@L3n41c L3n41c requested a review from a team as a code owner October 18, 2022 15:07
@L3n41c
Copy link
Member Author

L3n41c commented Oct 18, 2022

@alopezz I’ve just updated this PR to move the fix in the openmetrics v2 specific code base to not touch the AgentCheck base class: https://github.com/DataDog/integrations-core/pull/13146/files#diff-c492f3334379c5b134abbeb99ddea95421fa9a0b9b8bdbf4e510f9a19bd9571f

@alopezz
Copy link
Contributor

alopezz commented Oct 18, 2022

Looks good but a couple of tests are failing, involving aggregator.assert_histogram_bucket assertions. I think changing those from check.hostname to "" should sort that out. That incidentally confirms that the change in behavior is as expected.

EDIT: I'm referring only to tests affecting datadog_checks_base, the rest of failures are not related from what I can see.

@github-actions
Copy link

The validations job has failed; please review the Files changed tab for possible suggestions to resolve.

@L3n41c L3n41c force-pushed the lenaic/fix_openmetrics_v2_service_check branch from c5d0871 to e5db34f Compare October 27, 2022 09:02
@L3n41c
Copy link
Member Author

L3n41c commented Oct 27, 2022

/azp help

@azure-pipelines
Copy link

Supported commands
  • help:
    • Get descriptions, examples and documentation about supported commands
    • Example: help "command_name"
  • list:
    • List all pipelines for this repository using a comment.
    • Example: "list"
  • run:
    • Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
    • Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
  • where:
    • Report back the Azure DevOps orgs that are related to this repository and org
    • Example: "where"

See additional documentation.

@L3n41c
Copy link
Member Author

L3n41c commented Oct 27, 2022

/azp list

Copy link
Contributor

@alopezz alopezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now.

@L3n41c L3n41c merged commit a8b01ea into master Oct 27, 2022
@L3n41c L3n41c deleted the lenaic/fix_openmetrics_v2_service_check branch October 27, 2022 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants