Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hoststats: add package for collecting host statistics including cpu memory and disk usage #17038

Merged
merged 9 commits into from
May 30, 2023

Conversation

nickethier
Copy link
Member

@nickethier nickethier commented Apr 19, 2023

Description

This PR adds the ability for Consul to report host statistics through go-metrics. These metrics include cpu utilization by core and in aggregate, system memory usage and disk usage for the disk backing the configured data_dir.

Testing & Reproduction steps

Tests were added to assert cpu utilization calculation is performed accurately. Manual testing was performed on linux and darwin to ensure reported host stats values were accurate.

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

@github-actions github-actions bot added the theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics label Apr 19, 2023
@nickethier nickethier force-pushed the telemetry/host-stats branch from 2fdd9e4 to 418f04b Compare May 12, 2023 20:18
@nickethier nickethier marked this pull request as ready for review May 12, 2023 20:23
@nickethier nickethier requested a review from a team as a code owner May 12, 2023 20:23
@nickethier nickethier force-pushed the telemetry/host-stats branch 2 times, most recently from 488a7e6 to a9a5165 Compare May 16, 2023 23:58
@nickethier nickethier requested review from loshz, jmurret and Achooo May 17, 2023 18:17
loshz
loshz previously requested changes May 19, 2023
Copy link
Contributor

@loshz loshz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this is a good first attempt that we can iterate on in the future. I've left a couple of nits/change requests, and I think you might have missed a few steps from this doc.

Couple of open questions:

  • Are we defaulting to host metrics being enabled or disabled?
  • Can you upload some sort of sample of the metrics? As right now it's hard for me to test this locally.
  • We should really add some sort of integration testing strategy. I know it's borderline impossible to assert these metrics, but we can at least attempt to run the collector on different platforms.
  • We should probably add a document that outlines the metrics we are collecting and the format as well as reasoning.

.changelog/17038.txt Outdated Show resolved Hide resolved
agent/setup.go Outdated Show resolved Hide resolved
agent/setup.go Outdated Show resolved Hide resolved
lib/hoststats/collector.go Outdated Show resolved Hide resolved
lib/hoststats/collector.go Outdated Show resolved Hide resolved
lib/hoststats/cpu.go Outdated Show resolved Hide resolved
lib/hoststats/cpu.go Outdated Show resolved Hide resolved
"github.com/stretchr/testify/require"
)

func TestHostStats_CPU(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure of a better approach, but I don't like the thought of this being a unit test. We're essentially spinning a CPU core to generate some load, and this took almost 13s to run on my local machine.

Maybe we need a whole new integration test suite for this kind of stuff?

lib/hoststats/host.go Outdated Show resolved Hide resolved
website/content/docs/agent/telemetry.mdx Outdated Show resolved Hide resolved
@nickethier
Copy link
Member Author

Thanks for the review @loshz! I addressed your comments and answered your questions below.

Overall, I think this is a good first attempt that we can iterate on in the future. I've left a couple of nits/change requests, >and I think you might have missed a few steps from this doc.

Hmm I'm not sure I see a step I missed. There is no RuntimeConfig change because the field gets parsed into the TelemetryConfig struct like other telemetry fields.

Couple of open questions:

  • Are we defaulting to host metrics being enabled or disabled?

Yes defaulting to enabled. go-psutil takes care of the platform specifics.

  • Can you upload some sort of sample of the metrics? As right now it's hard for me to test this locally.

Sure heres what I've been testing with:

# starting a dev agent some extra config to enable prometheus format
$> consul agent -dev -hcl 'telemetry{ prometheus_retention_time = "1h" disable_hostname = true } data_dir="/Users/nethier/tmp"'

# in another terminal, query the metrics for any consul_host prefixed metrics
$> curl 'http://localhost:8500/v1/agent/metrics?format=prometheus' | grep consul_host
# HELP consul_host_cpu_idle Idle cpu utilization
# TYPE consul_host_cpu_idle gauge
consul_host_cpu_idle 0
consul_host_cpu_idle{cpu="cpu0"} 81.93743896484375
consul_host_cpu_idle{cpu="cpu1"} 99.5999984741211
consul_host_cpu_idle{cpu="cpu10"} 94.08817291259766
consul_host_cpu_idle{cpu="cpu11"} 99.69999694824219
consul_host_cpu_idle{cpu="cpu12"} 93.98797607421875
consul_host_cpu_idle{cpu="cpu13"} 99.69999694824219
consul_host_cpu_idle{cpu="cpu14"} 96.39639282226562
consul_host_cpu_idle{cpu="cpu15"} 99.80000305175781
consul_host_cpu_idle{cpu="cpu2"} 86.8951644897461
consul_host_cpu_idle{cpu="cpu3"} 99.69969940185547
consul_host_cpu_idle{cpu="cpu4"} 89.74874114990234
consul_host_cpu_idle{cpu="cpu5"} 99.79959869384766
consul_host_cpu_idle{cpu="cpu6"} 91.3741226196289
consul_host_cpu_idle{cpu="cpu7"} 99.69999694824219
consul_host_cpu_idle{cpu="cpu8"} 92.8786392211914
consul_host_cpu_idle{cpu="cpu9"} 99.60040283203125
# HELP consul_host_cpu_iowait Iowait cpu utilization
# TYPE consul_host_cpu_iowait gauge
consul_host_cpu_iowait 0
consul_host_cpu_iowait{cpu="cpu0"} 0
consul_host_cpu_iowait{cpu="cpu1"} 0
consul_host_cpu_iowait{cpu="cpu10"} 0
consul_host_cpu_iowait{cpu="cpu11"} 0
consul_host_cpu_iowait{cpu="cpu12"} 0
consul_host_cpu_iowait{cpu="cpu13"} 0
consul_host_cpu_iowait{cpu="cpu14"} 0
consul_host_cpu_iowait{cpu="cpu15"} 0
consul_host_cpu_iowait{cpu="cpu2"} 0
consul_host_cpu_iowait{cpu="cpu3"} 0
consul_host_cpu_iowait{cpu="cpu4"} 0
consul_host_cpu_iowait{cpu="cpu5"} 0
consul_host_cpu_iowait{cpu="cpu6"} 0
consul_host_cpu_iowait{cpu="cpu7"} 0
consul_host_cpu_iowait{cpu="cpu8"} 0
consul_host_cpu_iowait{cpu="cpu9"} 0
# HELP consul_host_cpu_system System cpu utilization
# TYPE consul_host_cpu_system gauge
consul_host_cpu_system 0
consul_host_cpu_system{cpu="cpu0"} 6.35721492767334
consul_host_cpu_system{cpu="cpu1"} 0.30000001192092896
consul_host_cpu_system{cpu="cpu10"} 2.605210304260254
consul_host_cpu_system{cpu="cpu11"} 0.20000000298023224
consul_host_cpu_system{cpu="cpu12"} 2.8056111335754395
consul_host_cpu_system{cpu="cpu13"} 0.30000001192092896
consul_host_cpu_system{cpu="cpu14"} 1.30130136013031
consul_host_cpu_system{cpu="cpu15"} 0.20000000298023224
consul_host_cpu_system{cpu="cpu2"} 4.536290168762207
consul_host_cpu_system{cpu="cpu3"} 0.20020020008087158
consul_host_cpu_system{cpu="cpu4"} 3.718592882156372
consul_host_cpu_system{cpu="cpu5"} 0.20040079951286316
consul_host_cpu_system{cpu="cpu6"} 2.808425188064575
consul_host_cpu_system{cpu="cpu7"} 0.20000000298023224
consul_host_cpu_system{cpu="cpu8"} 2.407221555709839
consul_host_cpu_system{cpu="cpu9"} 0.2997002899646759
# HELP consul_host_cpu_total Total cpu utilization
# TYPE consul_host_cpu_total gauge
consul_host_cpu_total 0
consul_host_cpu_total{cpu="cpu0"} 18.062562942504883
consul_host_cpu_total{cpu="cpu1"} 0.4000000059604645
consul_host_cpu_total{cpu="cpu10"} 5.911823749542236
consul_host_cpu_total{cpu="cpu11"} 0.30000001192092896
consul_host_cpu_total{cpu="cpu12"} 6.01202392578125
consul_host_cpu_total{cpu="cpu13"} 0.30000001192092896
consul_host_cpu_total{cpu="cpu14"} 3.6036036014556885
consul_host_cpu_total{cpu="cpu15"} 0.20000000298023224
consul_host_cpu_total{cpu="cpu2"} 13.104838371276855
consul_host_cpu_total{cpu="cpu3"} 0.3003003001213074
consul_host_cpu_total{cpu="cpu4"} 10.251255989074707
consul_host_cpu_total{cpu="cpu5"} 0.20040079951286316
consul_host_cpu_total{cpu="cpu6"} 8.625877380371094
consul_host_cpu_total{cpu="cpu7"} 0.30000001192092896
consul_host_cpu_total{cpu="cpu8"} 7.121364116668701
consul_host_cpu_total{cpu="cpu9"} 0.39960038661956787
# HELP consul_host_cpu_user User cpu utilization
# TYPE consul_host_cpu_user gauge
consul_host_cpu_user 0
consul_host_cpu_user{cpu="cpu0"} 11.705348014831543
consul_host_cpu_user{cpu="cpu1"} 0.10000000149011612
consul_host_cpu_user{cpu="cpu10"} 3.3066132068634033
consul_host_cpu_user{cpu="cpu11"} 0.10000000149011612
consul_host_cpu_user{cpu="cpu12"} 3.2064127922058105
consul_host_cpu_user{cpu="cpu13"} 0
consul_host_cpu_user{cpu="cpu14"} 2.302302360534668
consul_host_cpu_user{cpu="cpu15"} 0
consul_host_cpu_user{cpu="cpu2"} 8.568548202514648
consul_host_cpu_user{cpu="cpu3"} 0.10010010004043579
consul_host_cpu_user{cpu="cpu4"} 6.532663345336914
consul_host_cpu_user{cpu="cpu5"} 0
consul_host_cpu_user{cpu="cpu6"} 5.817452430725098
consul_host_cpu_user{cpu="cpu7"} 0.10000000149011612
consul_host_cpu_user{cpu="cpu8"} 4.714142322540283
consul_host_cpu_user{cpu="cpu9"} 0.09990009665489197
# HELP consul_host_disk_available Available bytes on disk
# TYPE consul_host_disk_available gauge
consul_host_disk_available 0
consul_host_disk_available{path="/Users/nethier/tmp"} 7.7038125056e+10
# HELP consul_host_disk_inodes_percent Percentage of disk inodes usage
# TYPE consul_host_disk_inodes_percent gauge
consul_host_disk_inodes_percent 0
consul_host_disk_inodes_percent{path="/Users/nethier/tmp"} 0.5227161049842834
# HELP consul_host_disk_size Size of disk in bytes
# TYPE consul_host_disk_size gauge
consul_host_disk_size 0
consul_host_disk_size{path="/Users/nethier/tmp"} 4.9996316672e+11
# HELP consul_host_disk_used Disk usage in bytes
# TYPE consul_host_disk_used gauge
consul_host_disk_used 0
consul_host_disk_used{path="/Users/nethier/tmp"} 4.22925041664e+11
# HELP consul_host_disk_used_percent Percentage of disk space usage
# TYPE consul_host_disk_used_percent gauge
consul_host_disk_used_percent 0
consul_host_disk_used_percent{path="/Users/nethier/tmp"} 84.59123992919922
# HELP consul_host_memory_available Available physical memory in bytes
# TYPE consul_host_memory_available gauge
consul_host_memory_available 1.3143601152e+10
# HELP consul_host_memory_free Free physical memory in bytes
# TYPE consul_host_memory_free gauge
consul_host_memory_free 3.174227968e+09
# HELP consul_host_memory_total Total physical memory in bytes
# TYPE consul_host_memory_total gauge
consul_host_memory_total 3.4359738368e+10
# HELP consul_host_memory_used Used physical memory in bytes
# TYPE consul_host_memory_used gauge
consul_host_memory_used 2.1216137216e+10
# HELP consul_host_memory_used_percent Percentage of physical memory in use
# TYPE consul_host_memory_used_percent gauge
consul_host_memory_used_percent 61.74708557128906
# HELP consul_host_uptime System uptime
# TYPE consul_host_uptime gauge
consul_host_uptime 633771
  • We should really add some sort of integration testing strategy. I know it's borderline impossible to assert these metrics, but we can at least attempt to run the collector on different platforms.

I refactored the test a bit to be more conservative. I'm hoping long term we can cover this in our e2e testing of the hcp telemetry components.

  • We should probably add a document that outlines the metrics we are collecting and the format as well as reasoning.

Any suggestions or prior art to help give me an idea of where and what level of detail something like this would be?

@nickethier nickethier requested a review from loshz May 23, 2023 02:16
@dhiaayachi
Copy link
Collaborator

👋 @nickethier,
This LGTM code wise.

Can you please give more details on the reasoning behind adding those metrics to be collected by Consul versus using another mean of collecting the host data (an external exporter/collector)?

The reason behind my question is that I think this add an extra concern around architecture compatibility. I see that the used library have a wide compatibility architecture range, but from Consul perspective we still need to verify that all the architectures that we support are working correctly and currently we don't have a great coverage for those different architectures.

I'm also not sure how accurate those metrics will be in K8S or in containers in general 🤔

@nickethier
Copy link
Member Author

The reason behind my question is that I think this add an extra concern around architecture compatibility. I see that the used library have a wide compatibility architecture range, but from Consul perspective we still need to verify that all the architectures that we support are working correctly and currently we don't have a great coverage for those different architectures.

As I understand this library (gopsutil) has been used in Consul for many years. However we are likely calling functions that had not been used before. That being said it is also used across many different architectures for similar purposes in Nomad, where the origins of this code was inspired from.

In cases where there isn't platform support I expect the metric to read as 0. For example the above sample is from my local machine running macOS. On this platform cpu iowait time is not supported.

I'm also not sure how accurate those metrics will be in K8S or in containers in general 🤔

I called this out in the docs. Perhaps there is room in the future to iterate on this and report any container limits in place but this is very platform specific and not something gopsutil supports and would have to implemented.

Copy link
Collaborator

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for answering my questions @nickethier 🙏

LGTM!

@nickethier nickethier enabled auto-merge (squash) May 29, 2023 16:38
@nickethier nickethier force-pushed the telemetry/host-stats branch from 4307668 to bf69a82 Compare May 30, 2023 18:32
@nickethier nickethier dismissed loshz’s stale review May 30, 2023 18:36

@loshz is OOO this week. changes and comments were reviewed by @dhiaayachi and @boxofrad (thanks!)

@nickethier nickethier merged commit 44f9013 into main May 30, 2023
@nickethier nickethier deleted the telemetry/host-stats branch May 30, 2023 18:43
@nickethier nickethier added backport/1.15 This release series is no longer active on CE. Use backport/ent/1.15. and removed pr/no-backport labels May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.15 This release series is no longer active on CE. Use backport/ent/1.15. theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants