Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/hostmetrics] Add available memory state (optionally) #19149

Conversation

lootek
Copy link
Contributor

@lootek lootek commented Mar 1, 2023

Description:

Add available memory state to hostmetricsreceiver.
Add enable_additional_memory_states configuration option to memoryscraper that controls reporting of memory state metrics: slab_reclaimable, slab_unreclaimable and available. This is because when those are enabled, all the memory states don't sum up well (the result exceeds total memory).

Link to tracking Issue:
#7417

Testing:
Updated the relevant memoryscraper unit tests.

Documentation:
The hostmetricsreceiver/memoryscraper docs were updated accordingly.

@lootek lootek requested a review from a team March 1, 2023 07:02
@lootek lootek requested a review from dmitryax as a code owner March 1, 2023 07:02
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 1, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@lootek lootek changed the title [receiver/hostmetrics] Add available memory state [receiver/hostmetrics] Add available memory state (optionally) Mar 1, 2023
@runforesight
Copy link

runforesight bot commented Mar 12, 2023

Foresight Summary

    
Major Impacts

build-and-test duration(36 minutes 41 seconds) has decreased 27 minutes 47 seconds compared to main branch avg(1 hour 4 minutes 28 seconds).
View More Details

⭕  build-and-test-windows workflow has finished in 5 seconds (41 minutes 41 seconds less than main branch avg.) and finished at 14th Mar, 2023.


Job Failed Steps Tests
windows-unittest-matrix -     🔗  N/A See Details
windows-unittest -     🔗  N/A See Details

✅  telemetrygen workflow has finished in 1 minute 16 seconds (51 seconds less than main branch avg.) and finished at 14th Mar, 2023.


Job Failed Steps Tests
build-dev -     🔗  N/A See Details
publish-latest -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

✅  check-links workflow has finished in 1 minute 27 seconds and finished at 14th Mar, 2023.


Job Failed Steps Tests
changed files -     🔗  N/A See Details
check-links -     🔗  N/A See Details

❌  build-and-test workflow has finished in 36 minutes 41 seconds (27 minutes 47 seconds less than main branch avg.) and finished at 14th Mar, 2023. 1 job failed.


Job Failed Steps Tests
unittest-matrix (1.19, connector) -     🔗  ✅ 113  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, connector) -     🔗  ✅ 113  ❌ 0  ⏭ 0    🔗 See Details
correctness-metrics -     🔗  ✅ 2  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, internal) -     🔗  ✅ 583  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, internal) -     🔗  ✅ 583  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, extension) -     🔗  ✅ 538  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, processor) -     🔗  ✅ 1545  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, extension) -     🔗  ✅ 538  ❌ 0  ⏭ 0    🔗 See Details
correctness-traces -     🔗  ✅ 17  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, processor) -     🔗  ✅ 1545  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, receiver-0) -     🔗  ✅ 2609  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-0) -     🔗  ✅ 2609  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, exporter) -     🔗  ✅ 2488  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, exporter) -     🔗  ✅ 2488  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-1) -     🔗  ✅ 1940  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, receiver-1) -     🔗  ✅ 1940  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.20, other) -     🔗  ✅ 4728  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, other) -     🔗  ✅ 4728  ❌ 0  ⏭ 0    🔗 See Details
integration-tests -     🔗  ✅ 55  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details
check-codeowners -     🔗  N/A See Details
check-collector-module-version -     🔗  N/A See Details
checks CodeGen     🔗  N/A See Details
lint-matrix (receiver-0) -     🔗  N/A See Details
lint-matrix (receiver-1) -     🔗  N/A See Details
lint-matrix (processor) -     🔗  N/A See Details
lint-matrix (exporter) -     🔗  N/A See Details
lint-matrix (extension) -     🔗  N/A See Details
lint-matrix (connector) -     🔗  N/A See Details
lint-matrix (internal) -     🔗  N/A See Details
lint-matrix (other) -     🔗  N/A See Details
build-examples -     🔗  N/A See Details
lint -     🔗  N/A See Details
unittest (1.20) -     🔗  N/A See Details
unittest (1.19) -     🔗  N/A See Details
cross-compile (darwin, amd64) -     🔗  N/A See Details
cross-compile (darwin, arm64) -     🔗  N/A See Details
cross-compile (linux, 386) -     🔗  N/A See Details
cross-compile (linux, amd64) -     🔗  N/A See Details
cross-compile (linux, arm) -     🔗  N/A See Details
cross-compile (linux, arm64) -     🔗  N/A See Details
cross-compile (linux, ppc64le) -     🔗  N/A See Details
cross-compile (windows, amd64) -     🔗  N/A See Details
cross-compile (windows, 386) -     🔗  N/A See Details
build-package (rpm) -     🔗  N/A See Details
windows-msi -     🔗  N/A See Details
build-package (deb) -     🔗  N/A See Details
publish-check -     🔗  N/A See Details
publish-dev -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

✅  changelog workflow has finished in 2 minutes 39 seconds and finished at 14th Mar, 2023.


Job Failed Steps Tests
changelog -     🔗  N/A See Details

✅  load-tests workflow has finished in 6 minutes 49 seconds (6 minutes 27 seconds less than main branch avg.) and finished at 14th Mar, 2023.


Job Failed Steps Tests
loadtest (TestTraceAttributesProcessor) -     🔗  ✅ 3  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestIdleMode) -     🔗  ✅ 1  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetric10kDPS|TestMetricsFromFile) -     🔗  ✅ 6  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceNoBackend10kSPS|TestTrace1kSPSWithAttrs) -     🔗  ✅ 8  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetricResourceProcessor|TestTrace10kSPS) -     🔗  ✅ 12  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestBallastMemory|TestLog10kDPS) -     🔗  ✅ 18  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details
loadtest (TestTraceBallast1kSPSWithAttrs|TestTraceBallast1kSPSAddAttrs) -     🔗  ✅ 10  ❌ 0  ⏭ 0    🔗 See Details

✅  prometheus-compliance-tests workflow has finished in 3 minutes 31 seconds (4 minutes less than main branch avg.) and finished at 14th Mar, 2023.


Job Failed Steps Tests
prometheus-compliance-tests -     🔗  ✅ 21  ❌ 0  ⏭ 0    🔗 See Details

✅  e2e-tests workflow has finished in 13 minutes 28 seconds and finished at 14th Mar, 2023.


Job Failed Steps Tests
kubernetes-test (v1.26.0) -     🔗  N/A See Details
kubernetes-test (v1.25.3) -     🔗  N/A See Details
kubernetes-test (v1.24.7) -     🔗  N/A See Details
kubernetes-test (v1.23.13) -     🔗  N/A See Details

🔎 See details on Foresight

*You can configure Foresight comments in your organization settings page.

Copy link
Member

@dmitryax dmitryax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a safe solution, but not 100% sure it's sustainable in the long run.

@rogercoll @djaglowski WDYT?

Another option would be to introduce additional optional metrics like system.linux.memory.slab and system.memory.available. BTW is the available state value useful on other platforms? If not, it can be system.linux.memory.available

@lootek
Copy link
Contributor Author

lootek commented Mar 13, 2023

BTW is the available state value useful on other platforms?

Looks like it is: https://github.com/shirou/gopsutil/blob/c7c548d6bdb46b39cd36ae0815926de70ab1b8ca/mem/mem.go#L22

@@ -88,6 +88,15 @@ load:
cpu_average: <false|true>
```

### Memory

`enable_additional_memory_states` specifies whether to enable reporting of a couple of additional memory state metrics: `slab_reclaimable`, `slab_unreclaimable` and `available`. When those are enabled, all the memory states don't sum up well (the result exceeds `total` memory) (default: `false`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a hill I'm going to die on, but in my opinion we should be looking to replace this metric with one or more metrics that use attributes appropriately to differentiate portions of a meaningful total.

I'm not an expert at memory management but it seems someone should be able to clearly state the relationships between these values. As a (probably wrong) starting point:

  • total = buffered + cached + inactive + free + used
  • slab = slab_reclaimable + slab_unreclaimable
  • available = free + slab_reclaimable?

If we had a list of these relationships, it should be fairly easy to identify a correct data model. Without it, I don't know how to evaluate such a solution except to say that overloading a metric has never been necessary elsewhere, as far as I am aware. Again, this isn't intended to be blocking - just trying to articulate why in my opinion we are having to invent novel ways to tweak a metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right, we're dealing with a complex issue that doesn't have a simple and universal solution. However, I believe that including the ability to report available memory is a valuable feature for users, despite its potential drawbacks.

@lootek lootek requested a review from dmitryax March 13, 2023 16:11
@rogercoll
Copy link
Contributor

Another option would be to introduce additional optional metrics like system.linux.memory.slab and system.memory.available. BTW is the available state value useful on other platforms? If not, it can be system.linux.memory.available

Memory available only has a unique value for Linux systems, as it was introduced by the kernel in version 3.16. For other platforms, it seems that Gopsutils is computing the value from others:

I would consider introducing system.linux or only provide the metric for Linux systems to prevent duplications. For other platforms, we could add some documentation on how to extract the equivalent value.

@lootek
Copy link
Contributor Author

lootek commented Mar 14, 2023

@rogercoll thank you for sharing your thoughts on this topic! I do agree with some of what you've said, but there's another issue that I think we should consider as well. While it's true that the Linux kernel provides the available memory state exclusively, there are also memory states that are specific to other OSes. To me, unifying our approach would require some significant changes to the current memory scraper metrics/codebase. What do you think? Do you agree it'd be better to address this as a separate issue or pull request?

@lootek
Copy link
Contributor Author

lootek commented Mar 21, 2023

@dmitryax @djaglowski what do you think?

@dmitryax
Copy link
Member

I'd lean toward introducing new optional metrics system.memory.slab and system.linux.memory.available instead of having optional attributes violating the "aggregation meaningfulness" requirement.

It should provide users with a better source in case they want to utilize both available and total linux memory metrics.

@lootek
Copy link
Contributor Author

lootek commented Mar 23, 2023

I'd lean toward introducing new optional metrics system.memory.slab and system.linux.memory.available instead of having optional attributes violating the "aggregation meaningfulness" requirement.

It should provide users with a better source in case they want to utilize both available and total linux memory metrics.

Fine, I can rewrite my PR according to this then. Just to be sure:

  • system.memory.slab is going to be a new metric with no differentiation between reclaimable and unreclaimable (having those available from the previous place). Also, to my understanding slab refers to the memory used by kernel specifically so it should be probably also available for linux (or unix-based) OSes only?
  • system.linux.memory.available is gonna be reported for linux only? What about the other unix-based OSes that have memory available supported by gopsutil?

@dmitryax
Copy link
Member

system.memory.slab is going to be a new metric with no differentiation between reclaimable and unreclaimable (having those available from the previous place). Also, to my understanding slab refers to the memory used by kernel specifically so it should be probably also available for linux (or unix-based) OSes only?

Right, this should be available on Linux only, but I don't think we need to have linux prefix in the name because we don't have a conflicting meaning of slab as per the spec. The metric should have the same state attribute with reclaimable and unreclaimable values.

system.linux.memory.available is gonna be reported for linux only? What about the other unix-based OSes that have memory available supported by gopsutil?

As pointed out by @rogercoll, it looks like the metric value is unique for Linux only, and other Unix-based OSes like FreeBSD can get that value deferred from other states.

I'd like to get an agreement on this from @rogercoll and @djaglowski before starting implementation. We should also start by adding system.memory.slab to the spec to get more people involved.

Also, these issues should be handled as separate follow-up PRs. Let's keep this one for the discussion.

And thank you, @lootek, for helping with that! I think we are getting closer to finally resolving #7417

@lootek
Copy link
Contributor Author

lootek commented Mar 24, 2023

system.memory.slab is going to be a new metric with no differentiation between reclaimable and unreclaimable (having those available from the previous place). Also, to my understanding slab refers to the memory used by kernel specifically so it should be probably also available for linux (or unix-based) OSes only?

Right, this should be available on Linux only, but I don't think we need to have linux prefix in the name because we don't have a conflicting meaning of slab as per the spec. The metric should have the same state attribute with reclaimable and unreclaimable values.

So is this gonna duplicate what we currently have? Or replace the existing memory states at some point?

system.linux.memory.available is gonna be reported for linux only? What about the other unix-based OSes that have memory available supported by gopsutil?

As pointed out by @rogercoll, it looks like the metric value is unique for Linux only, and other Unix-based OSes like FreeBSD can get that value deferred from other states.

ok, this makes sense. Unless we want to provide memory available for those as a kind of "handy shortcut".

I'd like to get an agreement on this from @rogercoll and @djaglowski before starting implementation. We should also start by adding system.memory.slab to the spec to get more people involved.

Good point, for sure we should!

Also, these issues should be handled as separate follow-up PRs. Let's keep this one for the discussion.

ok 👍

And thank you, @lootek, for helping with that! I think we are getting closer to finally resolving #7417

My pleasure :) And if possible I'd like to take part in having this finally resolved.

@dmitryax
Copy link
Member

So is this gonna duplicate what we currently have? Or replace the existing memory states at some point?

We will need to remove slab_reclaimable and slab_unreclaimable states from system.memory.usage metric and move them to system.memory.slab.

@lootek
Copy link
Contributor Author

lootek commented Mar 27, 2023

So is this gonna duplicate what we currently have? Or replace the existing memory states at some point?

We will need to remove slab_reclaimable and slab_unreclaimable states from system.memory.usage metric and move them to system.memory.slab.

Since such removal would be a breaking change I guess it'll be a typical 2-stage process (first the deprecation, then actiual removal)? Or are we gonna add a config setting like I did here in this PR?

@rogercoll
Copy link
Contributor

rogercoll commented Mar 27, 2023

I would not tie the metrics with the values provided by Gopsutils (the underlying library could be changed), instead, to the ones provided by each kernel. If a user requires those metrics to be provided for all the systems by the collector, the user could use the metrics generation processor to generate the new ones (at the moment does not support metric attributes, I can open an issue/pr for that) as they are currently being provided.

Specs issue to add the new system.memory.slab metric into the specification, which will solve the first issue of system.memory states not summing up to the total value. Please feel free to add extra information or change the proposal.

@dmitryax
Copy link
Member

Since such removal would be a breaking change I guess it'll be a typical 2-stage process (first the deprecation, then actiual removal)? Or are we gonna add a config setting like I did here in this PR?

Right, it should be 2-step approach

  1. Add system.memory.slab (optional) + add a warning for system.memory.usage saying that slab attributes are deprecated and will be removed, use system.memory.slab instead.
  2. Remove slab attributes from system.memory.usage

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 26, 2023
@dmitryax dmitryax removed the Stale label May 26, 2023
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 10, 2023
@github-actions
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants