
Memory leak introduced by cold flush #2015

Closed
Betula-L opened this issue Oct 22, 2019 · 3 comments


Betula-L commented Oct 22, 2019

Performance issues

  1. What service is experiencing the performance issue? (M3Coordinator, M3DB, M3Aggregator, etc)
    M3DB

  2. Approximately how many datapoints per second is the service handling?
    2,889,767 per minute (roughly 48,000 per second)

  3. What is the hardware configuration (number CPU cores, amount of RAM, disk size and types, etc) that the service is running on? Is the service the only process running on the host or is it colocated with other software?
    40 cores, 128 GB of RAM. m3db used about 100 GB of memory.

  4. What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).

  client:
    writeConsistencyLevel: majority
    readConsistencyLevel: unstrict_majority

  gcPercentage: 100

  writeNewSeriesAsync: true
  writeNewSeriesLimitPerSecond: 1048576
  writeNewSeriesBackoffDuration: 2ms

  bootstrap:
    # Intentionally disable peers bootstrapper to ensure it doesn't interfere with test.
    bootstrappers:
      - filesystem
      - commitlog
      - peers
      - uninitialized_topology
curl -X POST localhost:7201/api/v1/namespace -d '{
  "name": "prometheus-remote-storage",
  "options": {
    "bootstrapEnabled": true,
    "flushEnabled": true,
    "writesToCommitLog": true,
    "cleanupEnabled": true,
    "snapshotEnabled": true,
    "repairEnabled": false,
    "retentionOptions": {
      "retentionPeriodDuration": "48h",
      "blockSizeDuration": "2h",
      "bufferFutureDuration": "30m",
      "bufferPastDuration": "60m",
      "blockDataExpiry": true,
      "blockDataExpiryAfterNotAccessPeriodDuration": "5m"
    },
    "coldWritesEnabled": false,
    "indexOptions": {
      "enabled": true,
      "blockSizeDuration": "2h"
    }
  }
}'

What is the issue?

I observed m3db memory usage for the last 4 days: it kept increasing slowly until my machine finally went OOM, even though retentionPeriodDuration is only 48h.

I tried to find the root cause of the memory leak and found something interesting.

PR #1624 implemented support for cold flushes; since then, series (including their tags) must be retained longer so that the cold flush Merge function can use them.

Commit a115331 sets NoFinalize on the tags, so the series tags are not garbage collected immediately, whether coldFlush is enabled or not:

seriesTags.NoFinalize()

If coldFlush is enabled, the tags are eventually Finalize()'d in the Merge function, which is triggered by ColdFlush. But if coldFlush is disabled, only WarmFlush is triggered, so Finalize() is never called on the series tags:

res.Finalize()

To validate the above conclusion, I commented out this line (since the series tags do not need to be kept when coldFlush is disabled) and ran the hot-fix version for about 6 hours. There is no significant memory increase anymore.
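Below is a minimal, self-contained sketch of the lifecycle described above. It is not the actual M3DB code; tagPool, seriesTags, and coldFlushEnabled are illustrative names, assuming pool-backed tag memory that is only reclaimed when Finalize() is called by the cold flush merge path:

```go
package main

import "fmt"

// tagPool tracks how many pooled tag buffers are currently checked out.
type tagPool struct{ outstanding int }

func (p *tagPool) get() *seriesTags { p.outstanding++; return &seriesTags{pool: p} }

// seriesTags models pool-backed tag memory that is only returned by Finalize.
type seriesTags struct {
	pool      *tagPool
	finalized bool
}

// Finalize returns the tags to the pool; in this model it is the only way the
// memory is ever reclaimed.
func (t *seriesTags) Finalize() {
	if !t.finalized {
		t.finalized = true
		t.pool.outstanding--
	}
}

func main() {
	pool := &tagPool{}
	coldFlushEnabled := false // matches coldWritesEnabled: false in the namespace above

	for i := 0; i < 100000; i++ {
		tags := pool.get() // the write path holds on to the tags for a later cold flush merge

		if coldFlushEnabled {
			// ColdFlush -> Merge eventually releases the tags (analogous to res.Finalize()).
			tags.Finalize()
		}
		// With cold flush disabled only WarmFlush runs, nothing ever calls
		// Finalize(), and every write leaves a buffer checked out of the pool.
	}

	fmt.Printf("tag buffers never returned to the pool: %d\n", pool.outstanding)
}
```

With coldFlushEnabled set to false the counter grows without bound, which mirrors the steady RSS growth observed above.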

Heap Profiles

Comparison at the same load:
Memory usage on m3dbnode v1.14.1:
[screenshot]

Memory usage with the following line commented out:

seriesTags.NoFinalize()

[screenshot]

The following files contain heap dumps from the last 4 days.
run-50m.zip
run-6h.zip
run-4d.zip
run-1d.zip

[screenshot]

Betula-L changed the title from "High memory occupied by redundant tags copy" to "Memory leak introduced by cold flush" on Oct 25, 2019
@robskillington
Collaborator

This is solid investigation, TY @Betula-L.

There's definitely a clear issue here. We're testing your suggested fix, and also investigating whether the series themselves are being released properly in tickAndExpire (which could similarly cause the same leak if the read/write ref count never drops to zero).
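To illustrate that concern, here is a minimal model (not the actual tickAndExpire implementation; series, refs, and tickAndExpireModel are illustrative names): if a read or write reference is never released, the ref count never reaches zero and the series is retained on every tick.

```go
package main

import "fmt"

// series models an in-memory series with a count of outstanding read/write refs.
type series struct {
	id   string
	refs int
}

// tickAndExpireModel keeps only series that still have outstanding refs; a
// series whose ref count never drops to zero is retained on every tick and
// its memory is never released.
func tickAndExpireModel(active []*series) []*series {
	kept := active[:0]
	for _, s := range active {
		if s.refs > 0 {
			kept = append(kept, s)
			continue
		}
		// refs == 0: the series is expired here and its buffers can be freed.
	}
	return kept
}

func main() {
	leaky := &series{id: "ref-never-released", refs: 1} // e.g. a reader that never decremented
	healthy := &series{id: "properly-released", refs: 0}

	remaining := tickAndExpireModel([]*series{leaky, healthy})
	fmt.Printf("series still retained after the tick: %d\n", len(remaining)) // prints 1
}
```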

@robskillington
Collaborator

robskillington commented Jan 7, 2020

I believe this issue was actually addressed by our change that keeps index blocks from staying mapped into RSS over time, fixed by this PR: #2037

Once we release 0.15.0, would you be open to helping test our changes, @Betula-L?

@Betula-L
Author

Sorry for the late reply. I ran the master version for a week; this memory leak is fixed by #2037.

Comprehensive testing in our system is being planned now. I will report the results as it progresses.
