
Memory leak introduced by cold flush #2015

Closed
Betula-L opened this issue Oct 22, 2019 · 3 comments


Betula-L commented Oct 22, 2019

Performance issues

  1. What service is experiencing the performance issue? (M3Coordinator, M3DB, M3Aggregator, etc)
    M3DB

  2. Approximately how many datapoints per second is the service handling?
    2,889,767 per minute (roughly 48,000 per second)

  3. What is the hardware configuration (number CPU cores, amount of RAM, disk size and types, etc) that the service is running on? Is the service the only process running on the host or is it colocated with other software?
    40 cores, 128 GB of RAM. m3db used about 100 GB of memory.

  4. What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).

  client:
    writeConsistencyLevel: majority
    readConsistencyLevel: unstrict_majority

  gcPercentage: 100

  writeNewSeriesAsync: true
  writeNewSeriesLimitPerSecond: 1048576
  writeNewSeriesBackoffDuration: 2ms

  bootstrap:
    # Intentionally disable peers bootstrapper to ensure it doesn't interfere with test.
    bootstrappers:
      - filesystem
      - commitlog
      - peers
      - uninitialized_topology
curl -X POST localhost:7201/api/v1/namespace -d '{
  "name": "prometheus-remote-storage",
  "options": {
    "bootstrapEnabled": true,
    "flushEnabled": true,
    "writesToCommitLog": true,
    "cleanupEnabled": true,
    "snapshotEnabled": true,
    "repairEnabled": false,
    "retentionOptions": {
      "retentionPeriodDuration": "48h",
      "blockSizeDuration": "2h",
      "bufferFutureDuration": "30m",
      "bufferPastDuration": "60m",
      "blockDataExpiry": true,
      "blockDataExpiryAfterNotAccessPeriodDuration": "5m"
    },
    "coldWritesEnabled": false,
    "indexOptions": {
      "enabled": true,
      "blockSizeDuration": "2h"
    }
  }
}'

What is the issue?

I observed m3db memory usage for the last 4 days: it kept increasing slowly until my machine finally went OOM, even though retentionPeriodDuration is only 48h.

I tried to find the root cause of the memory leak and found something interesting.

PR #1624 implemented support for cold flushes; since then, series (including their tags) must be retained longer so that the cold flush Merge function can use them.

Commit a115331 sets NoFinalize on the tags, so the series tags are not garbage collected immediately, whether coldFlush is enabled or not:

seriesTags.NoFinalize()

If coldFlush is enabled, the tags are eventually Finalize()'d in the Merge function, which is triggered by ColdFlush. But if coldFlush is disabled, only WarmFlush is triggered, so Finalize() is never called on the series tags:

res.Finalize()

To validate the above conclusion, I commented out this line (since the series tags do not need to be kept when coldFlush is disabled) and ran the hot-fix version for about 6 hours. There is no significant memory increase anymore.
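Below is a minimal, self-contained sketch of the lifecycle described above. It is not the actual M3DB code; tagPool, seriesTags, and coldFlushEnabled are illustrative names, assuming pool-backed tag memory that is only reclaimed when Finalize() is called by the cold flush merge path:

```go
package main

import "fmt"

// tagPool tracks how many pooled tag buffers are currently checked out.
type tagPool struct{ outstanding int }

func (p *tagPool) get() *seriesTags { p.outstanding++; return &seriesTags{pool: p} }

// seriesTags models pool-backed tag memory that is only returned by Finalize.
type seriesTags struct {
	pool      *tagPool
	finalized bool
}

// Finalize returns the tags to the pool; in this model it is the only way the
// memory is ever reclaimed.
func (t *seriesTags) Finalize() {
	if !t.finalized {
		t.finalized = true
		t.pool.outstanding--
	}
}

func main() {
	pool := &tagPool{}
	coldFlushEnabled := false // matches coldWritesEnabled: false in the namespace above

	for i := 0; i < 100000; i++ {
		tags := pool.get() // the write path holds on to the tags for a later cold flush merge

		if coldFlushEnabled {
			// ColdFlush -> Merge eventually releases the tags (analogous to res.Finalize()).
			tags.Finalize()
		}
		// With cold flush disabled only WarmFlush runs, nothing ever calls
		// Finalize(), and every write leaves a buffer checked out of the pool.
	}

	fmt.Printf("tag buffers never returned to the pool: %d\n", pool.outstanding)
}
```

With coldFlushEnabled set to false the counter grows without bound, which mirrors the steady RSS growth observed above.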

Heap Profiles

Comparison at the same load:
Memory usage on m3dbnode v1.14.1:
[screenshot]

Memory usage with the following line commented out:

seriesTags.NoFinalize()

[screenshot]

The following files contain heap dumps from the last 4 days.
run-50m.zip
run-6h.zip
run-4d.zip
run-1d.zip

[screenshot]

Betula-L changed the title from "High memory occupied by redundant tags copy" to "Memory leak introduced by cold flush" on Oct 25, 2019
@robskillington
Collaborator

This is solid investigation, TY @Betula-L.

There's definitely a clear issue here. We're testing your suggested fix, and also investigating whether the series themselves are being released properly in tickAndExpire (which could similarly cause the same leak if the read/write ref count never drops to zero).
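To illustrate that concern, here is a minimal model (not the actual tickAndExpire implementation; series, refs, and tickAndExpireModel are illustrative names): if a read or write reference is never released, the ref count never reaches zero and the series is retained on every tick.

```go
package main

import "fmt"

// series models an in-memory series with a count of outstanding read/write refs.
type series struct {
	id   string
	refs int
}

// tickAndExpireModel keeps only series that still have outstanding refs; a
// series whose ref count never drops to zero is retained on every tick and
// its memory is never released.
func tickAndExpireModel(active []*series) []*series {
	kept := active[:0]
	for _, s := range active {
		if s.refs > 0 {
			kept = append(kept, s)
			continue
		}
		// refs == 0: the series is expired here and its buffers can be freed.
	}
	return kept
}

func main() {
	leaky := &series{id: "ref-never-released", refs: 1} // e.g. a reader that never decremented
	healthy := &series{id: "properly-released", refs: 0}

	remaining := tickAndExpireModel([]*series{leaky, healthy})
	fmt.Printf("series still retained after the tick: %d\n", len(remaining)) // prints 1
}
```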

@robskillington
Collaborator

robskillington commented Jan 7, 2020

I believe this issue was actually addressed by our change that keeps index blocks from staying mapped into RSS over time, fixed by this PR: #2037

Once we release 0.15.0, would you be open to helping test our changes, @Betula-L?

@Betula-L
Author

Sorry for the late reply. I ran the master version for a week; this memory leak is fixed by #2037.

Comprehensive testing in our system is being planned now. I will report the results as it progresses.
