
Adding new m3db namespace causes partial cluster OOM #2155

Closed
mmedvede opened this issue Feb 18, 2020 · 2 comments · Fixed by #2169
@mmedvede
Contributor

mmedvede commented Feb 18, 2020

This has been happening with both v0.14.2 and v0.15.0-rc0

Most recent example:

The cluster has 18 db nodes with RF=3. It has a few namespaces set up and is ingesting about 200k metrics/s. As soon as a new namespace is added via the namespace API, 14 of the 18 nodes bootstrap the new namespace without issue, while the remaining 4 see the new namespace added but never start bootstrapping it. At the same time, memory usage and goroutine count rise sharply on these 4 nodes until all of them OOM and begin a full bootstrap. There is nothing in the log files of these nodes between the point where they see the new namespace and the point where the process is killed.

Screenshot_2020-02-18

m3dbnode-config.yml.zip
ns.json.zip
placement.json.zip

Was able to reproduce on a cluster with more than a few m3db nodes:

  1. Initialize a cluster with any namespace
  2. Let it ingest some data
  3. Add new namespace
  4. Observe bootstrap problems and OOM
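For step 3, the namespace was added through the coordinator's namespace API (POST to /api/v1/services/m3db/namespace). A minimal sketch of such a request body, assuming a shape similar to the attached ns.json; the name "2d" matches the namespace seen in the logs below, but the retention and block-size values here are illustrative, not the exact ones used:

```json
{
  "name": "2d",
  "options": {
    "bootstrapEnabled": true,
    "flushEnabled": true,
    "writesToCommitLog": true,
    "cleanupEnabled": true,
    "snapshotEnabled": true,
    "repairEnabled": false,
    "retentionOptions": {
      "retentionPeriodDuration": "48h",
      "blockSizeDuration": "2h",
      "bufferFutureDuration": "10m",
      "bufferPastDuration": "10m",
      "blockDataExpiry": true,
      "blockDataExpiryAfterNotAccessPeriodDuration": "5m"
    },
    "indexOptions": {
      "enabled": true,
      "blockSizeDuration": "2h"
    }
  }
}
```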
@notbdu
Contributor

notbdu commented Feb 19, 2020

Just to confirm, you saw no logs on the OOMed nodes between the namespace update and the OOM?

For example no logs like:

"updating database namespace schema"

Trying to get an idea of where the namespace update gets stuck.

@mmedvede
Contributor Author

Yes, nothing from the dbnode process after the namespace update and before the OOM:

Feb 18 16:22:41 dbnode-816706098-12-852752506 m3dbnode[126899]: {"level":"warn","ts":1582042961.7061825,"msg":"skipping namespace removals and updates (except schema updates), restart process if you want changes to take effect."}
Feb 18 16:52:02 dbnode-816706098-12-852752506 m3dbnode[126899]: {"level":"info","ts":1582044722.5778766,"msg":"dynamic namespace registry updated to version","version":13}
Feb 18 16:52:02 dbnode-816706098-12-852752506 m3dbnode[126899]: {"level":"info","ts":1582044722.577958,"msg":"received update from kv namespace watch"}
Feb 18 16:52:02 dbnode-816706098-12-852752506 m3dbnode[126899]: {"level":"info","ts":1582044722.5780127,"msg":"updating database namespaces","adds":"[2d]","updates":"[]","removals":"[21d_dr, 500d_dr, 21d_dup_dr, 90d_dr]"}
Feb 18 16:57:10 dbnode-816706098-12-852752506 systemd[1]: m3dbnode.service: main process exited, code=killed, status=9/KILL
Feb 18 16:57:10 dbnode-816706098-12-852752506 systemd[1]: Unit m3dbnode.service entered failed state.
Feb 18 16:57:10 dbnode-816706098-12-852752506 systemd[1]: m3dbnode.service failed.
Feb 18 16:57:20 dbnode-816706098-12-852752506 systemd[1]: m3dbnode.service holdoff time over, scheduling restart.
Feb 18 16:57:20 dbnode-816706098-12-852752506 systemd[1]: Started "M3DB Timeseries Database".
Feb 18 16:57:20 dbnode-816706098-12-852752506 systemd[1]: Starting "M3DB Timeseries Database"...
Feb 18 16:57:20 dbnode-816706098-12-852752506 m3dbnode[69716]: 2020/02/18 16:57:20 Go Runtime version: go1.12.9
Feb 18 16:57:20 dbnode-816706098-12-852752506 m3dbnode[69716]: 2020/02/18 16:57:20 Build Version:      v0.15.0-rc.0
Feb 18 16:57:20 dbnode-816706098-12-852752506 m3dbnode[69716]: 2020/02/18 16:57:20 Build Revision:     8ffa7bed3
Feb 18 16:57:20 dbnode-816706098-12-852752506 m3dbnode[69716]: 2020/02/18 16:57:20 Build Branch:       master
Feb 18 16:57:20 dbnode-816706098-12-852752506 m3dbnode[69716]: 2020/02/18 16:57:20 Build Date:         2020-02-13-14:29:32
Feb 18 16:57:20 dbnode-816706098-12-852752506 m3dbnode[69716]: 2020/02/18 16:57:20 Build TimeUnix:     1581622172
