Space Leak in deployed nodes #370

Closed
disassembler opened this issue Dec 3, 2019 · 4 comments · Fixed by input-output-hk/iohk-monitoring-framework#479

@disassembler

We're seeing a space leak in deployed nodes. The nodes get OOM killed as can be seen in the memory graph here: https://monitoring.awstest.iohkdev.io/grafana/d/Oe0reiHef/cardano-application-metrics-v2?orgId=1&refresh=1m

@tatyanavych

Staging snapshot shared on slack https://files.slack.com/files-pri/T0N639Z4N-FR64QTEJH/staging-gc-bytes.png 2 Dec 2019

@mrBliss

mrBliss commented Dec 9, 2019

I have looked into this and have found the following.

  • I can reproduce the space leak by running just a node that connects to proxies. Just run scripts/mainnet.sh; there is no need to run the byron-proxy or cardano-explorer yourself.
  • By limiting the stack size to 2MB, I get a stack overflow after ~1 min on my machine. Exact commit: 3c7387d
  • When I rip out logging & monitoring, I no longer get a stack overflow, even after running for more than 5 min. @karknu also ran it for 13 min without a stack overflow. Exact commit: 3406ffe
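
For readers who want to try the stack-limit trick from the second bullet on a small program first, here is a minimal sketch. It assumes the limit is set with the GHC RTS flag `-K` (the thread does not say how the 2MB limit was applied), and `lazySum` / `strictSum` are illustrative names, not code from cardano-node.

```haskell
module Main where

import Data.List (foldl')

-- Lazy left fold. When compiled without optimisation this builds a long
-- chain of (+) thunks, and evaluating that chain consumes stack space.
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- Strict left fold. The accumulator is forced at every step, so the
-- stack usage stays constant.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

-- Assumed build and run commands (file name is hypothetical):
--   ghc -O0 -rtsopts StackDemo.hs
--   ./StackDemo +RTS -K2m -RTS
main :: IO ()
main = do
  print (strictSum [1 .. 5000000])
  -- Under a 2MB stack limit and -O0 this call is expected to abort
  -- with a stack overflow instead of quietly growing memory.
  print (lazySum [1 .. 5000000])
```

Recent GHC versions allow the stack to grow to a large fraction of physical memory by default, so without an explicit limit a stack leak like this tends to show up as steadily growing memory rather than as an early stack overflow.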

My conclusion: the memory leak is in the logging/monitoring code, not network or consensus.

I think the right people from the logging/monitoring team should investigate this further. For example, if the leak is inside a tracer, they could start by disabling some tracers until they find the one leaking memory.
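
The "disable tracers until the leak disappears" approach could be sketched as follows. This is only an illustration: the `Tracer` type here is a local stand-in defined in the snippet, not the monitoring framework's API, and `selectTracer` and the tracer names are hypothetical.

```haskell
module Main where

import Data.IORef (modifyIORef', newIORef, readIORef)

-- Local stand-in for a tracer: a function that consumes one message.
newtype Tracer m a = Tracer { runTracer :: a -> m () }

-- A tracer that drops every message.
nullTracer :: Applicative m => Tracer m a
nullTracer = Tracer (\_ -> pure ())

-- Replace a named tracer with nullTracer if its name is on the
-- "disabled" list, otherwise keep it unchanged.
selectTracer :: Applicative m => [String] -> String -> Tracer m a -> Tracer m a
selectTracer disabled name tracer
  | name `elem` disabled = nullTracer
  | otherwise            = tracer

main :: IO ()
main = do
  hits <- newIORef (0 :: Int)
  let counting  = Tracer (\_ -> modifyIORef' hits (+ 1))
      disabled  = ["chain-sync"]  -- hypothetical tracer name under suspicion
      chainSync = selectTracer disabled "chain-sync" counting
      mempool   = selectTracer disabled "mempool" counting
  runTracer chainSync "header"  -- dropped, tracer is disabled
  runTracer mempool "tx"        -- still counted
  readIORef hits >>= print
```

Re-running the node with a different `disabled` list each time narrows the search down to the tracer whose removal makes the stack overflow go away.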

@CodiePP

CodiePP commented Dec 9, 2019

[image: cardano-node]
I guess that the memory leak is related to some decoding into CBOR.
Please investigate further.

@mrBliss

mrBliss commented Dec 10, 2019

Note that the problem DevOps diagnosed is that we're leaking stack space, which should not be confused with memory space in general (the former is a subset of the latter; the latter also includes heap space). This measurement is too short to say anything about the slow but steady leaking of stack space.

mrBliss added a commit to input-output-hk/iohk-monitoring-framework that referenced this issue Dec 11, 2019
The local `qProc` function in `Cardano.BM.Backend.Switchboard` loops by
calling itself recursively, passing in the same `MVar MessageCounter` each
time. However, `MessageCounter` was missing a bang on its `mcCountersMap`
field, which contains `HM.HashMap Text Word64`. Even though the `HashMap` is a
strict one, if you don't force it, you're still accumulating thunks. And as
the `MVar` containing the `MessageCounter` was passed recursively, this
resulted in a stack overflow instead of running out of (heap) memory.

Fix it by adding the missing bang.

This should fix IntersectMBO/cardano-node#370.
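
The effect of the missing bang described in this commit message can be illustrated with a small, self-contained sketch. It is not the actual Switchboard code: `Counter`, `cMap`, `bump` and `loop` are hypothetical names, and `Data.Map.Strict` with `String` keys stands in for `HM.HashMap Text Word64` so that the example only needs packages that ship with GHC.

```haskell
module Main where

import Control.Concurrent.MVar (MVar, modifyMVar_, newMVar, readMVar)
import qualified Data.Map.Strict as M
import Data.Word (Word64)

-- Hypothetical stand-in for MessageCounter. The bang on the field means
-- the map is forced whenever a Counter is evaluated to WHNF. Without it,
-- the field can hold an ever-growing chain of unevaluated inserts, even
-- though the map type itself is a strict one.
data Counter = Counter
  { cMap :: !(M.Map String Word64)
  }

-- Increment the count stored under one key.
bump :: String -> Counter -> Counter
bump key (Counter m) = Counter (M.insertWith (+) key 1 m)

-- Loop that keeps updating the same MVar, loosely modelled on the
-- recursive loop described above.
loop :: MVar Counter -> Int -> IO ()
loop _ 0 = pure ()
loop var n = do
  -- ($!) forces the new Counter to WHNF before it is stored. With the
  -- strict field this also forces the map; with a lazy field it would not.
  modifyMVar_ var (\c -> pure $! bump "msg" c)
  loop var (n - 1)

main :: IO ()
main = do
  var <- newMVar (Counter M.empty)
  loop var 1000000
  Counter m <- readMVar var
  print (M.lookup "msg" m)
```

Removing the `!` from `cMap` and running the program with a small stack limit is one way to observe the difference, since the final lookup then has to unwind the whole chain of pending inserts at once.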
iohk-bors bot added a commit to input-output-hk/iohk-monitoring-framework that referenced this issue Dec 11, 2019
479: Fix stack overflow by adding a missing bang r=dcoutts a=mrBliss

Co-authored-by: Thomas Winant <[email protected]>
iohk-bors bot added a commit to input-output-hk/iohk-monitoring-framework that referenced this issue Dec 11, 2019
479: Fix stack overflow by adding a missing bang r=CodiePP a=mrBliss

Co-authored-by: Thomas Winant <[email protected]>
CodiePP pushed a commit to input-output-hk/iohk-monitoring-framework that referenced this issue Dec 11, 2019
Signed-off-by: Alexander Diemand <[email protected]>
vhulchenko-iohk added this to the S2 2019-12-19 milestone Dec 12, 2019

5 participants