Skip to content
This repository has been archived by the owner on Apr 5, 2024. It is now read-only.

Fix deadlock in CLA manager #40

Closed
wants to merge 1 commit into from

Conversation

CryptoCopter
Copy link
Member

While using dtn7-go in a large-ish setup, we found that the daemon would regularly lock up completely. We eventually traced the problem to the interaction between the cla-manager, manager-elem and cla itself (in our case mtcp, but the issue would potentially exist for any cla with synchronisation problems).

In the end it comes down to the fact that both unbuffered channel writes and the respective Close()-methods of the manager-elem and cla are blocking. The created a situation where the manager would call the manager-elem's Close()-method and block, while the manager-elem could not react to it, since it was blocked by a channel send to the manager which would also never resolve - thus a global deadlock.

Out first-order solution to this issue was to make the channel-writes for the reporting channels non-blocking, using a select-guard. In order to prevent message loss if there is not currently someone listening on the channel, we turned the channels from unbuffered to buffered channels (sizer of the buffer configurable using the ReportChannelBuffer const in manager.go.

C&C welcome.

We did this by making the reporting of status messages non-blocking
@CryptoCopter CryptoCopter added the bug Something isn't working label May 22, 2021
@CryptoCopter CryptoCopter marked this pull request as draft May 22, 2021 18:37
@oxzi
Copy link
Member

oxzi commented May 24, 2021

First, thanks a lot for all the work you have put into this.

However, the tests are failing with your changes. I even re-run them, but for both Go 1.13 as well as Go 1.16 the following happens:

--- FAIL: TestImplNetwork (289.52s)
    --- FAIL: TestImplNetwork/WebSocket-64-1-1024 (28.06s)
        impl_test.go:190: listener received 44 messages instead of 64

@CryptoCopter
Copy link
Member Author

CryptoCopter commented May 27, 2021

The error occurs only for the last testcase {"WebSocket", 64, 1, 1024, mkWsListener, mkWsClient}, when you comment that one out, the test passes (at least it did so on my machine...).

As to why this might be happening: in the test logs you can find many instances of this error message. outChnl full or no consumer. So, the bundle is received and passed to the Manager, but when it tries to write the event to its outChnl, the channel is full. This is probably due to the goroutine which is spawned as part of handleListener being unable to keep up with processing the events coming in.

If you increase the buffer size from 100 to 200, the test now passes, which also points to it being an issue of excessive load.

As to why this is happening for the test case with 64 WebSocket CLAs but not for 64 TCP CLAs, I honestly have no idea...

adur1990 pushed a commit that referenced this pull request Aug 16, 2021
As can be seen in recent issues and PRs (#39, #40, #41, #42, #43, #45), there
is an issue in dtn7-go that results in a deadlock after some time, making it
impossible to sent bundles. To be able to test and debug this issue, some more
extensive tests are required. Thus, we developed a test infrastructure to
simulate multiple virtual nodes using the
[CORE emulator](https://github.com/coreemu/core).
This tests environment is located in the
[dtn7-playground](https://github.com/dtn7/dtn7-playground) repository. To be
able to trigger these tests, this PR adds a workflow file, that uses the
[repository_dispatch event](https://docs.github.com/en/actions/reference/events-that-trigger-workflows#repository_dispatch)
sent to the dtn7-playground repo.
oxzi pushed a commit that referenced this pull request Aug 16, 2021
* Workflow for extended tests

As can be seen in recent issues and PRs (#39, #40, #41, #42, #43, #45), there
is an issue in dtn7-go that results in a deadlock after some time, making it
impossible to sent bundles. To be able to test and debug this issue, some more
extensive tests are required. Thus, we developed a test infrastructure to
simulate multiple virtual nodes using the
[CORE emulator](https://github.com/coreemu/core).
This tests environment is located in the
[dtn7-playground](https://github.com/dtn7/dtn7-playground) repository. To be
able to trigger these tests, this PR adds a workflow file, that uses the
[repository_dispatch event](https://docs.github.com/en/actions/reference/events-that-trigger-workflows#repository_dispatch)
sent to the dtn7-playground repo.

* Fix copyright header

Co-authored-by: Artur Sterz <[email protected]>
@pulsejet
Copy link

Please fix/merge this ... this great library is basically useless without this patch

@CryptoCopter
Copy link
Member Author

Sooo, it appears that this was actually an issue with the network-emulator we were using... since we haven't seen this problem in a while. dtnd appears to run stable without deadlocks under high load.

@CryptoCopter CryptoCopter deleted the deadlock-fix-upstream branch March 22, 2023 10:23
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants