Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConnectionManager Timeouts not enforced within reasonable bounds #3553

Closed
bolt12 opened this issue Dec 20, 2021 · 2 comments · Fixed by #3532
Closed

ConnectionManager Timeouts not enforced within reasonable bounds #3553

bolt12 opened this issue Dec 20, 2021 · 2 comments · Fixed by #3532
Assignees
Labels
connection-manager Issues / PRs related to connection-manager io-sim-discovered Issue discovered by IOSim test-failure a test failure

Comments

@bolt12
Copy link
Contributor

bolt12 commented Dec 20, 2021

From working on #3532 :

ouroboros-network-framework
Ouroboros.Network.Server2
connection_manager_timeouts_enforced: FAIL (1935.35s)
*** Failed! Falsified (after 60 tests and 126 shrinks):
0
ArbDataFlow Duplex
MultiNodeScript {mnsEvents = [StartClient 0s (TestAddr {unTestAddr = TestAddress 61}),StartClient 0s (TestAddr {unTestAddr = TestAddress 6}),InboundConnection 0s (TestAddr {unTestAddr = TestAddress 6}),InboundConnection 0.223744s (TestAddr {unTestAddr = TestAddress 61}),StartClient 0.51s (TestAddr {unTestAddr = TestAddress 88}),InboundMiniprotocols 0.26s (TestAddr {unTestAddr = TestAddress 61}) (Bundle {withHot = WithHot [24,24,24,24,24,0,0,24,0], withWarm = WithWarm [24,24,24,24,24,24,24,0,24], withEstablished = WithEstablished [24,24,24,0,0,24,24,24]})], mnsAttenuationMap = fromList [(TestAddr {unTestAddr = TestAddress 61},Script (AbsBearerInfo {abiConnectionDelay = SmallDelay, abiInboundAttenuation = NoAttenuation FastSpeed, abiOutboundAttenuation = SpeedAttenuation NormalSpeed (Time 8s) 568.233333333333s, abiInboundWriteFailure = Just 58, abiOutboundWriteFailure = Just 27, abiSDUSize = SmallSDU} :| []))]}


Connection transition trace:
(Time 0.323744s,TransitionTrace @(TestAddress 61) (UnknownConnectionSt → UnnegotiatedSt Inbound))
(Time 0.324009783447s,TransitionTrace @(TestAddress 61) (UnnegotiatedSt Inbound → InboundIdleSt Unidirectional))
(Time 0.993989338567s,TransitionTrace @(TestAddress 61) (InboundIdleSt Unidirectional → InboundSt Unidirectional))
(Time 1.000000133456s,TransitionTrace @(TestAddress 61) (InboundSt Unidirectional → TerminatedSt))
(Time 121.000000133456s,TransitionTrace @(TestAddress 61) (TerminatedSt → UnknownConnectionSt))

Connection transition trace:
(Time 1s,TransitionTrace @(TestAddress 6) (UnknownConnectionSt → UnnegotiatedSt Inbound))
(Time 1.000265783447s,TransitionTrace @(TestAddress 6) (UnnegotiatedSt Inbound → InboundIdleSt Unidirectional))
(Time 121.000000133456s,TransitionTrace @(TestAddress 6) (InboundIdleSt Unidirectional → TerminatedSt))
(Time 121.000000133456s,TransitionTrace @(TestAddress 6) (TerminatedSt → UnknownConnectionSt))

At transition: InboundIdleSt Unidirectional → TerminatedSt
First happened at: Time 1.000265783447s
Second happened at: Time 121.000000133456s
Should only take: 5.5s, but took:119.999734350009s
Use --quickcheck-replay=589091 to reproduce.

Looking at the logs I see the timer being registered and an EventTimerExpired for it appears at the correct time. It then proceeds triggering unregisterInboundConnection correctly at the correct time. But unregisterInboundConnection blocks on the ConnectionManager stateVar .

A bunch of time later it gets an AsyncCancelled exception and the cleanup function from forkConnecionHandler runs, which also blocks on the stateVar .

For timeout enforcement sake we should already unblocked ages ago, but there are multiple threads blocked on stateVar and the one that unblocks first is none of the two we needed. It appears that the threads that hold the lock block waiting on timers too and due to an unfortunate order of events we end up going way beyond the deadline.

The problem seems to be just concurrency and lengthy locks. From looking at the code this is due to the use of readTMVar on unregisterInboundConnection, which is blocking.

NOTE: There are other instances of this failure for other transitions all with the same root cause.

@bolt12 bolt12 added connection-manager Issues / PRs related to connection-manager io-sim-discovered Issue discovered by IOSim peer2peer test-failure a test failure labels Dec 20, 2021
@bolt12
Copy link
Contributor Author

bolt12 commented Dec 22, 2021

I found the source for my particular failing test.

It seems that due to attenuation the AttenuatedChannel resource vanished and one connection error-ed, which triggers the cleanup function from forkConnectionHandler . This function puts the connection state in TerminatedSt and closes the socket. However this cleanup function holds the lock for the ConnectionManager stateVar for the whole code block (see modifyTMVar here). Right after tracing the transition to TerminatedSt it proceeds closing the socket as I said, but this function is blocking! It can block while closing the attenuated channels for very long time (see here)! This will only block for a long time if the channel is already closed, which I believe to be the case here since we got a resource vanished error.

@coot coot changed the title ConnectionManager Timeouts not enforced whitin reasonable bounds ConnectionManager Timeouts not enforced within reasonable bounds Jan 10, 2022
@coot
Copy link
Contributor

coot commented Jan 10, 2022

In the real implementation we reset the socket in a non-blocking way (SO_LINGER set with 0 interval); A quick fix is to make the attenuated channel comply, a nicer fix would be to teach the simulated snockets the SO_LINGER option.

@bolt12 bolt12 linked a pull request Jan 11, 2022 that will close this issue
@coot coot moved this to In Progress in Ouroboros Network Feb 2, 2022
@iohk-bors iohk-bors bot closed this as completed in #3532 Feb 3, 2022
Repository owner moved this from In Progress to Done in Ouroboros Network Feb 3, 2022
Repository owner moved this from In Progress to Done in P2P Connection Manager Verification Feb 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
connection-manager Issues / PRs related to connection-manager io-sim-discovered Issue discovered by IOSim test-failure a test failure
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants