List of issues causing CircleCI failures #1117
Comments
This makes the node open a listening port right away, rather than waiting for the first discovery round to complete. It increases the node's connectivity to other nodes which may want to establish a connection, and consequently reduces CI failures. Part of: bosagora#1117
The runTask() call spawns a new fiber, suspends the calling fiber, and switches to the new fiber immediately. However, we want to start listening for incoming requests before doing network discovery. Using setTimer with a zero timeout (a.k.a. schedule()) makes the node open a listening port right away, rather than waiting for the first discovery round to complete. It increases the node's connectivity to other nodes which may want to establish a connection, and consequently reduces CI failures. Part of: #1117
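Below is a minimal sketch of that ordering, assuming vibe.d's runTask/setTimer API; discoverNetwork() and startListening() are hypothetical placeholders, not Agora's actual functions.

    import core.time : seconds;
    import vibe.core.core : runTask, setTimer;

    void startNode ()
    {
        void discoverNetwork () @safe nothrow { /* first discovery round (placeholder) */ }
        void startListening () @safe { /* open the listening port (placeholder) */ }

        // With runTask() the new fiber runs immediately and the calling fiber is
        // suspended, so the listening port only opens once the first discovery
        // round yields:
        //
        //     runTask(&discoverNetwork);
        //     startListening();

        // Scheduling discovery with a zero timeout keeps the current fiber
        // running, so the port opens first and discovery starts on the next
        // event-loop iteration:
        setTimer(0.seconds, &discoverNetwork);
        startListening();
    }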
I think I see another issue: after the initial network discovery is complete, 5 seconds later the same set of keys will be passed to the node. But in the NetworkManager we don't check if we're already connected to the nodes with the given keys: https://github.com/bpfkorea/agora/blob/2f4efaa190d0d2a5c42b79d183d9c11222716976/source/agora/network/NetworkManager.d#L421
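A hypothetical sketch of the missing check follows: before starting a connection attempt for a key, skip keys that already have an established connection. PublicKey, connected_peers, and connectTo are placeholders, not the actual NetworkManager API.

    import std.algorithm : filter;

    struct PublicKey { string value; }

    bool[PublicKey] connected_peers;   // keys we already hold a connection to

    void connectTo (PublicKey key) { /* establish the connection (placeholder) */ }

    void onDiscoveryRound (PublicKey[] discovered_keys)
    {
        // Only attempt connections for keys that are not connected yet
        foreach (key; discovered_keys.filter!(k => k !in connected_peers))
        {
            connectTo(key);
            connected_peers[key] = true;
        }
    }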
Another issue: sometimes a FullNode rejects a block received from getBlocksFrom. Unfortunately the log item was a trace call and hidden from the integration tests: log.trace("Rejected block: {}: {}", fail_reason, block.prettify());
Here one node tries to connect to another node and call
However the node at this port only starts listening 3 seconds later:
Now the question is what happened to the timeouts.
I found what triggers the bug. Here's my debugging log:
tl;dr: The nodes each open their ports at different times (sometimes up to a 5 second difference - because of the way docker-compose works). Then when one node tries to connect to another node whose port is not opened yet the first request will throw an exception, but the second request to this node will not time out. It seems that when this exception is thrown:
The first request to the given address failed with this exception, however the second request does not throw this exception. The request doesn't time out either. I need to add more logging and see what actually happens to the second request to the closed port. I think this is a vibe.d bug.
So if we waited for 10 seconds before trying to connect to any nodes, we probably would not fail on the first request.
Yes, but we have request retries in place, so this should all be handled gracefully already.
I just meant to prove your theory that second attempts are not being handled correctly. Although I guess we should add a test for the case where a node comes online after the first check has happened. This way we can reproduce the error without running multiple times until it fails.
Yes, it would be good to reproduce it properly.
This call never returns on the second request to the node: https://github.com/vibe-d/vibe-core/blob/b417214e325a7fb684817b4ae7b218eec46c11c1/source/vibe/core/net.d#L35. I've noticed it's odd that a timeout is not used there, but one is actually used in the call to … It seems to me that …
I've asked Sönke: vibe-d/vibe.d#2347 (comment). Edit: he agrees that we should add a timeout.
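A hedged sketch of what a bounded connection attempt could look like, assuming a vibe-core version where connectTCP accepts a timeout argument (such an argument was added after this discussion); the host, port, and 5-second timeout are illustrative values only.

    import core.time : seconds;
    import vibe.core.log : logInfo;
    import vibe.core.net : connectTCP;

    void tryConnect (string host, ushort port)
    {
        try
        {
            // The trailing arguments mirror the defaults of the newer signature:
            // bind interface, bind port, and the connect timeout.
            auto conn = connectTCP(host, port, null, 0, 5.seconds);
            scope (exit) conn.close();
            logInfo("Connected to %s:%s", host, port);
        }
        catch (Exception ex)
        {
            // With a timeout in place, a peer whose port is not open yet can
            // no longer hang the requesting fiber indefinitely.
            logInfo("Connection to %s:%s failed: %s", host, port, ex.msg);
        }
    }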
This commit changes … It seems to make CircleCI go green fairly reliably. However, I still need to figure out a better workaround for this: AndrejMitrovic/eventcore@862b5d4...2b66a20#diff-5515d1895d07303e45a19fac726f59ec. Without this workaround an assertion is triggered here: https://github.com/AndrejMitrovic/eventcore/blob/2b66a20961d3743bfeaeb08e87db71d970c653b7/source/eventcore/drivers/posix/dns.d#L76, which means the handle that was returned from … Of course the underlying problem is still there, namely … And also there is still sometimes a failure on the line …
I opened some PRs:
However it might take some time until these are resolved & merged. Maybe we should just pause running the full system integration tests for the time being.
This still does a basic build of Agora and runs -help on it, but stops short of testing the nodes until the issues in bosagora#1117 are resolved.
Are we still seeing failures here?
CircleCI has been stable for a few months, closing.
- HTTPStatusException was not propagated properly (Fixed in #1109)
- Properly handle HTTPStatusException (Fixed in #1106)
- Request retry delays and timeouts might be too high (Fixed in #1118)
Some further ideas to improve the CircleCI success rate (a sketch of the retry behaviour follows this list):
- Lower the retry_delay option. This is essentially the “sleeping time” between each time a request is retried. Currently it's set to 1.5 seconds. I think this is too long of a waiting time between retries.
- Increase the max_retries option so a node doesn't get prematurely banned.
- Make the ban manager log with .info instead of .trace. Right now our loggers only emit info-level messages, but the ban manager still uses the trace level, so if a node does get banned we won't see it in the logs in our CI.
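A hedged sketch of the retry behaviour the options above control, using vibe.d's sleep and logging; the defaults shown (10 retries, 500 ms delay) and requestPeer() are illustrative stand-ins, not Agora's actual configuration or API.

    import core.time : msecs;
    import vibe.core.core : sleep;
    import vibe.core.log : logInfo;

    struct RetryConfig
    {
        size_t max_retries = 10;        // attempts before giving up / banning
        uint retry_delay_msecs = 500;   // pause between attempts
    }

    bool requestPeer () { return false; }   // placeholder: one request attempt

    bool requestWithRetries (RetryConfig cfg)
    {
        foreach (attempt; 0 .. cfg.max_retries)
        {
            if (requestPeer())
                return true;

            // Log at info level so retries (and eventual bans) show up in CI logs
            logInfo("Request attempt %s failed, retrying in %s ms",
                attempt + 1, cfg.retry_delay_msecs);
            sleep(cfg.retry_delay_msecs.msecs);
        }
        return false;
    }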
- IP & Port when calling registerListener might be incorrect, as mentioned here: #1115 (comment)
However the following issue might be related:
- listenHTTP is called too late (Fixed in #1118). Here the node attempts to connect with all other nodes, but only starts listening to requests after the first discover() call completes.
This comment is wrong:
https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Runner.d#L89
In fact this call is not asynchronous and it actually causes a context switch here: https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Validator.d#L207
Then the node waits 5 seconds before attempting network discovery again:
https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Validator.d#L212
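A hedged sketch of the startup order the fix aims for, using vibe.d's listenHTTP and setTimer; discover(), the port number, and the trivial request handler are placeholders rather than Agora's actual code, and the 5-second interval mirrors the delay mentioned in the linked Validator.d lines.

    import core.time : seconds;
    import vibe.core.core : runApplication, setTimer;
    import vibe.http.server : HTTPServerRequest, HTTPServerResponse,
        HTTPServerSettings, listenHTTP;

    void main ()
    {
        void discover () @safe nothrow { /* one network discovery round (placeholder) */ }

        auto settings = new HTTPServerSettings;
        settings.port = 2826;   // example port

        // Open the listening port first so other nodes can connect right away ...
        listenHTTP(settings, (HTTPServerRequest req, HTTPServerResponse res) @safe {
            res.writeBody("agora node");
        });

        // ... then kick off discovery without blocking startup, and repeat it
        // every 5 seconds afterwards.
        setTimer(0.seconds, &discover);
        setTimer(5.seconds, &discover, true);

        runApplication();
    }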