
List of issues causing CircleCI failures #1117

Closed
AndrejMitrovic opened this issue Aug 13, 2020 · 15 comments
Labels
C.General An issue which doesn't fit in the other categories (not blockchain related) type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense

Comments

@AndrejMitrovic
Contributor

AndrejMitrovic commented Aug 13, 2020

HTTPStatusException was not propagated properly (Fixed in #1109)


Properly handle HTTPStatusException (Fixed in #1106)


Request retry delays and timeouts might be too high (Fixed in #1118)

Some further ideas to improve CircleCI's success rate:

  • Reduce the retry_delay option. This is essentially the "sleeping time" between each time a request is retried. Currently it's set to 1.5 seconds, which I think is too long a wait between retries.
  • Increase the max_retries option so a node doesn't get prematurely banned.
  • Change the ban manager logging to use .info instead of .trace. Right now our loggers only emit info-level messages, but the ban manager still logs at trace level - so if a node does get banned we won't see it in our CI logs.
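Applied to the node's configuration, the first two ideas might look like the following hypothetical excerpt (the key names and values are illustrative, not necessarily the exact agora config schema):

```yaml
# Hypothetical config excerpt - key names are illustrative.
node:
  retry_delay: 250 msecs   # was 1.5 seconds; shorter sleep between retries
  max_retries: 20          # higher ceiling so a slow-starting peer is not banned prematurely
logging:
  level: Info              # ban manager events must be at least Info to show up in CI logs
```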

IP & Port when calling registerListener might be incorrect

As mentioned here: #1115 (comment)

However the following issue might be related:


listenHTTP is called too late: (Fixed in #1118)

node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.node.Validator] - start()
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.node.Validator] - startPeriodicDiscovery
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.node.Validator] - startPeriodicDiscovery inside task
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.network.NetworkManager] - Doing network discovery with peer keys: { _set: [ GDNODE4KTE7VQUHVBLXIGD7VEFY57X4XV547P72D37SDG7UEO7MWOSNY: true, GDNODE6ZXW2NNOOQIGN24MBEZRO5226LSMHGQA3MUAMYQSTJVR7XT6GH: true, GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM: true, GDNODE5T7TWJ2S4UQSTM7KDHU2HQHCJUXFYLPZDDYGXIBUAH3U3PJQC2: true, GDNODE7J5EUK7T6HLEO2FDUBWZEXVXHJO7C4AF5VZAKZENGQ4WR3IX2U: true ] }
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - About to connect to: http://node-3:3826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-3:3826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - About to connect to: http://node-4:4826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-4:4826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - About to connect to: http://node-0:1826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-0:1826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - About to connect to: http://node-5:5826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-5:5826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,08 Info [agora.network.NetworkManager] - About to connect to: http://node-6:6826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,08 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-6:6826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.network.NetworkManager] - About to connect to: http://node-7:7826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-7:7826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.node.Validator] - ~startPeriodicDiscovery
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.node.Runner] - About to listen to HTTP: 2826

Here the node attempts to connect to all other nodes, but only starts listening for requests after the first discover() call completes.

This comment is wrong:

https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Runner.d#L89

In fact this call is not asynchronous: it actually causes a context switch here: https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Validator.d#L207

Then the node waits 5 seconds before attempting network discovery again:
https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Validator.d#L212
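The fix for the startup ordering can be sketched as follows. This is a simplified sketch: setTimer and runTask are the real vibe.d primitives, while the class and method names here are illustrative, not the actual agora code.

```d
import vibe.core.core : setTimer;
import core.time : seconds;

class Validator
{
    void start ()
    {
        // runTask() would switch to the discovery fiber immediately, so
        // the listen call below would only run after the first discovery
        // round completes (or context-switches back). setTimer with a
        // zero timeout merely schedules the fiber, so the current fiber
        // reaches the listen call first and the port opens right away.
        setTimer(0.seconds, &this.startPeriodicDiscovery);

        this.startListening();  // illustrative wrapper around listenHTTP
    }

    private void startPeriodicDiscovery () { /* network discovery loop */ }
    private void startListening () { /* calls vibe.http.server.listenHTTP */ }
}
```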

@AndrejMitrovic AndrejMitrovic added the type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense label Aug 13, 2020
@AndrejMitrovic AndrejMitrovic added this to the 2. Validator milestone Aug 13, 2020
AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 13, 2020
This makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: bosagora#1117
@AndrejMitrovic
Contributor Author

AndrejMitrovic commented Aug 13, 2020

Some further ideas to improve CircleCI's success rate:

  • Do some performance measurements. Is there a particular reason why a request timed out? Or is a node's fiber stuck doing some heavy processing and stalling all other fibers? Etc.

AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 14, 2020
The runTask() call will spawn a new fiber,
suspend the calling fiber, and switch to
the new fiber.

Using setTimer with a zero timeout (a.k.a. schedule())
makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: bosagora#1117
AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 14, 2020
The runTask() call will spawn a new fiber
and switch to the new fiber immediately.
However we want to start listening for incoming
requests before doing network discovery.

Using setTimer with a zero timeout (a.k.a. schedule())
makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: bosagora#1117
AndrejMitrovic added a commit that referenced this issue Aug 14, 2020
The runTask() call will spawn a new fiber
and switch to the new fiber immediately.
However we want to start listening for incoming
requests before doing network discovery.

Using setTimer with a zero timeout (a.k.a. schedule())
makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: #1117
@AndrejMitrovic
Contributor Author

I think I see another issue:

https://github.com/bpfkorea/agora/blob/2f4efaa190d0d2a5c42b79d183d9c11222716976/source/agora/node/Validator.d#L210-L212

After the initial network discovery is complete, 5 seconds later the same set of keys will be passed to the node.

But in the NetworkManager we don't check whether we're already connected to the nodes with the given keys: https://github.com/bpfkorea/agora/blob/2f4efaa190d0d2a5c42b79d183d9c11222716976/source/agora/network/NetworkManager.d#L421
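A possible guard, sketched here with assumed field and type names (not the actual NetworkManager API):

```d
/// Sketch only: `peers` is assumed to map a validator's PublicKey to an
/// established NetworkClient; the real NetworkManager fields may differ.
void discover (Set!PublicKey required_keys)
{
    Set!PublicKey missing;
    foreach (key; required_keys)
        if (key !in this.peers)  // already connected => skip
            missing.put(key);

    if (missing.length == 0)
        return;  // nothing to do: all required validators are connected

    this.startDiscoveryTasks(missing);  // illustrative
}
```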

@AndrejMitrovic
Contributor Author

Another issue: Sometimes a FullNode rejects a block received from getBlocksFrom. Unfortunately the log line was emitted at trace level and therefore hidden from the integration tests:

log.trace("Rejected block: {}: {}", fail_reason, block.prettify());

@AndrejMitrovic
Contributor Author

Here one node tries to connect to another node and call getPublicKey:

node-3_1_6cb3b1c7873b | 2020-08-14T07:42:45.038341153Z 2020-08-14 07:42:45,37 Info [agora.network.NetworkManager] - ConnectionTask: getPublicKey to: http://node-5:5826

However the node at this port only starts listening 3 seconds later:

node-5_1_6d09f00939f4 | REST route: POST2020-08-14 07:42:48,732 Info [agora.node.Runner] - About to listen to HTTP: 5826

Now the question is: what happened to the timeouts?

@AndrejMitrovic
Contributor Author

I found what triggers the bug. Here's my debugging log:

# main.d failure for node 7 at 05:06:06
core.exception.AssertError@tests/system/source/main.d(42): Check condition failed after timeout of 1 minute and 600 attempts2020-Aug-18 05:06:06.941664Z: Node 6 (this is node-7_1) has not completed discovery after 1 minute.

# node 7 last message at 05:05:04 (therefore thread did not freeze)
node-7_1  | 2020-08-18T05:05:04.737464254Z 2020-08-18 05:05:04,733 Info [agora.network.NetworkClient] - Calling getPublicKey on http://node-7:7826

# node 7 still missing node 5 at 05:05:03,734:
node-7_1  | 2020-08-18T05:05:04.737445312Z 2020-08-18 05:05:03,734 Info [agora.network.NetworkManager] - Found validator (1 missing - [GDNODE5T7TWJ2S4UQSTM7KDHU2HQHCJUXFYLPZDDYGXIBUAH3U3PJQC2])

# node 5 opened port at 05:05:03,630
node-5_1  | 2020-08-18T05:05:03.630845791Z 2020-08-18 05:05:03,630 Info [agora.node.Runner] - About to listen to HTTP: 5826

# node 7 called public key on 05:05:02,703
node-7_1  | 2020-08-18T05:05:02.704400058Z 2020-08-18 05:05:02,703 Info [agora.network.NetworkClient] - Calling getPublicKey on http://node-5:5826
node-7_1  | 2020-08-18T05:05:02.726375356Z 2020-08-18 05:05:02,725 Info [agora.network.NetworkClient] - Exception calling getPublicKey on http://node-5:5826: object.Exception@submodules/vibe-core/source/vibe/core/net.d(229): Failed to connect to 192.168.144.7:5826: refused
node-7_1  | 2020-08-18T05:05:02.726378277Z 2020-08-18 05:05:02,725 Info [agora.network.NetworkClient] - Will retry getPublicKey to http://node-5:5826 in 50 ms

# node 7 tries again the same request at  05:05:02,777
node-7_1  | 2020-08-18T05:05:02.780841934Z 2020-08-18 05:05:02,777 Info [agora.network.NetworkClient] - Calling getPublicKey on http://node-5:5826

# However there was no exception / timeout for this second request. There are many other messages about other requests so the thread did not freeze.

tl;dr: The nodes each open their ports at different times (sometimes up to a 5-second difference, because of the way docker-compose works). When one node tries to connect to another node whose port is not open yet, the first request throws an exception, but the second request to this node never times out.

It seems that when this exception is thrown:

object.Exception@submodules/vibe-core/source/vibe/core/net.d(229): Failed to connect to 192.168.144.7:5826: refused

The first request to the given address fails with this exception; however, the second request does not throw this exception. The request doesn't time out either.


I need to add more logging and see what actually happens to the second request to the closed port. I think this is a vibe.d bug.
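A minimal repro sketch of the suspected behavior. connectTCP and the vibe.d entry points are real; the closed port and the expected outcome are assumptions based on the logs above. Note that the hang later turned out to involve resolveHost, so a hostname rather than a literal IP may be needed to actually reproduce it.

```d
import vibe.core.core : runTask, runApplication;
import vibe.core.net : connectTCP;
import vibe.core.log : logInfo;

void main ()
{
    runTask({
        // Port 5826 is assumed to be closed, mirroring the CI scenario
        // where node-5 has not opened its listening port yet.
        foreach (attempt; 1 .. 3)
        {
            try
            {
                auto conn = connectTCP("127.0.0.1", 5826);
                conn.close();
            }
            catch (Exception ex)
                // Per the logs, only the first attempt reaches this
                // handler ("Failed to connect: refused"); the second
                // attempt neither throws nor times out.
                logInfo("Attempt %s failed: %s", attempt, ex.msg);
        }
    });
    runApplication();
}
```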

@hewison-chris
Contributor

> The first request to the given address fails with this exception; however, the second request does not throw this exception. The request doesn't time out either.
>
> I need to add more logging and see what actually happens to the second request to the closed port. I think this is a vibe.d bug.

So if we waited 10 seconds before trying to connect to any nodes, we probably would not fail on the first request.

@AndrejMitrovic
Contributor Author

Yes but we have request retries in place, so this should all be handled gracefully already.

@hewison-chris
Copy link
Contributor

I just meant to prove your theory that second attempts are not being handled correctly. Although I guess we should add a test for the case where a node comes online after the first check has happened. That way we can reproduce the error without re-running until it fails.

@AndrejMitrovic
Contributor Author

Yes it would be good to reproduce it properly.

@AndrejMitrovic
Contributor Author

This call never returns on the second request to the node: https://github.com/vibe-d/vibe-core/blob/b417214e325a7fb684817b4ae7b218eec46c11c1/source/vibe/core/net.d#L35

I've noticed that resolveHost does not use any timeout here: https://github.com/vibe-d/vibe-core/blob/b417214e325a7fb684817b4ae7b218eec46c11c1/source/vibe/core/net.d#L69

It's odd that a timeout is not used there but is actually used in the call to connectTCPWithTimeout:

https://github.com/bpfkorea/vibe.d/blob/35079d4cf348e36785783480535ff186f2e4ef02/http/vibe/http/client.d#L703-L706

It seems to me that resolveHost should also take a timeout parameter.
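The proposed shape of the fix might look like this. This is a sketch: the Duration parameter on resolveHost is the proposed addition, not the vibe-core signature at the time.

```d
import std.socket : AddressFamily;
import vibe.core.net : NetworkAddress;
import core.time : Duration, seconds;

/// Sketch of the proposed signature: resolveHost taking an explicit
/// timeout, mirroring how connectTCPWithTimeout already bounds the
/// connect step. The timeout would be threaded down to the event
/// driver's DNS lookup.
NetworkAddress resolveHost (string host,
    AddressFamily family = AddressFamily.UNSPEC,
    bool use_dns = true,
    Duration timeout = 10.seconds);  // proposed parameter
```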

@AndrejMitrovic
Contributor Author

AndrejMitrovic commented Aug 18, 2020

I've asked Sonke: vibe-d/vibe.d#2347 (comment)

Edit: He agrees that we should add a timeout.

@AndrejMitrovic
Contributor Author

This commit changes resolveHost to use a timeout: d416462

It seems to make CircleCI go green fairly reliably.

However I still need to figure out a better workaround for this: AndrejMitrovic/eventcore@862b5d4...2b66a20#diff-5515d1895d07303e45a19fac726f59ec. Without this workaround an assertion is triggered here: https://github.com/AndrejMitrovic/eventcore/blob/2b66a20961d3743bfeaeb08e87db71d970c653b7/source/eventcore/drivers/posix/dns.d#L76 - which means the handle returned from allocateHandle is reused. My workaround just makes sure a unique handle is always returned.

Of course the underlying problem is still there, namely resolveHost seems to block the fiber forever.

There is also still an occasional failure on the line assert(getHeight == height);, but that seems to be a different problem: FullNode doesn't accept an externalized block (#1117 (comment)).

@AndrejMitrovic
Contributor Author

I opened some PRs:

However it might take some time until these are resolved and merged. Maybe we should just pause running the full system integration tests for the time being.

AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 19, 2020
This still does a basic build of Agora and runs -help
on it, but stops short of testing the nodes until
the issues in bosagora#1117 are resolved.
@bpalaggi bpalaggi added the C.General An issue which doesn't fit in the other categories (not blockchain related) label Sep 1, 2020
@Geod24
Collaborator

Geod24 commented Oct 26, 2020

Are we still seeing failures here?

@Geod24
Collaborator

Geod24 commented Jan 28, 2021

CircleCI has been stable for a few months, closing.

@Geod24 Geod24 closed this as completed Jan 28, 2021