
List of issues causing CircleCI failures #1117

Closed
AndrejMitrovic opened this issue Aug 13, 2020 · 15 comments
Labels
C.General An issue which doesn't fit in the other categories (not blockchain related) type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense

Comments

@AndrejMitrovic
Contributor

AndrejMitrovic commented Aug 13, 2020

HTTPStatusException was not propagated properly (Fixed in #1109)


Properly handle HTTPStatusException (Fixed in #1106)


Request retry delays and timeouts might be too high (Fixed in #1118)

Some further ideas to improve CircleCI's success rate:

  • Reduce the retry_delay option. This is essentially the "sleeping time" between each time a request is retried. Currently it's set to 1.5 seconds, which I think is too long a wait between retries.
  • Increase the max_retries option so a node doesn't get prematurely banned.
  • Change the ban manager logging to use .info instead of .trace. Right now our loggers only emit info-level messages, but the ban manager still logs at trace level - so if a node does get banned we won't see it in our CI logs.
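Applied to the node's configuration, the first two ideas might look like the following hypothetical excerpt (the key names and values are illustrative, not necessarily the exact agora config schema):

```yaml
# Hypothetical config excerpt - key names are illustrative.
node:
  retry_delay: 250 msecs   # was 1.5 seconds; shorter sleep between retries
  max_retries: 20          # higher ceiling so a slow-starting peer is not banned prematurely
logging:
  level: Info              # ban manager events must be at least Info to show up in CI logs
```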

IP & Port when calling registerListener might be incorrect

As mentioned here: #1115 (comment)

However the following issue might be related:


listenHTTP is called too late: (Fixed in #1118)

node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.node.Validator] - start()
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.node.Validator] - startPeriodicDiscovery
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.node.Validator] - startPeriodicDiscovery inside task
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,996 Info [agora.network.NetworkManager] - Doing network discovery with peer keys: { _set: [ GDNODE4KTE7VQUHVBLXIGD7VEFY57X4XV547P72D37SDG7UEO7MWOSNY: true, GDNODE6ZXW2NNOOQIGN24MBEZRO5226LSMHGQA3MUAMYQSTJVR7XT6GH: true, GDNODE3EWQKF33TPK35DAQ3KXAYSOT4E4ACDOVJMDZQDVKP66IMJEACM: true, GDNODE5T7TWJ2S4UQSTM7KDHU2HQHCJUXFYLPZDDYGXIBUAH3U3PJQC2: true, GDNODE7J5EUK7T6HLEO2FDUBWZEXVXHJO7C4AF5VZAKZENGQ4WR3IX2U: true ] }
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - About to connect to: http://node-3:3826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-3:3826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - About to connect to: http://node-4:4826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:33,997 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-4:4826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - About to connect to: http://node-0:1826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-0:1826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - About to connect to: http://node-5:5826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,06 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-5:5826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,08 Info [agora.network.NetworkManager] - About to connect to: http://node-6:6826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,08 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-6:6826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.network.NetworkManager] - About to connect to: http://node-7:7826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.network.NetworkManager] - - getPublicKey to: http://node-7:7826
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.node.Validator] - ~startPeriodicDiscovery
node-2_1_d220995f0bb5 | 2020-08-13 12:13:34,15 Info [agora.node.Runner] - About to listen to HTTP: 2826

Here the node attempts to connect to all other nodes, but only starts listening for requests after the first discover() call completes.

This comment is wrong:

https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Runner.d#L89

In fact this call is not asynchronous: it actually causes a context switch here: https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Validator.d#L207

Then the node waits 5 seconds before attempting network discovery again:
https://github.com/bpfkorea/agora/blob/331e51895b9faff1b591c2dcce9b93611d7a06a2/source/agora/node/Validator.d#L212
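The fix for the startup ordering can be sketched as follows. This is a simplified sketch: setTimer and runTask are the real vibe.d primitives, while the class and method names here are illustrative, not the actual agora code.

```d
import vibe.core.core : setTimer;
import core.time : seconds;

class Validator
{
    void start ()
    {
        // runTask() would switch to the discovery fiber immediately, so
        // the listen call below would only run after the first discovery
        // round completes (or context-switches back). setTimer with a
        // zero timeout merely schedules the fiber, so the current fiber
        // reaches the listen call first and the port opens right away.
        setTimer(0.seconds, &this.startPeriodicDiscovery);

        this.startListening();  // illustrative wrapper around listenHTTP
    }

    private void startPeriodicDiscovery () { /* network discovery loop */ }
    private void startListening () { /* calls vibe.http.server.listenHTTP */ }
}
```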

@AndrejMitrovic AndrejMitrovic added the type-bug Things don't work as they were intended to, or the way they were intended to work doesn't make sense label Aug 13, 2020
@AndrejMitrovic AndrejMitrovic added this to the 2. Validator milestone Aug 13, 2020
AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 13, 2020
This makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: bosagora#1117
@AndrejMitrovic
Contributor Author

AndrejMitrovic commented Aug 13, 2020

Some further ideas to improve CircleCI's success rate:

  • Do some performance measurements. Is there a particular reason why a request timed out? Or is a node's fiber stuck doing some heavy processing and stalling all other fibers? Etc.

AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 14, 2020
The runTask() call will spawn a new fiber,
suspend the calling fiber, and switch to
the new fiber.

Using setTimer with a zero timeout (a.k.a. schedule())
makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: bosagora#1117
AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 14, 2020
The runTask() call will spawn a new fiber
and switch to the new fiber immediately.
However we want to start listening for incoming
requests before doing network discovery.

Using setTimer with a zero timeout (a.k.a. schedule())
makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: bosagora#1117
AndrejMitrovic added a commit that referenced this issue Aug 14, 2020
The runTask() call will spawn a new fiber
and switch to the new fiber immediately.
However we want to start listening for incoming
requests before doing network discovery.

Using setTimer with a zero timeout (a.k.a. schedule())
makes the node open a listening port right away,
rather than waiting for the first discovery round
to complete.

It increases the node's connectivity to other nodes
which may want to establish a connection,
and consequently reduces CI failures.

Part of: #1117
@AndrejMitrovic
Contributor Author

I think I see another issue:

https://github.com/bpfkorea/agora/blob/2f4efaa190d0d2a5c42b79d183d9c11222716976/source/agora/node/Validator.d#L210-L212

After the initial network discovery is complete, 5 seconds later the same set of keys will be passed to the node.

But in the NetworkManager we don't check whether we're already connected to the nodes with the given keys: https://github.com/bpfkorea/agora/blob/2f4efaa190d0d2a5c42b79d183d9c11222716976/source/agora/network/NetworkManager.d#L421
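A possible guard, sketched here with assumed field and type names (not the actual NetworkManager API):

```d
/// Sketch only: `peers` is assumed to map a validator's PublicKey to an
/// established NetworkClient; the real NetworkManager fields may differ.
void discover (Set!PublicKey required_keys)
{
    Set!PublicKey missing;
    foreach (key; required_keys)
        if (key !in this.peers)  // already connected => skip
            missing.put(key);

    if (missing.length == 0)
        return;  // nothing to do: all required validators are connected

    this.startDiscoveryTasks(missing);  // illustrative
}
```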

@AndrejMitrovic
Contributor Author

Another issue: Sometimes a FullNode rejects a block received from getBlocksFrom. Unfortunately the log line was emitted at trace level and therefore hidden from the integration tests:

log.trace("Rejected block: {}: {}", fail_reason, block.prettify());

@AndrejMitrovic
Contributor Author

Here one node tries to connect to another node and call getPublicKey:

node-3_1_6cb3b1c7873b | 2020-08-14T07:42:45.038341153Z 2020-08-14 07:42:45,37 Info [agora.network.NetworkManager] - ConnectionTask: getPublicKey to: http://node-5:5826

However the node at this port only starts listening 3 seconds later:

node-5_1_6d09f00939f4 | REST route: POST2020-08-14 07:42:48,732 Info [agora.node.Runner] - About to listen to HTTP: 5826

Now the question is: what happened to the timeouts?

@AndrejMitrovic
Contributor Author

I found what triggers the bug. Here's my debugging log:

# main.d failure for node 7 at 05:06:06
core.exception.AssertError@tests/system/source/main.d(42): Check condition failed after timeout of 1 minute and 600 attempts2020-Aug-18 05:06:06.941664Z: Node 6 (this is node-7_1) has not completed discovery after 1 minute.

# node 7 last message at 05:05:04 (therefore thread did not freeze)
node-7_1  | 2020-08-18T05:05:04.737464254Z 2020-08-18 05:05:04,733 Info [agora.network.NetworkClient] - Calling getPublicKey on http://node-7:7826

# node 7 still missing node 5 at 05:05:03,734:
node-7_1  | 2020-08-18T05:05:04.737445312Z 2020-08-18 05:05:03,734 Info [agora.network.NetworkManager] - Found validator (1 missing - [GDNODE5T7TWJ2S4UQSTM7KDHU2HQHCJUXFYLPZDDYGXIBUAH3U3PJQC2])

# node 5 opened port at 05:05:03,630
node-5_1  | 2020-08-18T05:05:03.630845791Z 2020-08-18 05:05:03,630 Info [agora.node.Runner] - About to listen to HTTP: 5826

# node 7 called public key on 05:05:02,703
node-7_1  | 2020-08-18T05:05:02.704400058Z 2020-08-18 05:05:02,703 Info [agora.network.NetworkClient] - Calling getPublicKey on http://node-5:5826
node-7_1  | 2020-08-18T05:05:02.726375356Z 2020-08-18 05:05:02,725 Info [agora.network.NetworkClient] - Exception calling getPublicKey on http://node-5:5826: object.Exception@submodules/vibe-core/source/vibe/core/net.d(229): Failed to connect to 192.168.144.7:5826: refused
node-7_1  | 2020-08-18T05:05:02.726378277Z 2020-08-18 05:05:02,725 Info [agora.network.NetworkClient] - Will retry getPublicKey to http://node-5:5826 in 50 ms

# node 7 tries again the same request at  05:05:02,777
node-7_1  | 2020-08-18T05:05:02.780841934Z 2020-08-18 05:05:02,777 Info [agora.network.NetworkClient] - Calling getPublicKey on http://node-5:5826

# However there was no exception / timeout for this second request. There are many other messages about other requests so the thread did not freeze.

tl;dr: The nodes each open their ports at different times (sometimes up to a 5-second difference, because of the way docker-compose works). When one node tries to connect to another node whose port is not open yet, the first request throws an exception, but the second request to this node never times out.

It seems that when this exception is thrown:

object.Exception@submodules/vibe-core/source/vibe/core/net.d(229): Failed to connect to 192.168.144.7:5826: refused

The first request to the given address fails with this exception; however, the second request does not throw this exception. The request doesn't time out either.


I need to add more logging and see what actually happens to the second request to the closed port. I think this is a vibe.d bug.
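A minimal repro sketch of the suspected behavior. connectTCP and the vibe.d entry points are real; the closed port and the expected outcome are assumptions based on the logs above. Note that the hang later turned out to involve resolveHost, so a hostname rather than a literal IP may be needed to actually reproduce it.

```d
import vibe.core.core : runTask, runApplication;
import vibe.core.net : connectTCP;
import vibe.core.log : logInfo;

void main ()
{
    runTask({
        // Port 5826 is assumed to be closed, mirroring the CI scenario
        // where node-5 has not opened its listening port yet.
        foreach (attempt; 1 .. 3)
        {
            try
            {
                auto conn = connectTCP("127.0.0.1", 5826);
                conn.close();
            }
            catch (Exception ex)
                // Per the logs, only the first attempt reaches this
                // handler ("Failed to connect: refused"); the second
                // attempt neither throws nor times out.
                logInfo("Attempt %s failed: %s", attempt, ex.msg);
        }
    });
    runApplication();
}
```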

@hewison-chris
Contributor

> The first request to the given address fails with this exception; however, the second request does not throw this exception. The request doesn't time out either.
>
> I need to add more logging and see what actually happens to the second request to the closed port. I think this is a vibe.d bug.

So if we waited 10 seconds before trying to connect to any nodes, we probably would not fail on the first request.

@AndrejMitrovic
Contributor Author

Yes but we have request retries in place, so this should all be handled gracefully already.

@hewison-chris
Copy link
Contributor

I just meant to prove your theory that second attempts are not being handled correctly. Although I guess we should add a test for the case where a node comes online after the first check has happened. That way we can reproduce the error without re-running until it fails.

@AndrejMitrovic
Contributor Author

Yes it would be good to reproduce it properly.

@AndrejMitrovic
Contributor Author

This call never returns on the second request to the node: https://github.com/vibe-d/vibe-core/blob/b417214e325a7fb684817b4ae7b218eec46c11c1/source/vibe/core/net.d#L35

I've noticed that resolveHost does not use any timeout here: https://github.com/vibe-d/vibe-core/blob/b417214e325a7fb684817b4ae7b218eec46c11c1/source/vibe/core/net.d#L69

It's odd that a timeout is not used there but is actually used in the call to connectTCPWithTimeout:

https://github.com/bpfkorea/vibe.d/blob/35079d4cf348e36785783480535ff186f2e4ef02/http/vibe/http/client.d#L703-L706

It seems to me that resolveHost should also take a timeout parameter.
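The proposed shape of the fix might look like this. This is a sketch: the Duration parameter on resolveHost is the proposed addition, not the vibe-core signature at the time.

```d
import std.socket : AddressFamily;
import vibe.core.net : NetworkAddress;
import core.time : Duration, seconds;

/// Sketch of the proposed signature: resolveHost taking an explicit
/// timeout, mirroring how connectTCPWithTimeout already bounds the
/// connect step. The timeout would be threaded down to the event
/// driver's DNS lookup.
NetworkAddress resolveHost (string host,
    AddressFamily family = AddressFamily.UNSPEC,
    bool use_dns = true,
    Duration timeout = 10.seconds);  // proposed parameter
```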

@AndrejMitrovic
Contributor Author

AndrejMitrovic commented Aug 18, 2020

I've asked Sonke: vibe-d/vibe.d#2347 (comment)

Edit: He agrees that we should add a timeout.

@AndrejMitrovic
Contributor Author

This commit changes resolveHost to use a timeout: d416462

It seems to make CircleCI go green fairly reliably.

However I still need to figure out a better workaround for this: AndrejMitrovic/eventcore@862b5d4...2b66a20#diff-5515d1895d07303e45a19fac726f59ec. Without this workaround an assertion is triggered here: https://github.com/AndrejMitrovic/eventcore/blob/2b66a20961d3743bfeaeb08e87db71d970c653b7/source/eventcore/drivers/posix/dns.d#L76 - which means the handle returned from allocateHandle is reused. My workaround just makes sure a unique handle is always returned.

Of course the underlying problem is still there, namely resolveHost seems to block the fiber forever.

There is also still an occasional failure on the line assert(getHeight == height);, but that seems to be a different problem: FullNode doesn't accept an externalized block (#1117 (comment)).

@AndrejMitrovic
Contributor Author

I opened some PRs:

However it might take some time until these are resolved and merged. Maybe we should just pause running the full system integration tests for the time being.

AndrejMitrovic added a commit to AndrejMitrovic/agora that referenced this issue Aug 19, 2020
This still does a basic build of Agora and runs -help
on it, but stops short of testing the nodes until
the issues in bosagora#1117 are resolved.
@bpalaggi bpalaggi added the C.General An issue which doesn't fit in the other categories (not blockchain related) label Sep 1, 2020
@Geod24
Collaborator

Geod24 commented Oct 26, 2020

Are we still seeing failures here?

@Geod24
Collaborator

Geod24 commented Jan 28, 2021

CircleCI has been stable for a few months, closing.

@Geod24 Geod24 closed this as completed Jan 28, 2021