
Enabling the node permission contract leads to CPU overload and losing peers #10121

Closed
VladLupashevskyi opened this issue Jan 2, 2019 · 3 comments

@VladLupashevskyi
Contributor

  • Parity Ethereum version: 1.11.11 and 2.1.10
  • Operating system: MacOS / Linux
  • Installation: homebrew and one-line installer
  • Fully synchronized: yes
  • Restarted: yes

After setting up the chain with a node permission contract, all Parity nodes lose all their peers within a couple of hours. The contract allows access to only a few enode addresses and has the following logic:

function connectionAllowed(bytes32 /*sl*/, bytes32 /*sh*/, bytes32 pl, bytes32 ph) public view returns (bool) {
    bool plAllowed = peerAllowedL[pl];
    bool phAllowed = peerAllowedH[ph];
    return plAllowed && phAllowed;
}
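
For context: as I read the filter's ABI, sl/sh and pl/ph are the low and high 32-byte halves of the validating node's and the peer's 512-bit enode IDs, and peerAllowedL/peerAllowedH are whitelist mappings declared elsewhere in the contract. A minimal Rust sketch of how one enode ID maps onto two bytes32 words (my illustration; I am assuming the low half comes first, the real ordering lives in ethcore/node-filter):

// Hypothetical helper: split a 512-bit enode ID into the (pl, ph)
// words passed to connectionAllowed. The low/high ordering here is
// an assumption, not taken from the node-filter source.
fn split_enode_id(id: &[u8; 64]) -> ([u8; 32], [u8; 32]) {
    let mut low = [0u8; 32];
    let mut high = [0u8; 32];
    low.copy_from_slice(&id[..32]);
    high.copy_from_slice(&id[32..]);
    (low, high)
}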

While debugging this issue I found that when Parity has zero peers, one of its threads consumes 100% CPU, which means Parity does not have enough time to react to nodes trying to connect (something I learned from similar issues here about 0 peers).

After removing nodePermissionContract from the JSON spec, the connections stabilize, as does the CPU usage.

I went further into the source code and tried to figure out what was going wrong.

I removed the caching part of the code (LruCache), and the node started to work stably even with the node permission contract enabled:

https://github.com/paritytech/parity-ethereum/blob/c077dc652d92535cff0afc9ae096c795c4cf39e3/ethcore/node-filter/src/lib.rs#L87-L90

I do not have much experience with caching, but it seems to me that either the problem appears once the cache gets full, or some kind of deadlock happens.
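
To illustrate the suspicion, here is a simplified stand-in for the cached filter (my sketch, not the actual node-filter code: a plain HashMap replaces the LruCache, and whether the real code holds the lock across the contract call is my assumption):

use std::collections::HashMap;
use std::sync::Mutex;

// Every permission check must take the exclusive lock, even for a pure
// lookup, so all concurrent handshake checks serialize on this Mutex.
struct NodeFilter {
    cache: Mutex<HashMap<[u8; 64], bool>>, // peer enode ID -> allowed
}

impl NodeFilter {
    fn connection_allowed(&self, peer: [u8; 64]) -> bool {
        let mut cache = self.cache.lock().unwrap();
        if let Some(&allowed) = cache.get(&peer) {
            return allowed;
        }
        // If the (slow) on-chain call happens while the lock is held,
        // every other peer check stalls behind it.
        let allowed = self.query_contract(&peer);
        cache.insert(peer, allowed);
        allowed
    }

    fn query_contract(&self, _peer: &[u8; 64]) -> bool {
        true // placeholder for the contract's connectionAllowed call
    }
}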

I would add a boot flag like --disable-node-permission-caching in order to provide a quick fix for blockchains that currently rely on this contract.

What do you think about that?

P.S.: During debugging I found out that the connection-allowed check happens BEFORE the network ID check, which leads to a huge number of unnecessary calls to the contract. Was this done intentionally? I would consider moving the check to the point where the network ID has already been checked, if that is possible of course!

@jam10o-new jam10o-new added F7-optimisation 💊 An enhancement to provide better overall performance in terms of time-to-completion for a task. F3-annoyance 💩 The client behaves within expectations, however this “expected behaviour” itself is at issue. M4-core ⛓ Core client code / Rust. labels Jan 2, 2019
@jam10o-new jam10o-new added this to the 2.4 milestone Jan 2, 2019
@jam10o-new
Contributor

cc @grbIzl?

@tomusdrw
Collaborator

tomusdrw commented Jan 2, 2019

@VladLupashevskyi

  1. Moving the check after the Status message seems reasonable to me; I think we don't exchange any info with unconfirmed peers anyway.
  2. Regarding the cache - I think the issue is caused by the Mutex - it requires exclusive access to check the cache, while the rest of the code could easily run in parallel. Instead of disabling the cache via a CLI flag I would either:
    2.1. Disable it completely, or
    2.2. Introduce a different cache behind an RwLock:
cache: RwLock<BTreeMap<BlockNumber, HashSet<ConnectionIds>>>,

This would allow us to easily drop entries for old blocks, and also drop random entries from the HashSet should it grow too big.
Perhaps we could even parametrize this and allow the cache to persist for multiple blocks (i.e. on-chain changes would take effect with a delay). In your case this could be set to infinity, drastically increasing the cache hit ratio.
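
A rough sketch of what that structure could look like (my own illustration, assuming ConnectionIds is something like the peer's 64-byte enode ID):

use std::collections::{BTreeMap, HashSet};
use std::sync::RwLock;

type BlockNumber = u64;
type ConnectionId = [u8; 64]; // assumed: the peer's enode ID

struct PermissionCache {
    // Reads (the common case) take only the shared lock, so permission
    // checks can run in parallel; the exclusive lock is needed only to
    // insert new entries or prune old blocks.
    cache: RwLock<BTreeMap<BlockNumber, HashSet<ConnectionId>>>,
}

impl PermissionCache {
    fn is_allowed(&self, block: BlockNumber, peer: &ConnectionId) -> bool {
        self.cache
            .read()
            .unwrap()
            .get(&block)
            .map_or(false, |set| set.contains(peer))
    }

    fn allow(&self, block: BlockNumber, peer: ConnectionId) {
        self.cache
            .write()
            .unwrap()
            .entry(block)
            .or_default()
            .insert(peer);
    }

    fn prune_before(&self, block: BlockNumber) {
        let mut cache = self.cache.write().unwrap();
        // split_off keeps entries >= `block` in the returned map.
        let recent = cache.split_off(&block);
        *cache = recent;
    }
}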

@grbIzl
Collaborator

grbIzl commented Jan 7, 2019

I have created a PR removing the cache, see #10143
I don't see a significant difference from the structure proposed by @tomusdrw (we will still need a read and a write lock per call).
As for moving this check behind STATUS packet handling: I understand the reasoning, that it would improve performance. At the same time, it wouldn't comply with the general idea of a connection filter. Because it's a CONNECTION filter :-) The STATUS packet exchange happens AFTER the connection is established. I can't predict what happens or can happen after establishing the connection but before the STATUS exchange. In general we want to prevent any exchange with such a filter.
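
To make the "read and write lock per call" point concrete, here is a minimal standalone illustration (my sketch, with a hypothetical query_contract stub) of the miss path that still needs both lock acquisitions:

use std::collections::HashSet;
use std::sync::RwLock;

fn query_contract(_peer: &[u8; 64]) -> bool {
    true // placeholder for the on-chain connectionAllowed call
}

fn connection_allowed(cache: &RwLock<HashSet<[u8; 64]>>, peer: [u8; 64]) -> bool {
    if cache.read().unwrap().contains(&peer) {
        return true; // hit: shared lock only
    }
    let allowed = query_contract(&peer); // no lock held during the call
    if allowed {
        cache.write().unwrap().insert(peer); // miss: second, exclusive lock
    }
    allowed
}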
