This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Node table limiting and cache for node filter #10288

Merged (15 commits) on Apr 5, 2019

Conversation

VladLupashevskyi
Contributor

After removing the node filter cache (#10121), nodes were still losing their peers; however, in this case the time at which they started to lose them was very dependent on CPU power. When I was testing the fix it worked well for a couple of days on my 4th-gen i7, but when I left it running for a longer time the peer loss started to occur again.

I have done some research and found out that the problem was in this line:
https://github.com/paritytech/parity-ethereum/blob/06cae8a53556dede67e163325657bee3fba8feda/util/network-devp2p/src/host.rs#L598

where the contract state was being queried for every node. The interesting thing I found, though, is that the node table was not limited in size and was always growing. After 4 hours of running there were already 10k entries, so the state was being queried 10k times each second. In this case introducing a cache alone is not really a solution, because once we hit its limit the querying starts again.

At first I thought I had found a typo in the update_table fn, where a ; was missing after the for loop:
https://github.com/paritytech/parity-ethereum/blob/06cae8a53556dede67e163325657bee3fba8feda/util/network-devp2p/src/node_table.rs#L349
But, due to my lack of experience in Rust, adding the ; was not helpful.

So I decided to introduce node table limiting, which allows caching to be used for the node filter but also guarantees more predictable behaviour for nodes that don't use it, so that a large table won't cause performance issues when a node runs for a very long time.

The limiting works as follows: when adding a new node would exceed the limit, the least important node gets removed (either the failed node with the most recent last_connected state, or a node with an unknown last_connected state, or the node with the oldest successful last_connected state).
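To make that eviction order concrete, here is a minimal sketch with simplified, made-up types; the LastContact enum and the pick_eviction_candidate helper below are illustrative only, not the PR's actual NodeTable code:

```rust
use std::time::{Duration, SystemTime};

#[derive(Clone, Copy, Debug, PartialEq)]
enum LastContact {
    // Failed contacts are evicted first, most recent failure first.
    Failure(SystemTime),
    // Then entries we know nothing about.
    Unknown,
    // Successful contacts are kept longest; the oldest success goes last.
    Success(SystemTime),
}

/// Returns the index of the entry to evict once the table is full:
/// 1. the failure with the most recent timestamp,
/// 2. otherwise any unknown entry,
/// 3. otherwise the success with the oldest timestamp.
fn pick_eviction_candidate(entries: &[LastContact]) -> Option<usize> {
    let most_recent_failure = entries
        .iter()
        .enumerate()
        .filter_map(|(i, c)| match c {
            LastContact::Failure(t) => Some((i, *t)),
            _ => None,
        })
        .max_by_key(|&(_, t)| t)
        .map(|(i, _)| i);
    if most_recent_failure.is_some() {
        return most_recent_failure;
    }

    let unknown = entries.iter().position(|c| *c == LastContact::Unknown);
    if unknown.is_some() {
        return unknown;
    }

    entries
        .iter()
        .enumerate()
        .filter_map(|(i, c)| match c {
            LastContact::Success(t) => Some((i, *t)),
            _ => None,
        })
        .min_by_key(|&(_, t)| t)
        .map(|(i, _)| i)
}

fn main() {
    let now = SystemTime::now();
    let table = vec![
        LastContact::Success(now),
        LastContact::Failure(now - Duration::from_secs(60)),
        LastContact::Unknown,
        LastContact::Failure(now),
    ];
    // The most recent failure (index 3) is the first eviction candidate.
    assert_eq!(pick_eviction_candidate(&table), Some(3));
}
```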

I introduced a simple cache for the node filter based on RwLock, which is reset on each new block. It's not a perfect solution, but I would say it's more generic, and it's especially useful for chains where the filter gets dynamically updated. As a further improvement, I could see introducing a new version of the node permission contract that emits a NodeTableUpdated event when the list of allowed nodes changes, and clearing the cache only in that case; for the current contract version, we could check each block for transactions to the node permission contract and clear the cache only if any are present (I'm not sure how that would affect performance, but I believe it's better than clearing the cache on every block).
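A minimal sketch of that caching idea, assuming hypothetical names (NodeId, check_contract, and on_new_block are stand-ins for illustration, not the PR's actual node-filter API):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type NodeId = [u8; 64];

struct PermissionCache {
    allowed: RwLock<HashMap<NodeId, bool>>,
}

impl PermissionCache {
    fn new() -> Self {
        PermissionCache { allowed: RwLock::new(HashMap::new()) }
    }

    /// Fast path: many reader threads can check a cached verdict concurrently.
    fn is_allowed(&self, id: &NodeId, check_contract: impl Fn(&NodeId) -> bool) -> bool {
        if let Some(&cached) = self.allowed.read().expect("lock not poisoned").get(id) {
            return cached;
        }
        // Slow path: query the permission contract once and memoize the result.
        let verdict = check_contract(id);
        self.allowed
            .write()
            .expect("lock not poisoned")
            .insert(*id, verdict);
        verdict
    }

    /// Called from the chain notification handler: a new block may have
    /// changed the on-chain allow list, so drop everything.
    fn on_new_block(&self) {
        self.allowed.write().expect("lock not poisoned").clear();
    }
}

fn main() {
    let cache = PermissionCache::new();
    let peer: NodeId = [0u8; 64];
    // First call hits the "contract", second call is served from the cache.
    assert!(cache.is_allowed(&peer, |_| true));
    assert!(cache.is_allowed(&peer, |_| false));
    cache.on_new_block();
}
```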

I have done some checks: on the current 2.3.0-beta, with the node filter contract enabled, CPU load starts at 30% and keeps growing. With this update, the CPU load stays at a constant 15-30% on an AuRa chain with a 4-second block time. With the contract disabled, the load is 5-10%.

Looking forward to any comments or suggestions on these changes :)

@parity-cla-bot

It looks like @VladLupashevskyi signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot


@5chdn 5chdn added this to the 2.4 milestone Feb 7, 2019
@5chdn 5chdn added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. M2-config 📂 Chain specifications and node configurations. labels Feb 7, 2019
Contributor

@HCastano HCastano left a comment


I took a quick peek through the code and it looks like there are a few .unwrap() calls throughout. As a general rule we try to avoid panics, so you're either going to need to:

a) Remove the call to .unwrap() and handle the failure case more gracefully, or
b) Write a little proof stating why this will not panic, and include that within an .expect() call.
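For illustration only (the queue and surrounding logic below are made up, not code from this PR), the two options look roughly like this:

```rust
use std::collections::VecDeque;

// Option a) propagate the absence instead of panicking.
fn next_item(queue: &mut VecDeque<u32>) -> Option<u32> {
    queue.pop_front()
}

// Option b) keep the potential panic, but document why it cannot happen.
fn first_item(queue: &VecDeque<u32>) -> u32 {
    *queue
        .front()
        .expect("caller checks the queue is non-empty before calling; qed")
}

fn main() {
    let mut queue: VecDeque<u32> = VecDeque::from(vec![1, 2, 3]);
    assert_eq!(first_item(&queue), 1);
    assert_eq!(next_item(&mut queue), Some(1));
}
```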

Contributor

@ascjones ascjones left a comment


Thanks for the PR. Just realised I had left this review in progress, but I have similar comments to @HCastano regarding the unwraps.

Review threads on util/network-devp2p/src/node_table.rs and ethcore/node-filter/src/lib.rs.

use_contract!(peer_set, "res/peer_set.json");

/// Connection filter that uses a contract to manage permissions.
pub struct NodeFilter {
client: Weak<BlockChainClient>,
contract_address: Address,
cache: RwLock<Cache>
Collaborator


If we understand the real reason for the CPU overload (the node table not being limited), can we just roll back #10143 and return the previous LruCache that was used here? I frankly don't see the advantages of the proposed solution in comparison with the previous one.

Contributor Author


The idea of implementing it via RwLock was to make it possible to read the cache from different threads without blocking each other.
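Roughly the distinction being made, as a toy example (none of this is code from the PR): concurrent readers of an RwLock don't block each other, whereas a Mutex would serialize every lookup.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    let cache = Arc::new(RwLock::new(HashMap::from([(1u64, true), (2, false)])));

    let readers: Vec<_> = (0..4)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                // Read guards do not exclude each other, so all four threads
                // can inspect the cache at the same time.
                let guard = cache.read().expect("lock not poisoned");
                guard.get(&1).copied().unwrap_or(false)
            })
        })
        .collect();

    for handle in readers {
        let _ = handle.join();
    }
}
```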

Collaborator


Ok. Then maybe just change the previous Mutex to an RwLock? Because the order field in Cache is just an implementation of LRU.

Contributor Author

@VladLupashevskyi VladLupashevskyi Feb 12, 2019


Got it! I think that makes sense. Sorry, I didn't notice that; I was thinking that LruCache already came with a built-in Mutex 😅

Contributor Author


Actually I cannot really use LruCache, since it does not provide a non-mutable get method, so it's not possible to use an RwLock with it.
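The underlying issue, sketched here with a hand-rolled LRU rather than any particular crate: even a pure lookup has to bump the entry's recency, so it needs &mut self, and behind an RwLock that means taking the write lock for every read.

```rust
use std::collections::HashMap;

struct TinyLru<K: std::hash::Hash + Eq + Clone, V> {
    map: HashMap<K, V>,
    // Most recently used key sits at the back.
    order: Vec<K>,
}

impl<K: std::hash::Hash + Eq + Clone, V> TinyLru<K, V> {
    fn get(&mut self, key: &K) -> Option<&V> {
        if self.map.contains_key(key) {
            // The side effect that forces &mut self: move the key to the back.
            self.order.retain(|k| k != key);
            self.order.push(key.clone());
            self.map.get(key)
        } else {
            None
        }
    }
}

fn main() {
    let mut lru = TinyLru { map: HashMap::new(), order: Vec::new() };
    lru.map.insert("a", 1);
    lru.order.push("a");
    assert_eq!(lru.get(&"a"), Some(&1));
}
```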

@grbIzl
Collaborator

grbIzl commented Feb 12, 2019

@VladLupashevskyi In this part of code: https://github.com/paritytech/parity-ethereum/blob/d89b8d904ff28382d3692290a936706304627255/util/network-devp2p/src/host.rs#L595 we create an iterator over the node table and take some number of nodes from it (~30 nodes). So the only scenario where such performance degradation is possible is when you have a lot of nodes forbidden by your permission contract and the iterator goes over them looking for allowed ones. I think that will be true for whitelisting permission contracts. For blacklisting contracts this code should return the required number of nodes more quickly.
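A stripped-down illustration of that point (plain integers instead of real node IDs, and a counter instead of the contract call): take(30) stops after 30 accepted nodes, but the filter predicate still runs for every rejected candidate it has to skip.

```rust
fn main() {
    let node_table: Vec<u32> = (0..10_000).collect();
    let mut contract_queries = 0;

    let connected: Vec<u32> = node_table
        .iter()
        .filter(|id| {
            contract_queries += 1;
            // Pretend only a handful of nodes are whitelisted.
            **id % 1_000 == 0
        })
        .take(30)
        .copied()
        .collect();

    // With a mostly-forbidden table this prints far more queries than accepted nodes.
    println!("accepted {} nodes, queried {} times", connected.len(), contract_queries);
}
```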

@VladLupashevskyi
Contributor Author

@VladLupashevskyi In this part of code (parity-ethereum/util/network-devp2p/src/host.rs, line 595 in d89b8d9: for id in nodes.filter(|id| ...) we create an iterator over the node table and take some number of nodes from it (~30 nodes). So the only scenario where such performance degradation is possible is when you have a lot of nodes forbidden by your permission contract and the iterator goes over them looking for allowed ones. I think that will be true for whitelisting permission contracts. For blacklisting contracts this code should return the required number of nodes more quickly.

Yes, you are absolutely right - I experienced this behaviour when I was using whitelist permissioning in the contract. But shouldn't this contract have a generic design, allowing either whitelisting or blacklisting depending on the design of the blockchain? I believe the first is crucial in permissioned consortium chains.

@grbIzl
Collaborator

grbIzl commented Feb 12, 2019

But shouldn't this contract have a generic design, allowing either whitelisting or blacklisting depending on the design of the blockchain? I believe the first is crucial in permissioned consortium chains.

I'm not arguing, it was just an analysis to help better understand the issue. We indeed must support any type of permissioning contract.

@VladLupashevskyi
Contributor Author

Here I fixed some bugs in get_index_to_insert and added tests for it. I think the most appropriate place to test this function is the table_last_contact_order test, so in order to prevent it from failing on macOS due to lost nanosecond precision I had to revert the fix 5a1dc3e that was done by @debris (the test becomes more complicated there) and use sleep(Duration::from_micros(1)) instead as a simple fix. I think it shouldn't affect the test execution time, however the code becomes slightly bulky.
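The gist of the sleep(Duration::from_micros(1)) workaround, in isolation (this is not the actual node_table test): if the platform drops sub-microsecond precision, two contacts recorded back-to-back can end up with equal timestamps, so waiting at least one microsecond between them keeps their order strict on platforms with microsecond clock resolution.

```rust
use std::thread::sleep;
use std::time::{Duration, SystemTime};

fn main() {
    let first = SystemTime::now();
    // Guarantee the next timestamp is strictly later even after the
    // platform truncates nanoseconds.
    sleep(Duration::from_micros(1));
    let second = SystemTime::now();

    assert!(second > first, "timestamps must have a strict order for the test");
}
```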

@VladLupashevskyi
Contributor Author

I also removed the get_mut method from NodeTable and introduced get instead, to prevent accidental modification of the nodes HashMap.
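The shape of that change, on a hypothetical table type (not the real NodeTable): exposing only an immutable accessor keeps callers from mutating stored entries by accident.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Node {
    failures: u32,
}

struct Table {
    nodes: HashMap<u64, Node>,
}

impl Table {
    // Before: a get_mut handing out &mut Node let any caller change table state.
    // After: read-only access; mutation goes through dedicated table methods.
    fn get(&self, id: &u64) -> Option<&Node> {
        self.nodes.get(id)
    }
}

fn main() {
    let mut nodes = HashMap::new();
    nodes.insert(7u64, Node { failures: 0 });
    let table = Table { nodes };
    assert_eq!(table.get(&7), Some(&Node { failures: 0 }));
    assert!(table.get(&8).is_none());
}
```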

@5chdn 5chdn modified the milestones: 2.4, 2.5, 2.6 Feb 21, 2019
@soc1c soc1c requested a review from ascjones February 22, 2019 14:35
@VladLupashevskyi
Contributor Author

Could somebody review this? We are looking forward to this PR being reviewed and merged if possible :)

Collaborator

@grbIzl grbIzl left a comment


I'm OK with these changes, but we need more eyes on the new node_table logic. @ngotchac can you help with one more review?

@grbIzl grbIzl requested a review from ngotchac March 28, 2019 10:17
@soc1c soc1c modified the milestones: 2.5, 2.6 Apr 2, 2019
@soc1c soc1c added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Apr 3, 2019
@soc1c soc1c merged commit 8132d38 into openethereum:master Apr 5, 2019
ordian added a commit that referenced this pull request Apr 15, 2019
* master:
  CI improvements (#10579)
  Watch transactions pool (#10558)
  fix(evmbin): make benches compile again (#10586)
  fix issue with compilation when 'slow-blocks' feature enabled (#10585)
  Reject crazy timestamps instead of truncating. (#10574)
  Node table limiting and cache for node filter (#10288)