This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Improve CRDT convergence #438

Merged (3 commits) on Jun 26, 2018

carllin
Contributor

@carllin carllin commented Jun 25, 2018

Instead of randomly choosing a peer to get gossip updates from, this PR adds an option to weight that choice by stake.
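For readers unfamiliar with weighted selection, here is a minimal sketch of choosing a peer with probability proportional to a weight such as stake. It is not the PR's implementation: `pick_weighted` and the sample stakes are hypothetical, and the caller is assumed to supply `roll` drawn uniformly from `0..total_weight`.

```rust
/// Hypothetical helper: returns the index whose weight range contains `roll`.
/// Assumes `roll` was drawn uniformly from `0..sum(weights)` by the caller,
/// so each index is chosen with probability weight / total.
fn pick_weighted(weights: &[u64], roll: u64) -> Option<usize> {
    let mut remaining = roll;
    for (index, &weight) in weights.iter().enumerate() {
        if remaining < weight {
            return Some(index);
        }
        remaining -= weight;
    }
    // Empty slice, or `roll` was not below the total weight.
    None
}

fn main() {
    // Illustrative stakes only; a zero weight is never selected.
    let stakes = [1u64, 0, 7, 2];
    let total: u64 = stakes.iter().sum();
    for roll in 0..total {
        println!("roll {} -> peer {:?}", roll, pick_weighted(&stakes, roll));
    }
}
```

In this sketch a peer with weight 0 can never be picked, which may be one reason to give every peer a nonzero default weight.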

Fixes #302

@garious garious added the CI Pull Request is ready to enter CI label Jun 25, 2018
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Jun 25, 2018
@aeyakovenko
Member

@carllin @pgarg66 I think we need a testbed for this stuff, like running 1k+ crdt nodes on 100 machine instances and verifying convergence.

@pgarg66
Contributor

pgarg66 commented Jun 25, 2018

Agreed. I am trying to create a testbed on GCP for overall network testing.
@carllin, I'll keep you in the loop. Any suggestions are welcome.

Contributor

@garious garious left a comment

This is great stuff! I offer lots of nits here, but please don't assume that implies I don't like the PR. I had an easy time following your code - thanks so much! Let's polish this one up and get it merged!


let result = weighted_strategy.calculate_weighted_remote_index(key1);

// If nobody has seen a newer update then rever to default
Contributor

revert

@@ -0,0 +1,306 @@
use crdt::ReplicatedData;
Contributor

Can you add module documentation?

pub const DEFAULT_WEIGHT: u32 = 1;

pub trait ChooseGossipPeerStrategy {
    fn choose_peer(&self, options: Vec<&ReplicatedData>) -> Result<ReplicatedData>;
Contributor

Are you able to return a reference here to avoid the clone()?

Contributor

Also, @aeyakovenko, what do you think about the use of the term "peer" here? Should we consider replacing ReplicatedData with Peer?

Contributor Author

@carllin carllin Jun 26, 2018

@garious, yeah I thought the reference would have been fine, but in the original gossip_request() code, there was a clone(), so I aligned with that. https://github.com/solana-labs/solana/blob/master/src/crdt.rs#L520. Looking at it more in depth now, it doesn't seem necessary.
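To illustrate the suggestion in this thread, here is a minimal sketch of a `choose_peer` that returns a reference tied to the lifetime of the options, so no `clone()` is needed. The `ReplicatedData` struct, the `String` error type, and `ChooseFirstPeerStrategy` are stand-ins invented for the example; the real crate's types and strategies differ.

```rust
/// Stand-in for the crate's ReplicatedData; only an id is modeled here.
#[derive(Debug, PartialEq)]
struct ReplicatedData {
    id: u64,
}

trait ChooseGossipPeerStrategy {
    /// The returned reference borrows from the caller's data (lifetime 'a),
    /// not from the Vec that holds the references, so nothing is cloned.
    fn choose_peer<'a>(
        &self,
        options: Vec<&'a ReplicatedData>,
    ) -> Result<&'a ReplicatedData, String>;
}

/// Hypothetical strategy used only to exercise the signature.
struct ChooseFirstPeerStrategy;

impl ChooseGossipPeerStrategy for ChooseFirstPeerStrategy {
    fn choose_peer<'a>(
        &self,
        options: Vec<&'a ReplicatedData>,
    ) -> Result<&'a ReplicatedData, String> {
        options
            .into_iter()
            .next()
            .ok_or_else(|| "no peers to choose from".to_string())
    }
}

fn main() {
    let peers = vec![ReplicatedData { id: 1 }, ReplicatedData { id: 2 }];
    let options: Vec<&ReplicatedData> = peers.iter().collect();
    let chosen = ChooseFirstPeerStrategy.choose_peer(options).unwrap();
    println!("chose peer {}", chosen.id);
}
```

A caller that really needs an owned value can still clone the returned reference, so the decision to copy moves to the one place that requires it.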

    fn choose_peer(&self, options: Vec<&ReplicatedData>) -> Result<ReplicatedData>;
}

pub struct ChooseRandomPeerStrategy<'a> {
Contributor

Can you add a comment that summarizes this strategy and its limitations?

}
}

pub struct ChooseWeightedPeerStrategy<'a> {
Contributor

Can you add a comment that summarizes this strategy and how it overcomes the limitations of ChooseRandomPeerStrategy?

trace!("waiting to converge:");
let mut done = false;
for _ in 0..30 {
    done = c1.read().unwrap().table.len() == 3 && c2.read().unwrap().table.len() == 3
Contributor

This was formatted with an old version of cargo fmt, so it will fail CI. If you rustup update stable, it'll automatically update rustfmt along with the compiler.

// Make sure liveness table entry contains correct result for c2
let c2_index_result_for_c4 = liveness_map.get(&c2_id);
assert!(c2_index_result_for_c4.is_some());
assert!(*(c2_index_result_for_c4.unwrap()) == c2_index_for_c4);
Contributor

assert_eq!

// Make sure liveness table entry contains correct result for c3
let c3_index_result_for_c4 = liveness_map.get(&c3_id);
assert!(c3_index_result_for_c4.is_some());
assert!(*(c3_index_result_for_c4.unwrap()) == c3_index_for_c4);
Contributor

assert_eq!
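As a sketch of what the `assert_eq!` suggestion could look like for both liveness checks, a single comparison against `Some(&expected)` covers the "entry exists" and "entry has the right value" cases together, and prints both sides on failure. The map contents and the `sample_liveness_map` helper are made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the liveness table built in the test.
fn sample_liveness_map() -> HashMap<u32, u64> {
    let mut map = HashMap::new();
    map.insert(2, 40);
    map.insert(3, 41);
    map
}

fn main() {
    let liveness_map = sample_liveness_map();
    let expected_index: u64 = 40;

    // Replaces the is_some() check plus the `assert!(a == b)` comparison.
    // On failure, assert_eq! reports the left and right values.
    assert_eq!(liveness_map.get(&2), Some(&expected_index));
    println!("liveness entry matches");
}
```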

threads.extend(dr2.thread_hdls.into_iter());
threads.extend(dr3.thread_hdls.into_iter());

for t in threads.into_iter() {
Contributor

No need for that into_iter()

threads.extend(dr1.thread_hdls.into_iter());
threads.extend(dr4.thread_hdls.into_iter());

for t in threads.into_iter() {
Contributor

No need for that into_iter()
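A short sketch of why the explicit call is redundant: a `for` loop calls `into_iter()` on its operand itself, so iterating over a `Vec` by value already consumes it. `Vec::extend` likewise accepts any `IntoIterator`. The `join_all` helper and the `u32` return type are assumptions made for the example.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical helper: joins every handle and collects the results in order.
fn join_all(threads: Vec<JoinHandle<u32>>) -> Vec<u32> {
    let mut results = Vec::new();
    // `for t in threads` moves each JoinHandle out of the Vec,
    // exactly as `for t in threads.into_iter()` would.
    for t in threads {
        results.push(t.join().unwrap());
    }
    results
}

fn main() {
    let mut threads = Vec::new();
    let first: Vec<JoinHandle<u32>> = (0..2u32).map(|i| thread::spawn(move || i)).collect();
    // extend() takes any IntoIterator, so the Vec can be passed directly.
    threads.extend(first);
    println!("{:?}", join_all(threads));
}
```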

@garious garious changed the title Issue #302: Crdt convergence Improve CRDT convergence Jun 25, 2018
@garious
Contributor

garious commented Jun 25, 2018

I took a stab at updating the title and description. Feel free to update that again. A description that explains how the resulting system is somehow "better" would be helpful context for anyone else reviewing the PR. The ticket says "Improve convergence", but in what way? More secure? Faster? Perhaps this is a question for @aeyakovenko.

@aeyakovenko
Member

aeyakovenko commented Jun 25, 2018

@garious @carllin ideally we would have a test that can measure how fast each of the following propagates through the network:

  1. membership of a new node
  2. old node leaving the network
  3. update to a RD structure (NodeNetworkInfo maybe?). Let's rename after 0.7.0 is out
  4. how stable 1 and 2 are during churn
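One way such a measurement could be prototyped, before any real testbed exists, is an in-memory simulation that counts gossip rounds until every node has seen an update. The sketch below is only that: the pull model, the uniform peer choice, the xorshift generator, and every name in it are assumptions for illustration, not the crdt module's behavior.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Simulated pull gossip: node 0 starts with an update, and in each round
/// every node asks one uniformly chosen peer and learns the update if that
/// peer had it at the start of the round.
/// Returns the number of rounds until all nodes have the update, or None
/// if that does not happen within `max_rounds`.
fn rounds_to_converge(num_nodes: usize, seed: u64, max_rounds: u32) -> Option<u32> {
    if num_nodes == 0 {
        return Some(0);
    }
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be nonzero
    let mut informed = vec![false; num_nodes];
    informed[0] = true;

    for round in 0..=max_rounds {
        if informed.iter().all(|&has_update| has_update) {
            return Some(round);
        }
        if round == max_rounds {
            break;
        }
        // Snapshot so updates learned this round only spread next round.
        let snapshot = informed.clone();
        for node in 0..num_nodes {
            let peer = (rng.next_u64() % num_nodes as u64) as usize;
            if snapshot[peer] {
                informed[node] = true;
            }
        }
    }
    None
}

fn main() {
    for &n in &[8usize, 64, 512] {
        println!("{} nodes: {:?} rounds", n, rounds_to_converge(n, 42, 10_000));
    }
}
```

Swapping the uniform peer choice for a weighted one and comparing round counts would be one rough way to see whether a strategy changes propagation speed in this toy model.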

@garious
Contributor

garious commented Jun 25, 2018

@aeyakovenko, in that list of metrics, which would you assume would change as a result of this PR?

@aeyakovenko
Member

in theory all 3, but we need some tests to experiment with this :). @carllin @pgarg66, how is your knowledge of tc-netem (http://man7.org/linux/man-pages/man8/tc-netem.8.html)?

I think we can sim gossip on a large set of nodes on just a single big machine with lots of cpu cores, since all the threads should only be doing network IO.

@pgarg66
Contributor

pgarg66 commented Jun 25, 2018

I've used netem in the past. I was planning to use it to induce network errors.

@carllin
Contributor Author

carllin commented Jun 26, 2018

I haven't used netem before, but would sign up as a willing disciple/guinea pig of the wise one: @pgarg66

@garious garious added the CI Pull Request is ready to enter CI label Jun 26, 2018
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label Jun 26, 2018
@garious garious merged commit 551f639 into solana-labs:master Jun 26, 2018
@garious
Contributor

garious commented Jun 26, 2018

Thanks a bunch @carllin. Let's keep this going. Great code, great direction. Now the challenge is finding ways to measure all the value you're adding.

vkomenda pushed a commit to vkomenda/solana that referenced this pull request Aug 29, 2021
…#438)

* Add failing tests

* No need to hold source RefMut in process_close_account and process_toggle_freeze_account
alessandrod pushed a commit to alessandrod/solana that referenced this pull request Mar 28, 2024
5 participants