
[QuorumStore] Fix batch requester #7190

Merged: 15 commits merged into main on Mar 17, 2023
Conversation

@danielxiangzl (Contributor) commented Mar 15, 2023

Description

Fix the batch requester so that it sends batch requests to peers every retry_timeout interval (each request is an RPC bounded by rpc_timeout), up to retry_limit times. Also fix the related counters and move the related parameters to the QuorumStore config file.
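
A minimal sketch of the intended flow described above, assuming hypothetical names (PeerId, Batch, and request_batch_rpc are placeholders, not the actual BatchRequester API): every retry_timeout tick a new round of RPCs is fired, each bounded by rpc_timeout, and the requester gives up once retry_limit rounds have been issued and nothing is left in flight.

use std::time::Duration;

use futures::{stream::FuturesUnordered, StreamExt};
use tokio::time;

type PeerId = u64;
struct Batch;

// Placeholder for the real RPC; assumed to resolve within `rpc_timeout`.
async fn request_batch_rpc(_peer: PeerId, _rpc_timeout: Duration) -> Result<Batch, ()> {
    Err(())
}

async fn get_batch(
    peer_rounds: Vec<Vec<PeerId>>, // one peer set per retry, len == retry_limit
    retry_timeout: Duration,
    rpc_timeout: Duration,
) -> Option<Batch> {
    let mut interval = time::interval(retry_timeout);
    let mut futures = FuturesUnordered::new();
    let mut rounds = peer_rounds.into_iter();
    loop {
        tokio::select! {
            _ = interval.tick() => {
                if let Some(peers) = rounds.next() {
                    // Fire a new round of batch requests to a fresh set of peers.
                    for peer in peers {
                        futures.push(request_batch_rpc(peer, rpc_timeout));
                    }
                } else if futures.is_empty() {
                    // retry_limit exhausted and nothing left in flight: give up.
                    return None;
                }
            }
            Some(response) = futures.next() => {
                if let Ok(batch) = response {
                    return Some(batch);
                }
            }
        }
    }
}

The select! loop lets responses from earlier rounds race against the next retry tick, which is how a single valid response can short-circuit the remaining retries.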

Test Plan

@danielxiangzl changed the title from [QuorumStore] Fix counter to [QuorumStore] Fix batch requester on Mar 16, 2023
    }
} else {
    // retry_limit exceeded, failed to get batch from peers
    break;
Contributor

Hm, so if we break here, we only give the last set of peers 1 second to return RPC responses? Instead of breaking immediately, is the behavior we want to wait until the futures are drained?

Contributor Author

Good point! Resolved.
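
As I read the resolution, the fix is to keep polling the outstanding RPCs after the retry limit is hit rather than breaking out immediately, so the last round of peers still gets the full rpc_timeout. A minimal sketch, assuming a hypothetical drain_remaining helper and a placeholder Batch type (not the PR's exact code):

use futures::{stream::FuturesUnordered, StreamExt};

struct Batch;
impl Batch {
    fn verify(&self) -> Result<(), ()> {
        Ok(())
    }
}

// Drain the in-flight requests instead of breaking immediately.
async fn drain_remaining<F>(mut futures: FuturesUnordered<F>) -> Option<Batch>
where
    F: std::future::Future<Output = Result<Batch, ()>>,
{
    while let Some(response) = futures.next().await {
        if let Ok(batch) = response {
            if batch.verify().is_ok() {
                return Some(batch);
            }
        }
    }
    // All outstanding requests finished without a valid batch.
    None
}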


    for peer in request_peers {
        futures.push(network_sender.request_batch(request.clone(), peer, rpc_timeout));
    }
}
Contributor

Would this work?

} else if futures.is_empty() {
    break;
}

Having a separate num_futures counter just feels a bit brittle.

Contributor Author

Cool! Thanks
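
For what it's worth, FuturesUnordered already tracks its own size, so the termination check can key off the collection itself instead of a hand-maintained counter, as suggested above. A tiny illustration with a hypothetical helper name:

use futures::stream::FuturesUnordered;

// True once no retries remain and no request is still in flight.
fn exhausted<F>(futures: &FuturesUnordered<F>, retries_left: usize) -> bool {
    retries_left == 0 && futures.is_empty()
}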

let network_sender = self.network_sender.clone();
let request_num_peers = self.request_num_peers;
let my_peer_id = self.my_peer_id;
let epoch = self.epoch;
let timeout = Duration::from_millis(self.request_timeout_ms as u64);
let retry_timeout = Duration::from_millis(self.retry_timeout_ms as u64);
Contributor

The name is confusing to me; we should probably call it "retry_interval" or something similar.
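
To illustrate the suggested rename (the struct and field names here are hypothetical, not the actual aptos QuorumStore config): an *_interval name signals that the knob paces retry rounds, while the per-RPC deadline stays a *_timeout.

use serde::{Deserialize, Serialize};

#[derive(Clone, Debug, Deserialize, Serialize)]
pub struct BatchRequesterConfig {
    /// Peers contacted per retry round.
    pub request_num_peers: usize,
    /// Maximum number of retry rounds.
    pub retry_limit: usize,
    /// Time between consecutive retry rounds (the value discussed here).
    pub retry_interval_ms: u64,
    /// Deadline for a single batch-request RPC.
    pub rpc_timeout_ms: u64,
}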

num_futures -= 1;
if let Ok(batch) = response {
    counters::RECEIVED_BATCH_RESPONSE_COUNT.inc();
    if batch.verify().is_ok() {
Contributor

I think you need a rebase


if self.num_retries == 0 {
    let mut rng = rand::thread_rng();
    // make sure nodes request from different sets of nodes
    self.next_index = rng.gen::<usize>() % self.signers.len();
Contributor

Are peers shuffled or sorted? If sorted, can it be a problem that we are using a consecutive range?

Contributor

I think it's shuffled

proof.shuffled_signers(&self.validator_verifier),
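
A simplified sketch of the selection scheme being discussed, with hypothetical helper names: since the signer list comes pre-shuffled per proof, each requester picks a random starting offset once and then walks consecutive chunks (wrapping around), one chunk per retry round, so different nodes end up requesting from different subsets.

use rand::Rng;

type PeerId = u64;

// Pick a random starting offset into the (already shuffled) signer list.
fn initial_index(signers_len: usize) -> usize {
    rand::thread_rng().gen::<usize>() % signers_len
}

// Take the next consecutive chunk of `num_peers` signers, wrapping around.
fn next_request_peers(signers: &[PeerId], next_index: &mut usize, num_peers: usize) -> Vec<PeerId> {
    let mut peers = Vec::with_capacity(num_peers);
    for _ in 0..num_peers.min(signers.len()) {
        peers.push(signers[*next_index]);
        *next_index = (*next_index + 1) % signers.len();
    }
    peers
}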


@github-actions
Contributor

✅ Forge suite land_blocking success on 925f4c14651126f86cef0897cfaae520d6e5ac36

performance benchmark with full nodes : 5771 TPS, 6883 ms latency, 10800 ms p99 latency, no expired txns
Test Ok

@github-actions
Contributor

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 925f4c14651126f86cef0897cfaae520d6e5ac36

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 925f4c14651126f86cef0897cfaae520d6e5ac36 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7877 TPS, 4900 ms latency, 6900 ms p99 latency, no expired txns
2. Upgrading first Validator to new version: 925f4c14651126f86cef0897cfaae520d6e5ac36
compatibility::simple-validator-upgrade::single-validator-upgrade : 5086 TPS, 7721 ms latency, 9500 ms p99 latency, no expired txns
3. Upgrading rest of first batch to new version: 925f4c14651126f86cef0897cfaae520d6e5ac36
compatibility::simple-validator-upgrade::half-validator-upgrade : 4826 TPS, 8491 ms latency, 11100 ms p99 latency, no expired txns
4. Upgrading second batch to new version: 925f4c14651126f86cef0897cfaae520d6e5ac36
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6747 TPS, 5779 ms latency, 9000 ms p99 latency, no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 925f4c14651126f86cef0897cfaae520d6e5ac36 passed
Test Ok

@danielxiangzl merged commit faa75d2 into main on Mar 17, 2023
@danielxiangzl deleted the daniel-qs-counter branch on March 17, 2023 06:53
@github-actions
Contributor

✅ Forge suite framework_upgrade success on cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 925f4c14651126f86cef0897cfaae520d6e5ac36

Compatibility test results for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 925f4c14651126f86cef0897cfaae520d6e5ac36 (PR)
Upgrade the nodes to version: 925f4c14651126f86cef0897cfaae520d6e5ac36
framework_upgrade::framework-upgrade::full-framework-upgrade : 6878 TPS, 5698 ms latency, 7700 ms p99 latency, no expired txns
5. check swarm health
Compatibility test for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 925f4c14651126f86cef0897cfaae520d6e5ac36 passed
Test Ok

bchocho pushed a commit that referenced this pull request Mar 17, 2023
* fix batch request counter, add retry limit to config

* new batch requester
bchocho added a commit that referenced this pull request Mar 17, 2023
* fix batch request counter, add retry limit to config

* new batch requester

Co-authored-by: danielx <[email protected]>
@bchocho mentioned this pull request on Mar 17, 2023