-
Notifications
You must be signed in to change notification settings - Fork 11.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase base consensus timeout and make it a config #4532
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Sui has a timeout on sending transaction to the nw consensus and automatically adjusts the timeout based on recent latency from nw it has seen in case of success, or linearly increases in case of timeout. In the stress test network we see quite a few of nw timeouts from the sui transaction submission side, and we indeed [see](http://grafana.shared.internal.sui.io:3000/goto/wqnUWbGVz?orgId=1) that control delay is adjusted accordingly to the timeouts and nw is not stuck. However, it seems that current timeout adjustment logic drops timeout pretty quickly which causes timeouts again. While this is not end of the world, it is noise and probably suboptimal. It also can be noticed that current base timeout value is very close to the actual consensus delay, so the timeout has to be frequently adjusted. Therefore, a short-medium term fix (this PR) is just to bump the delay step for timeout calculations so that it is further away from median nw latency. In the future we can improve timeout calculation by using weighted avg over a longer window instead of weighting current and next timing only.
@@ -152,6 +152,7 @@ impl NodeConfig { | |||
pub struct ConsensusConfig { | |||
pub consensus_address: Multiaddr, | |||
pub consensus_db_path: PathBuf, | |||
pub delay_step: Option<u64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively I think it may be cleaner if you don't use Option
here but provide a serde default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, too late :)
I don't really have a strong opinion on that, can use default next time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit of serde default is that we can then deploy this without a new config file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is same with Option, it deserializes to None by default
Sui has a timeout on sending transaction to the nw consensus and automatically adjusts the timeout based on recent latency from nw it has seen in case of success, or linearly increases in case of timeout. In the stress test network we see quite a few of nw timeouts from the sui transaction submission side, and we indeed [see](http://grafana.shared.internal.sui.io:3000/goto/wqnUWbGVz?orgId=1) that control delay is adjusted accordingly to the timeouts and nw is not stuck. However, it seems that current timeout adjustment logic drops timeout pretty quickly which causes timeouts again. While this is not end of the world, it is noise and probably suboptimal. It also can be noticed that current base timeout value is very close to the actual consensus delay, so the timeout has to be frequently adjusted. Therefore, a short-medium term fix (this PR) is just to bump the delay step for timeout calculations so that it is further away from median nw latency. In the future we can improve timeout calculation by using weighted avg over a longer window instead of weighting current and next timing only.
Sui has a timeout on sending transaction to the nw consensus and automatically adjusts the timeout based on recent latency from nw it has seen in case of success, or linearly increases in case of timeout.
In the stress test network we see quite a few of nw timeouts from the sui transaction submission side, and we indeed see that control delay is adjusted accordingly to the timeouts and nw is not stuck.
However, it seems that current timeout adjustment logic drops timeout pretty quickly which causes timeouts again.
While this is not end of the world, it is noise and probably suboptimal.
It also can be noticed that current base timeout value is very close to the actual consensus delay, so the timeout has to be frequently adjusted. Therefore, a short-medium term fix (this PR) is just to bump the delay step for timeout calculations so that it is further away from median nw latency.
In the future we can improve timeout calculation by using weighted avg over a longer window instead of weighting current and next timing only.