Fix udp port check retry and check all udp ports #10385
Odd, a test started to fail exactly because of this change...
Thanks! Would you mind adding a new test for this around test_get_public_ip_addr() as well please? Clearly there's not enough test coverage here!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
fdd0d05 to 8f30ba5
@@ -1110,23 +1110,17 @@ pub fn main() {
    }

    if let Some(ref cluster_entrypoint) = cluster_entrypoint {
        let udp_sockets = [
            node.sockets.tpu.first(),
@mvines Well, this was broken and caused actual CI failures in this PR...
That's because we open many sockets on the same port, and there is no guarantee that the OS delivers the echoed-back packets to the first socket. We must recv from all of them.
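The point about receiving from every socket can be illustrated with a minimal sketch. This is not the code from the PR: `any_socket_received` is a hypothetical helper, and the standard library alone cannot bind several sockets to one port, so the sockets here are simply passed in by the caller. It only shows the idea of checking each socket in turn rather than trusting the first one.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Duration;

/// Hypothetical helper (not from the PR): returns Ok(true) as soon as any
/// socket in `sockets` receives a datagram. Each socket is given up to
/// `per_socket_timeout` before the next one is tried.
fn any_socket_received(
    sockets: &[UdpSocket],
    per_socket_timeout: Duration,
) -> std::io::Result<bool> {
    let mut buf = [0u8; 64];
    for socket in sockets {
        // set_read_timeout rejects a zero duration, so callers must pass > 0.
        socket.set_read_timeout(Some(per_socket_timeout))?;
        match socket.recv_from(&mut buf) {
            Ok(_) => return Ok(true),
            // A timeout surfaces as WouldBlock on Unix and TimedOut on Windows.
            Err(e) if e.kind() == ErrorKind::WouldBlock || e.kind() == ErrorKind::TimedOut => {
                continue
            }
            Err(e) => return Err(e),
        }
    }
    Ok(false)
}
```

Checking only `sockets[0]` would report a failure whenever the packet happened to land on another socket, which matches the flaky behaviour described above.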
ah, nice find
        let udp_sockets = [
            node.sockets.tpu.first(),
            /*
            Enable these ports when `IpEchoServerMessage` supports more than 4 UDP ports:
I've started to support these as well, as a side effect of addressing the sockets that share the same port.
For this, I only needed chunks().
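As a rough sketch of how chunks() can get around a fixed four-port message: the ports are split into groups of at most four, and each group would go into its own request. `MAX_PORTS_PER_MESSAGE` and `port_batches` are hypothetical names, and treating 0 as an unused slot is an assumption made for this illustration, not something confirmed by the diff.

```rust
/// Assumed per-message capacity, mirroring the `[u16; 4]` field in the diff.
const MAX_PORTS_PER_MESSAGE: usize = 4;

/// Hypothetical helper: split `ports` into fixed-size batches, padding the
/// last batch with 0 (assumed here to mean "unused slot").
fn port_batches(ports: &[u16]) -> Vec<[u16; MAX_PORTS_PER_MESSAGE]> {
    ports
        .chunks(MAX_PORTS_PER_MESSAGE)
        .map(|chunk| {
            let mut batch = [0u16; MAX_PORTS_PER_MESSAGE];
            // chunks() never yields more than MAX_PORTS_PER_MESSAGE items.
            batch[..chunk.len()].copy_from_slice(chunk);
            batch
        })
        .collect()
}
```

With this shape, the wire format keeps its fixed-size array and only the number of requests grows with the number of ports.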
awesome
            Err(err) => warn!("udp recv failure: {}", err),
        }
    });
    match receiver.recv_timeout(Duration::from_secs(5)) {
Well, I dunno why channel() is needed... I've simplified it by just using socket.set_read_timeout().
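A minimal sketch of that simplification, assuming a plain `std::net::UdpSocket`: the receive happens on the calling thread with a read timeout, so no thread or channel is needed. `wait_for_datagram` is a hypothetical name, and restoring the previous timeout afterwards is an extra choice made for this sketch, not something taken from the PR.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Duration;

/// Hypothetical helper: wait up to `timeout` (must be non-zero) for one
/// datagram on the calling thread, then restore the socket's old timeout.
fn wait_for_datagram(socket: &UdpSocket, timeout: Duration) -> std::io::Result<bool> {
    let previous = socket.read_timeout()?;
    socket.set_read_timeout(Some(timeout))?;
    let mut buf = [0u8; 64];
    let outcome = match socket.recv_from(&mut buf) {
        Ok(_) => Ok(true),
        // A timeout surfaces as WouldBlock on Unix and TimedOut on Windows.
        Err(e) if e.kind() == ErrorKind::WouldBlock || e.kind() == ErrorKind::TimedOut => {
            Ok(false)
        }
        Err(e) => Err(e),
    };
    // Put the timeout back so later users of the socket are unaffected.
    socket.set_read_timeout(previous)?;
    outcome
}
```

Because nothing keeps reading after the function returns, there is no background reader left to consume packets once the socket is put to real use.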
    let port = udp_socket.local_addr().unwrap().port();
    let udp_socket = udp_socket.try_clone().expect("Unable to clone udp socket");
    let (sender, receiver) = channel();
    std::thread::spawn(move || {
Well, not join()-ing is in bad taste. I'm pretty sure this either leaks a thread or reads actual data from the socket at an arbitrary time after the validator really starts to boot...
Oops, I'll fix this for TCP as well...
#[derive(Serialize, Deserialize, Default)]
pub(crate) struct IpEchoServerMessage {
    tcp_ports: [u16; 4], // Fixed size list of ports to avoid vec serde
Let's strive for not breaking ABI. ;)
looking much better than my previous code :)
validator/src/main.rs
        &node.sockets.repair,
        &node.sockets.serve_repair,
    ];
    udp_sockets.extend(node.sockets.tpu.iter().take(3));
oh no.
Codecov Report
@@ Coverage Diff @@
## master #10385 +/- ##
=========================================
- Coverage 81.7% 81.7% -0.1%
=========================================
Files 297 297
Lines 69981 70045 +64
=========================================
+ Hits 57210 57261 +51
- Misses 12771 12784 +13
        &node.sockets.repair,
        &node.sockets.serve_repair,
    ];
    udp_sockets.extend(node.sockets.tpu.iter());
I'm pretty sure there is a more elegant way to write this?
@mvines I've added tests and polished the impl a bit!!
Thanks for all the effort here, this is so much better now
* Don't start if udp port is really closed
* Fully check all udp ports
* Remove test code.......
* Add tests and adjust impl a bit
* Add comment
* Move comment a bit
* Move a bit
* clean ups
(cherry picked from commit a39df7e)
# Conflicts:
# validator/src/main.rs

* Don't start if udp port is really closed
* Fully check all udp ports
* Remove test code.......
* Add tests and adjust impl a bit
* Add comment
* Move comment a bit
* Move a bit
* clean ups
(cherry picked from commit a39df7e)
Problem
The validator starts even though some of its UDP ports are closed.
Also, it doesn't test all of its listening ports.
Summary of Changes
Actually abort with a failure after the maximum number of retries.
Also, test all the ports.
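The retry-then-abort behaviour can be sketched roughly as follows. `check_with_retries` is a hypothetical helper written for illustration; the real retry count, logging, and exit path in the validator are not shown in this page and are not reproduced here.

```rust
/// Hypothetical helper: run `check` up to `max_attempts` times and return
/// true on the first success, or false once every attempt has failed.
fn check_with_retries<F: FnMut() -> bool>(max_attempts: usize, mut check: F) -> bool {
    for attempt in 1..=max_attempts {
        if check() {
            return true;
        }
        eprintln!("port check attempt {}/{} failed", attempt, max_attempts);
    }
    false
}
```

A caller would treat a `false` result as fatal, for example by calling `std::process::exit(1)`, instead of logging the failure and continuing to boot.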
Context
Found via last-minute checking of #10209
Follow-up #10181, #10291.