Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix udp port check retry and check all udp ports #10385

Merged
merged 8 commits into from
Jun 14, 2020

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Jun 3, 2020

Problem

Validator starts nevertheless some of its udp port are closed....
Also, it doesn't test all of listening ports.

Summary of Changes

Really abort as failure after the maximum number of retries.
Also, test all the ports.

Context

Found via last-minute checking of #10209

Follow-up #10181, #10291.

@ryoqun ryoqun requested a review from mvines June 3, 2020 09:29
@ryoqun
Copy link
Member Author

ryoqun commented Jun 3, 2020

Odd, test started to fail exactly due to this change...

mvines
mvines previously approved these changes Jun 3, 2020
Copy link
Member

@mvines mvines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Would you mind adding a new test for this around test_get_public_ip_addr() as well please? Clearly there's not enough test coverage here!

@stale
Copy link

stale bot commented Jun 10, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 10, 2020
@ryoqun ryoqun force-pushed the no-start-with-closed-port branch from fdd0d05 to 8f30ba5 Compare June 11, 2020 16:13
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 11, 2020
@mergify mergify bot dismissed mvines’s stale review June 11, 2020 16:14

Pull request has been modified.

@@ -1110,23 +1110,17 @@ pub fn main() {
}

if let Some(ref cluster_entrypoint) = cluster_entrypoint {
let udp_sockets = [
node.sockets.tpu.first(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvines Well, this was broken and caused actual CI failures in this pr...
That's because we open many sockets on the same port and there is no guarantee os feeds the echo-backed packets into the first socket. We must recv from all of them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, nice find

let udp_sockets = [
node.sockets.tpu.first(),
/*
Enable these ports when `IpEchoServerMessage` supports more than 4 UDP ports:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started to support these as well as a side-effect of addressing same-port-shared sockets.

For this, I only needed just chunks().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome

Err(err) => warn!("udp recv failure: {}", err),
}
});
match receiver.recv_timeout(Duration::from_secs(5)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I dunno why channel() is needed.. I've simplified it just by socket.set_read_timeout()....

let port = udp_socket.local_addr().unwrap().port();
let udp_socket = udp_socket.try_clone().expect("Unable to clone udp socket");
let (sender, receiver) = channel();
std::thread::spawn(move || {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, not join()-ing is a bad taste. I'm pretty sure this leaks a thread or it reads an actual data from the socket at arbitrary time after the validator really start to boot...

Copy link
Member Author

@ryoqun ryoqun Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Oops, I'll fix this for tcp as well...

@ryoqun
Copy link
Member Author

ryoqun commented Jun 11, 2020

@mvines I wanted to solve the mysterious CI error and it turned out that the port checker is broken in another way by #10291...

I'll adjust the mentioned unit test later..

#[derive(Serialize, Deserialize, Default)]
pub(crate) struct IpEchoServerMessage {
tcp_ports: [u16; 4], // Fixed size list of ports to avoid vec serde
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's strive for not breaking ABI. ;)

Copy link
Member

@mvines mvines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking much better than my previous code :)

@@ -1110,23 +1110,17 @@ pub fn main() {
}

if let Some(ref cluster_entrypoint) = cluster_entrypoint {
let udp_sockets = [
node.sockets.tpu.first(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, nice find

let udp_sockets = [
node.sockets.tpu.first(),
/*
Enable these ports when `IpEchoServerMessage` supports more than 4 UDP ports:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome

@ryoqun ryoqun changed the title Don't start if udp port is really closed Fix udp port check retry and check all udp ports Jun 11, 2020
&node.sockets.repair,
&node.sockets.serve_repair,
];
udp_sockets.extend(node.sockets.tpu.iter().take(3));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh no.

@codecov
Copy link

codecov bot commented Jun 11, 2020

Codecov Report

Merging #10385 into master will decrease coverage by 0.0%.
The diff coverage is 95.0%.

@@            Coverage Diff            @@
##           master   #10385     +/-   ##
=========================================
- Coverage    81.7%    81.7%   -0.1%     
=========================================
  Files         297      297             
  Lines       69981    70045     +64     
=========================================
+ Hits        57210    57261     +51     
- Misses      12771    12784     +13     

&node.sockets.repair,
&node.sockets.serve_repair,
];
udp_sockets.extend(node.sockets.tpu.iter());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure there is more elegant way to write this?

@ryoqun ryoqun requested a review from mvines June 12, 2020 17:11
@ryoqun
Copy link
Member Author

ryoqun commented Jun 12, 2020

@mvines I've added tests and polished the impl a bit!!

@ryoqun
Copy link
Member Author

ryoqun commented Jun 13, 2020

image

Adding test made diff coverage improved from 80% to 95%. :)

Copy link
Member

@mvines mvines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the effort here, this is so much better now

@ryoqun ryoqun merged commit a39df7e into solana-labs:master Jun 14, 2020
mergify bot pushed a commit that referenced this pull request Jun 14, 2020
* Don't start if udp port is really closed

* Fully check all udp ports

* Remove test code.......

* Add tests and adjust impl a bit

* Add comment

* Move comment a bit

* Move a bit

* clean ups

(cherry picked from commit a39df7e)

# Conflicts:
#	validator/src/main.rs
mergify bot pushed a commit that referenced this pull request Jun 14, 2020
* Don't start if udp port is really closed

* Fully check all udp ports

* Remove test code.......

* Add tests and adjust impl a bit

* Add comment

* Move comment a bit

* Move a bit

* clean ups

(cherry picked from commit a39df7e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants