feat(iroh-net): allow the underlying UdpSockets to be rebound #2946

Merged (38 commits) on Nov 26, 2024

dignifiedquire (Contributor) commented Nov 18, 2024

Description

In order to handle suspension and exits on mobile, we need to rebind our UDP sockets when they break.

This PR adds the ability to rebind the socket on errors, and does so automatically on known suspension errors for iOS.

When reviewing this, please specifically look at the duration of lock holding, as this is the most sensitive part in this code.

Some references for these errors

TODOs

  • code cleanup
  • testing on actual iOS apps, to see if this actually fixes the issues
  • potentially handle the port still being in use? this needs some more thought

Closes #2939

Breaking Changes

The overall API for netmon::UdpSocket has changed entirely, everything else is the same.

Notes & open questions

  • I have tried putting this logic higher in the stack, but unfortunately that did not work out.
  • We might not want to infinitely rebind a socket if the same error happens over and over again, unclear how to handle this.
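
One possible way to bound repeated rebinds is a simple attempt budget per time window. This is only a sketch under assumptions: `RebindBudget`, its fields, and the limits used are hypothetical and not part of this PR, and the right policy remains an open question.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of this PR): caps rebind attempts per time window.
pub struct RebindBudget {
    max_attempts: u32,
    window: Duration,
    attempts: u32,
    window_start: Instant,
}

impl RebindBudget {
    pub fn new(max_attempts: u32, window: Duration, now: Instant) -> Self {
        Self { max_attempts, window, attempts: 0, window_start: now }
    }

    /// Returns true if another rebind may be attempted at `now`.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Start a fresh window once the previous one has elapsed.
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now;
            self.attempts = 0;
        }
        if self.attempts < self.max_attempts {
            self.attempts += 1;
            true
        } else {
            false
        }
    }
}
```

A caller would check `try_acquire` before each rebind and surface an error to the application once the budget is exhausted, instead of looping on the same failure.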

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

github-actions bot commented Nov 18, 2024

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2946/docs/iroh/

Last updated: 2024-11-26T17:16:36Z

net-tools/netwatch/src/udp.rs (outdated, resolved)
net-tools/netwatch/src/udp.rs (outdated, resolved)
}

/// Marks this socket as needing a rebind
pub fn mark_broken(&self) {
Contributor

mark_broken is not called from anywhere.

What's the reason this doesn't just set the option to none? Fear of deadlocks? I was going to look if this is a concern, but since it is not called I could not.

Contributor Author

this is all in progress, don't worry about the api too much for now

Contributor

I would also prefer to go with setting the Option to None, since it makes this less vulnerable to TOCTOU races.

Maybe there's something in the higher level calls that prevents setting is_broken in a way that is not synchronized with rebind (which is called after is_broken()), but since this is exposed as part of the public API, I think it would be better to have rebinding (the use part) and is_broken (the check part) as a single locked entity.

We currently use the Option only for drop, but we could simply do the fancy drop routine when there's actually a socket.

Contributor

Going a bit further, we could have an enum like

enum InnerSock {
	Active(tokio::net::UdpSocket),
	Rebind(std::net::SocketAddr),
}

This should use a teeny tiny bit less memory and does not require storing the address when it is not needed. There is also no need to think about races where being broken and actually rebinding are updated inconsistently, and it uses a single lock.
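
To make the suggestion concrete, here is a minimal sketch of the single-lock enum approach. It uses `std::net::UdpSocket` instead of the tokio socket to stay self-contained, and `EnumSocket` and its methods are hypothetical names, not the PR's API. Note that `mark_broken` has to take the write lock in this design.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::RwLock;

/// Mirrors the enum suggested above, with std's blocking socket for brevity.
enum InnerSock {
    Active(UdpSocket),
    Rebind(SocketAddr),
}

/// Hypothetical wrapper; names are illustrative only.
pub struct EnumSocket {
    inner: RwLock<InnerSock>,
}

impl EnumSocket {
    pub fn bind(addr: SocketAddr) -> io::Result<Self> {
        let socket = UdpSocket::bind(addr)?;
        Ok(Self { inner: RwLock::new(InnerSock::Active(socket)) })
    }

    /// Replaces the live socket with the address to rebind to.
    /// Overwriting the variant drops (closes) the old socket.
    pub fn mark_broken(&self) -> io::Result<()> {
        let mut guard = self.inner.write().unwrap();
        let addr = match &*guard {
            InnerSock::Active(socket) => socket.local_addr()?,
            InnerSock::Rebind(_) => return Ok(()),
        };
        *guard = InnerSock::Rebind(addr);
        Ok(())
    }

    /// Binds a fresh socket if, and only if, one is pending.
    pub fn rebind(&self) -> io::Result<()> {
        let mut guard = self.inner.write().unwrap();
        let addr = match &*guard {
            InnerSock::Rebind(addr) => *addr,
            InnerSock::Active(_) => return Ok(()),
        };
        *guard = InnerSock::Active(UdpSocket::bind(addr)?);
        Ok(())
    }

    pub fn local_addr(&self) -> io::Result<SocketAddr> {
        let guard = self.inner.read().unwrap();
        match &*guard {
            InnerSock::Active(socket) => socket.local_addr(),
            InnerSock::Rebind(_) => Err(io::Error::new(
                io::ErrorKind::NotConnected,
                "socket needs rebind",
            )),
        }
    }
}
```

Because the check and the state change happen under the same write lock, there is no window in which the socket is flagged as broken but still in use.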

Contributor Author

> Maybe there's something that prevents setting is_broken in a non synchronized way with rebind (which is called after is_broken()) in the higher level calls, but since this is exposed as part of the public API, I think it would be better to have rebinding (the use part) and is_broken (the check part) as a single locked entity.

I understand the desire, but mark_broken needs to work without taking a write lock, so it needs to be separated into an atomic or its own lock.
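
For comparison, a minimal sketch of the atomic-flag variant described in this reply. It again uses `std::net::UdpSocket` for self-containment; `FlaggedSocket`, `rebind_if_broken`, and the choice of `BrokenPipe` are assumptions for illustration, not the PR's actual code.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

/// Hypothetical wrapper; not the PR's actual type.
pub struct FlaggedSocket {
    socket: RwLock<Option<UdpSocket>>,
    addr: SocketAddr,
    is_broken: AtomicBool,
}

impl FlaggedSocket {
    pub fn bind(addr: SocketAddr) -> io::Result<Self> {
        let socket = UdpSocket::bind(addr)?;
        // Remember the concrete address (the OS picks the port if 0 was requested).
        let addr = socket.local_addr()?;
        Ok(Self {
            socket: RwLock::new(Some(socket)),
            addr,
            is_broken: AtomicBool::new(false),
        })
    }

    /// Only touches the atomic, so it can be called while a read guard is held.
    pub fn mark_broken(&self) {
        self.is_broken.store(true, Ordering::SeqCst);
    }

    /// Rebinds if the flag is set; returns whether a rebind happened.
    pub fn rebind_if_broken(&self) -> io::Result<bool> {
        let mut guard = self.socket.write().unwrap();
        // Check and clear the flag under the write lock so concurrent callers rebind once.
        if !self.is_broken.swap(false, Ordering::SeqCst) {
            return Ok(false);
        }
        // Drop the old socket first so the port is free again.
        drop(guard.take());
        match UdpSocket::bind(self.addr) {
            Ok(socket) => {
                *guard = Some(socket);
                Ok(true)
            }
            Err(err) => {
                // Keep the flag set so a later call can retry.
                self.is_broken.store(true, Ordering::SeqCst);
                Err(err)
            }
        }
    }

    pub fn local_addr(&self) -> io::Result<SocketAddr> {
        let guard = self.socket.read().unwrap();
        match guard.as_ref() {
            Some(socket) => socket.local_addr(),
            None => Err(io::Error::new(io::ErrorKind::BrokenPipe, "socket closed")),
        }
    }
}
```

The trade-off raised in the review still applies: between `mark_broken` and the next `rebind_if_broken`, other callers can still observe and use the old socket.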

github-actions bot commented Nov 19, 2024

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: f5270fb

Arqu (Collaborator) commented Nov 19, 2024

Current run

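| test | case | throughput_gbps | throughput_transfer |
|------|------|-----------------|---------------------|
| iroh | 1_to_1 | 0.65 | 1.05 |
| iroh | 1_to_3 | 1.46 | 2.14 |
| iroh | 1_to_5 | 2.88 | 4.96 |
| iroh | 1_to_10 | 4.31 | 5.62 |
| iroh | 2_to_2 | 1.05 | 1.61 |
| iroh | 2_to_4 | 2.03 | 3.16 |
| iroh | 2_to_6 | 3.19 | 4.86 |
| iroh | 2_to_10 | 4.58 | 6.48 |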

From the last merged PR

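| test | case | throughput_gbps | throughput_transfer |
|------|------|-----------------|---------------------|
| iroh | 1_to_1 | 0.55 | 0.83 |
| iroh | 1_to_3 | 1.27 | 1.76 |
| iroh | 1_to_5 | 2.12 | 2.94 |
| iroh | 1_to_10 | 3.70 | 4.66 |
| iroh | 2_to_2 | 0.56 | 0.70 |
| iroh | 2_to_4 | 1.13 | 1.41 |
| iroh | 2_to_6 | 1.99 | 2.65 |
| iroh | 2_to_10 | 3.34 | 4.46 |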

divagant-martian (Contributor) left a comment

Reviewed a small piece (what seemed more critical, I think?). In any case, I think this is the right approach. What we had before, "replacing" some Arc, was totally wrong.

net-tools/netwatch/src/udp.rs (outdated, resolved)
let mut guard = self.socket.write().unwrap();
{
let socket = guard.take().expect("not yet dropped");
drop(socket);
Contributor

I think we pay here the penalty of the slow drop of the previous socket?

Contributor Author

we do, but I think this needs to be sync because it is called in poll methods, and I don’t see a way around it

// update socket state
let new_state = self.io.with_socket(|socket| {
quinn_udp::UdpSocketState::new(quinn_udp::UdpSockRef::from(socket))
})??;
Contributor

If this fails the UdpConn is left in an unknown and unusable state right? But this takes &self so there is no real indicator that this is broken afterwards.

Contributor Author

there is? all the methods will return an error if this happens

warn!("failed to rebind socket: {:?}", err);
// TODO: improve error
let err =
std::io::Error::new(std::io::ErrorKind::NotConnected, err.to_string());
Contributor

I fear that quinn will swallow these errors and try to carry on.

Contributor

Perhaps rebind should directly return std::io::Error? This branch returns NotConnected while in try_send it returns BrokenPipe. Should these not be consistent?

Contributor Author

> I fear that quinn will swallow these errors and try to carry on.

really? that would suck

Contributor Author

> Perhaps rebind should directly return std::io::Error?

I thought about it, but that would be wrong: these are send and recv calls, so returning an error about binding would be... odd. That is why I want to return an error which means that the connection is broken.
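
One way to address the consistency concern raised in this thread (NotConnected in one branch, BrokenPipe in another) would be to route both through a single helper. This is a sketch only: `broken_socket_error` is a hypothetical function, and picking `NotConnected` over `BrokenPipe` here is an arbitrary assumption, not a decision made in this PR.

```rust
use std::fmt::Display;
use std::io;

/// Hypothetical helper: the one place that decides which ErrorKind
/// signals "the underlying socket is broken" to send/recv callers.
pub fn broken_socket_error(context: &str, source: impl Display) -> io::Error {
    // The kind is an assumption; the point is that every call site shares it.
    io::Error::new(io::ErrorKind::NotConnected, format!("{context}: {source}"))
}
```

Both the failed-rebind branch and the closed-socket branch in try_send could then call this helper, which also gives a natural home for a comment documenting why that error kind was chosen.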

count = meta.len / meta.stride,
dst = %meta.dst_ip.map(|x| x.to_string()).unwrap_or_default(),
"UDP recv"
);
Contributor

top marks for logging call! :)

net-tools/netwatch/src/udp.rs (outdated, resolved)
let guard = self.socket.read().unwrap();
let Some(socket) = guard.as_ref() else {
return Err(std::io::Error::new(
std::io::ErrorKind::BrokenPipe,
Contributor

how was this error chosen? i mean, why is this one the right error? might be worth documenting in the code.

Contributor Author

it was the closest to me that indicates that the connection is broken, would love to hear better suggestions.

@matheus23 matheus23 self-assigned this Nov 25, 2024
@dignifiedquire dignifiedquire changed the title [WIP] feat(iroh-net): allow the underlying UdpSockets to be rebound feat(iroh-net): allow the underlying UdpSockets to be rebound Nov 25, 2024
@dignifiedquire dignifiedquire marked this pull request as ready for review November 25, 2024 18:01
net-tools/netwatch/src/udp.rs (outdated, resolved)
net-tools/netwatch/src/udp.rs (outdated, resolved)
net-tools/netwatch/src/udp.rs (outdated, resolved)
net-tools/netwatch/src/udp.rs (outdated, resolved)
net-tools/netwatch/src/udp.rs (outdated, resolved)
let mut guard = self.socket.write().unwrap();
{
let Some(socket) = guard.take() else {
bail!("cannot rebind closed socket");
Contributor

should state the reason: socket already closed or no socket to rebind or something like that.

net-tools/netwatch/src/udp.rs (outdated, resolved)
}
Err(err) => {
warn!("failed to rebind socket: {:?}", err);
// TODO: improve error
Contributor

What still needs to be added to the error? I think this should be resolved before merging as otherwise it'll be here forever.

Contributor Author

agreed, I am just not sure what the best way here is

net-tools/netwatch/src/udp.rs (outdated, resolved)
iroh-net/src/magicsock/udp_conn.rs (resolved)
dignifiedquire (Contributor Author) commented

@flub cleaned up and DRYed the code a bit, based on your comments

@matheus23 matheus23 removed their assignment Nov 26, 2024
flub (Contributor) left a comment

There are still unresolved comments, but essentially this looks fine I think.

}

#[derive(Debug)]
enum SocketState {
Contributor

nice!

net-tools/netwatch/src/udp.rs (outdated, resolved)
matheus23 (Contributor) left a comment

Yay this is nice :shipit:
(Also make the formatter happy though: cargo make format)

Arqu (Collaborator) commented Nov 26, 2024

Current branch last run:

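| test | case | throughput_gbps | throughput_transfer |
|------|------|-----------------|---------------------|
| iroh | 1_to_1 | 1.37 | 1.37 |
| iroh | 1_to_3 | 4.42 | 4.42 |
| iroh | 1_to_5 | 7.20 | 7.21 |
| iroh | 1_to_10 | 11.86 | 11.88 |
| iroh | 2_to_2 | 3.03 | 3.04 |
| iroh | 2_to_4 | 5.89 | 5.90 |
| iroh | 2_to_6 | 8.38 | 8.39 |
| iroh | 2_to_10 | 13.55 | 13.57 |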

Current main:

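| test | case | throughput_gbps | throughput_transfer |
|------|------|-----------------|---------------------|
| iroh | 1_to_1 | 1.36 | 1.36 |
| iroh | 1_to_3 | 4.42 | 4.43 |
| iroh | 1_to_5 | 6.89 | 6.90 |
| iroh | 1_to_10 | 11.33 | 11.35 |
| iroh | 2_to_2 | 2.72 | 2.72 |
| iroh | 2_to_4 | 5.54 | 5.55 |
| iroh | 2_to_6 | 7.87 | 7.88 |
| iroh | 2_to_10 | 13.32 | 13.35 |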

Seems like it's about the same or a slight improvement.

@dignifiedquire dignifiedquire added this pull request to the merge queue Nov 26, 2024
Merged via the queue into main with commit cc9e4e6 Nov 26, 2024
26 of 27 checks passed
@dignifiedquire dignifiedquire deleted the feat-rebinding-socket branch November 28, 2024 10:10
Successfully merging this pull request may close these issues.

Handle socket failures
6 participants