Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ENOBUFS handling for unsolicited messages #2

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Tuetuopay
Copy link

This can happen when large burst of messages come all of a sudden, which happen very easily when routing protocols are involved (e.g. BGP). The current implementation incorrectly assumes that any failure to read from the socket is akin to the socket closed. This is not the case.

This adds handling for this specific error, which translates to a wrapper struct in the unsolicited messages stream: either a message, or an overrun. This lets applications handle best for their usecase such event: either resync because messages are lost, or do nothing if the listening is informational only (e.g. logging).

This is a direct port of little-dude/netlink#293.

src/framed.rs Outdated Show resolved Hide resolved
@cathay4t
Copy link
Member

cathay4t commented Oct 24, 2022

@Tuetuopay The kernel is treating it as an error. With the changes you suggested, almost every existing rust-netlink crate needs to be changed and their user also just for a rare use case. The rust has a saying: you do not need to pay for what you will not use. Clearly your patch is against this design.

My suggestion is, raise proper error like OverRunError, for any user who cares, they can handle this error by a controlled retry.

@Tuetuopay
Copy link
Author

@Tuetuopay The kernel is treating it as an error. With the changes you suggested, almost every existing rust-netlink crate needs to be changed and their user also just for a rare use case. The rust has a saying: you do not need to pay for what you will not use. Clearly your patch is against this design.

My suggestion is, raise proper error like OverRunError, for any user who cares, they can handle this error by a controlled retry.

@cathay4t Indeed this is more than a simple API breaking change. What would you suggest?

  • change the stream to yield Result<NetlinkPayload, Error> with the error only being an overrun
  • yield NetlinkPayload::Overrun(_) which I thought about originally, but it'd conflate an actual overrun notification received from netlink in a message and an overrun from the socket itself. That's my preferred option.

However I don't completely agree that not everyone needs to implement it. One of the strengths of Rust is exposing you to failure cases through the type system: you have to explicitly ignore the error case, while others languages makes you explicitly handle the error case.

@cathay4t
Copy link
Member

cathay4t commented Nov 4, 2022

I prefer NetlinkPayload::Overrun way.
Meanwhile, the existing user facing crate(e.g. rtnetlink) might need to amend to support this in their macros.rs to convert NetlinkPayload::Overrun to Err(rtnetlink::Error::Overrun). So user of rtnetlink could ignore this specific error.
To even better, we can allow user to say AddressGetRequest::set_auto_retry_on_overrun(true) or AddressGetRequest::set_ignore_overrun(true) to make the user's life eaiser.

@Tuetuopay
Copy link
Author

Tuetuopay commented Dec 14, 2022

I prefer NetlinkPayload::Overrun way.

Sure! Implementing this.

Meanwhile, the existing user facing crate(e.g. rtnetlink) might need to amend to support this in their macros.rs to convert NetlinkPayload::Overrun to Err(rtnetlink::Error::Overrun). So user of rtnetlink could ignore this specific error. To even better, we can allow user to say AddressGetRequest::set_auto_retry_on_overrun(true) or AddressGetRequest::set_ignore_overrun(true) to make the user's life eaiser.

Okay there is a deep misunderstanding of the issue here. This does not fixes issues with request/reply mode (as suggested by AddressGetRequest::), but with subscription mode. In this mode, the kernel sends notifications as fast as it can, without waiting. This does not happen in regular request/response mode because it will wait for consumers to ack messages before sending more.

Basically we can see regular request/response mode as TCP and multicast mode as UDP, if you squint hard enough.

As for allowing the user to ignore those errors, they can already be disabled when opening the stream socket with Socket::set_no_enobufs.

ps: sorry for the very late response, I did not have time to get back to this before.

@Tuetuopay
Copy link
Author

Hi @cathay4t!

I pushed a fully new implementation. This is what was suggested: generate actual NetlinkMessages with a payload of NetlinkPayload::Overrun. This way, users can continue ignoring the error while people that mind (e.g. BGP people) can properly handle the error.

On a related note, I saw rust-netlink/netlink-packet-core#7 and it can indeed be done as I do not use the struct anymore in the new version.

I hope this one is more to your taste :)

@cathay4t
Copy link
Member

@Tuetuopay Yes. The new patch looks good. Please rebase it then we are OK to merge.

Do you have plan to change rtnetlink try_rtnl! for this?

@cathay4t
Copy link
Member

The src/connection.rs has Overrun(_) => unimplemented!("overrun is not handled yet") , can you also patch it?

@yshui
Copy link

yshui commented Sep 23, 2024

I am also hitting the ENOBUFS problem. Did @Tuetuopay lost interests in this PR? I can try to take over as this doesn't look too complex.

@Tuetuopay
Copy link
Author

I am also hitting the ENOBUFS problem. Did @Tuetuopay lost interests in this PR? I can try to take over as this doesn't look too complex.

Pretty much, I stopped using it for the usecase I was hitting. I may use this library again in the future though.

I'll see to address the comments, but if I don't get time for this feel free to pick it up :)

This can happen when large burst of messages come all of a sudden, which
happen very easily when routing protocols are involved (e.g. BGP). The
current implementation incorrectly assumes that any failure to read from
the socket is akin to the socket closed. This is not the case.

This commit adds handling for this specific error by generating a
`NetlinkPayload::Overrun(_)` message that users receive on their
unsolicited message channel. Since this is just an additional message,
there is no breaking change for existing users and they are free to
ignore it if they do not want to handle it, or handle it by e.g. resyncing.
@Tuetuopay
Copy link
Author

Tuetuopay commented Sep 24, 2024

The src/connection.rs has Overrun(_) => unimplemented!("overrun is not handled yet") , can you also patch it?

@cathay4t Well, about that, it was left as unimplemented on purpose. You will get ENOBUFS on the unsolicited message channel, not on the regular request/response channel. Why? Because when using netlink in request/response mode, there is an ack mechanism between the kernel and userspace, precisely to avoid this issue. (well, unless the buffer sizes are set stupidly low, but in such case, this is a non-recoverable error).

I can set it as a fatal error, and make it a break from the forward_responses loop to exit the whole netlink system. Less violent than the unimplemented!. Your call :)

Btw, rebased on master.

@Tuetuopay
Copy link
Author

@Tuetuopay Yes. The new patch looks good. Please rebase it then we are OK to merge.

Do you have plan to change rtnetlink try_rtnl! for this?

Just remembered this. Well, no, for the same reason I left the unimplemented: this is not supposed to happen in a request/response scenario. It really is an unexpected thing to get in req/res.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants