
Not Fired timers leak into the kernel #211

Closed
franz1981 opened this issue Aug 29, 2023 · 1 comment
Labels: bug, improvement
Comments

@franz1981
Contributor

franz1981 commented Aug 29, 2023

This has been reported by @pveentjer in a private conversation about io_uring Netty mechanics.

  1. schedule a timer for 2 minutes
  2. await until the IoUring event loop parks/waits
  3. schedule a second timer for 10 seconds, which replaces the last deadline set
  4. once it fires, the next in-line scheduled task is the previous 2-minute one (now <= 1m50s remaining, likely): the event loop arms a timer for this deadline (<= 1m50s) BUT such a timer was already armed in step 1, hence we arm it twice!

If we have a huge number of timers already registered far in the future, each of them will be registered twice whenever its deadline is replaced by a more recent one.
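The double-arming above can be modelled with a small sketch (hypothetical names; a simplified stand-in for the event loop's real deadline tracking, assuming the loop re-arms whenever the earliest deadline differs from the one it last armed, and never cancels the in-flight kernel timeout):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TimerLeakDemo {
    // User-scheduled deadlines, earliest first (simplified to plain seconds).
    final PriorityQueue<Long> deadlines = new PriorityQueue<>();
    // Every kernel timeout we armed that has not yet fired; entries that can
    // never fire usefully are the "leak" described in this issue.
    final List<Long> armedKernelTimeouts = new ArrayList<>();
    long prevDeadline = -1; // -1 == NONE

    void schedule(long deadline) {
        deadlines.add(deadline);
        park();
    }

    // Called before parking: arm a kernel timeout for the earliest deadline if
    // it differs from the one we believe is armed. The bug: the previously
    // armed timeout is still registered in the kernel, but we forget about it.
    void park() {
        Long next = deadlines.peek();
        if (next != null && next != prevDeadline) {
            armedKernelTimeouts.add(next); // submits an IORING_OP_TIMEOUT sqe
            prevDeadline = next;
        }
    }

    // A kernel timeout fired: run the expired timer, reset tracking, re-arm.
    void fire(long deadline) {
        armedKernelTimeouts.remove(deadline); // long boxes to Long -> remove(Object)
        deadlines.remove(deadline);
        prevDeadline = -1; // NONE
        park(); // re-arms for the next in-line deadline
    }

    public static void main(String[] args) {
        TimerLeakDemo loop = new TimerLeakDemo();
        loop.schedule(120); // step 1: timer in 2 minutes -> arms timeout(120)
        loop.schedule(10);  // step 3: timer in 10s -> arms timeout(10); timeout(120) still in the kernel!
        loop.fire(10);      // step 4: the 10s timeout fires; park() arms timeout(120) AGAIN
        System.out.println(loop.armedKernelTimeouts); // [120, 120]
    }
}
```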

Updating the timer in place (via IORING_TIMEOUT_UPDATE) requires kernel >= 5.11, while removing the existing one (when the new, more recent one needs to be armed) requires >= 5.5.
There are several ways to address this, including NOT addressing it, but it could still silently cause an OOM or worse (no idea really).
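A minimal sketch of the remove-before-arm approach, using the same hypothetical model as above (not the actual Netty code; the list stands in for timeouts registered in the kernel):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class FixedTimerDemo {
    final PriorityQueue<Long> deadlines = new PriorityQueue<>();
    final List<Long> armedKernelTimeouts = new ArrayList<>();
    long prevDeadline = -1; // -1 == NONE

    void schedule(long deadline) {
        deadlines.add(deadline);
        park();
    }

    void park() {
        Long next = deadlines.peek();
        if (next == null || next == prevDeadline) {
            return;
        }
        if (prevDeadline != -1) {
            // Cancel the stale in-flight timeout before arming the new one
            // (stands in for an IORING_OP_TIMEOUT_REMOVE sqe, kernel >= 5.5).
            // With IORING_TIMEOUT_UPDATE (kernel >= 5.11) this remove+add pair
            // would collapse into a single sqe.
            armedKernelTimeouts.remove(prevDeadline);
        }
        armedKernelTimeouts.add(next); // IORING_OP_TIMEOUT sqe
        prevDeadline = next;
    }

    void fire(long deadline) {
        armedKernelTimeouts.remove(deadline);
        deadlines.remove(deadline);
        prevDeadline = -1;
        park();
    }

    public static void main(String[] args) {
        FixedTimerDemo loop = new FixedTimerDemo();
        loop.schedule(120);
        loop.schedule(10); // removes the armed 120s timeout, arms a 10s one
        loop.fire(10);     // re-arms the 120s deadline exactly once
        System.out.println(loop.armedKernelTimeouts); // [120]
    }
}
```

With this invariant, at most one kernel timeout per event loop is ever registered, regardless of how often the earliest deadline is replaced.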

I believe we had no coverage for this because our await operations didn't allow any level of concurrency: if we block awaiting, it is a blocking operation, period. But if we request to be woken up in the future and we are woken up earlier, the same "in flight" request has not yet completed, which allows us to enqueue more and more of these.

@franz1981 franz1981 added bug Something isn't working improvement labels Aug 29, 2023
@franz1981
Contributor Author

@chrisvest @normanmaurer I think Netty 5 is affected in the same way

yawkat added a commit to yawkat/netty-incubator-transport-io_uring that referenced this issue Jan 24, 2024
I had a memory issue and @franz1981 suggested netty#211 as the cause. This patch is my fix for that bug, though I don't believe my mem issue was ultimately caused by this.

This PR does the legwork for adding ioringOpTimeoutRemove, and implementing a test. However two things can still be improved:

- [ ] could use IORING_TIMEOUT_UPDATE (see netty#211) to save one sqe.
- [ ] there may be a race in IOUringEventLoop between addTimeout and the IORING_OP_TIMEOUT handler. If the kernel fires a deadline cqe, then we send a deadline update sqe, and only then process the first cqe, prevDeadlineNanos ends up as NONE even though we've submitted a new deadline. I'm not sure this can actually happen, since deadline changes should only adjust the deadline downwards, not upwards? Not sure.
@normanmaurer normanmaurer added this to the 0.0.25.Final milestone Feb 19, 2024