
Not Fired timers leak into the kernel #211

Closed
franz1981 opened this issue Aug 29, 2023 · 1 comment
Labels: bug, improvement
Comments

@franz1981
Contributor

franz1981 commented Aug 29, 2023

This has been reported by @pveentjer in a private conversation about io_uring Netty mechanics.

  1. schedule a timer for 2 minutes
  2. await until the IoUring event loop parks/waits
  3. schedule a second timer for 10 seconds, which replaces the last deadline set
  4. once it fires, the next in-line scheduled task is the previous 2-minute one (now <= 1m50s remaining, likely): the event loop arms a timer for this deadline (<= 1m50s) BUT such a timer was already armed in step 1, hence we arm it twice!

If we have a huge number of timers already registered far in the future, each of them will be registered twice whenever its deadline is replaced by a more recent one.
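The double-arming above can be modelled with a small sketch (hypothetical names; a simplified stand-in for the event loop's real deadline tracking, assuming the loop re-arms whenever the earliest deadline differs from the one it last armed, and never cancels the in-flight kernel timeout):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TimerLeakDemo {
    // User-scheduled deadlines, earliest first (simplified to plain seconds).
    final PriorityQueue<Long> deadlines = new PriorityQueue<>();
    // Every kernel timeout we armed that has not yet fired; entries that can
    // never fire usefully are the "leak" described in this issue.
    final List<Long> armedKernelTimeouts = new ArrayList<>();
    long prevDeadline = -1; // -1 == NONE

    void schedule(long deadline) {
        deadlines.add(deadline);
        park();
    }

    // Called before parking: arm a kernel timeout for the earliest deadline if
    // it differs from the one we believe is armed. The bug: the previously
    // armed timeout is still registered in the kernel, but we forget about it.
    void park() {
        Long next = deadlines.peek();
        if (next != null && next != prevDeadline) {
            armedKernelTimeouts.add(next); // submits an IORING_OP_TIMEOUT sqe
            prevDeadline = next;
        }
    }

    // A kernel timeout fired: run the expired timer, reset tracking, re-arm.
    void fire(long deadline) {
        armedKernelTimeouts.remove(deadline); // long boxes to Long -> remove(Object)
        deadlines.remove(deadline);
        prevDeadline = -1; // NONE
        park(); // re-arms for the next in-line deadline
    }

    public static void main(String[] args) {
        TimerLeakDemo loop = new TimerLeakDemo();
        loop.schedule(120); // step 1: timer in 2 minutes -> arms timeout(120)
        loop.schedule(10);  // step 3: timer in 10s -> arms timeout(10); timeout(120) still in the kernel!
        loop.fire(10);      // step 4: the 10s timeout fires; park() arms timeout(120) AGAIN
        System.out.println(loop.armedKernelTimeouts); // [120, 120]
    }
}
```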

Updating the timer in place (via IORING_TIMEOUT_UPDATE) requires kernel >= 5.11, while removing the existing one (when the new, more recent one needs to be armed) requires >= 5.5.
There are several ways to address this, including NOT addressing it, but it could still silently cause an OOM or worse (no idea really).
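A minimal sketch of the remove-before-arm approach, using the same hypothetical model as above (not the actual Netty code; the list stands in for timeouts registered in the kernel):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class FixedTimerDemo {
    final PriorityQueue<Long> deadlines = new PriorityQueue<>();
    final List<Long> armedKernelTimeouts = new ArrayList<>();
    long prevDeadline = -1; // -1 == NONE

    void schedule(long deadline) {
        deadlines.add(deadline);
        park();
    }

    void park() {
        Long next = deadlines.peek();
        if (next == null || next == prevDeadline) {
            return;
        }
        if (prevDeadline != -1) {
            // Cancel the stale in-flight timeout before arming the new one
            // (stands in for an IORING_OP_TIMEOUT_REMOVE sqe, kernel >= 5.5).
            // With IORING_TIMEOUT_UPDATE (kernel >= 5.11) this remove+add pair
            // would collapse into a single sqe.
            armedKernelTimeouts.remove(prevDeadline);
        }
        armedKernelTimeouts.add(next); // IORING_OP_TIMEOUT sqe
        prevDeadline = next;
    }

    void fire(long deadline) {
        armedKernelTimeouts.remove(deadline);
        deadlines.remove(deadline);
        prevDeadline = -1;
        park();
    }

    public static void main(String[] args) {
        FixedTimerDemo loop = new FixedTimerDemo();
        loop.schedule(120);
        loop.schedule(10); // removes the armed 120s timeout, arms a 10s one
        loop.fire(10);     // re-arms the 120s deadline exactly once
        System.out.println(loop.armedKernelTimeouts); // [120]
    }
}
```

With this invariant, at most one kernel timeout per event loop is ever registered, regardless of how often the earliest deadline is replaced.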

I believe we had no coverage for this because our await operations didn't allow any level of concurrency: if we block awaiting, it is a blocking operation, period. But if we request to be woken up in the future and we are woken up earlier, the same "in flight" request has not yet completed, which allows us to enqueue more and more of these.

@franz1981 franz1981 added bug Something isn't working improvement labels Aug 29, 2023
@franz1981
Contributor Author

@chrisvest @normanmaurer I think Netty 5 is affected in the same way

yawkat added a commit to yawkat/netty-incubator-transport-io_uring that referenced this issue Jan 24, 2024
I had a memory issue and @franz1981 suggested netty#211 as the cause. This patch is my fix for that bug, though I don't believe my mem issue was ultimately caused by this.

This PR does the legwork for adding ioringOpTimeoutRemove, and implementing a test. However two things can still be improved:

- [ ] could use IORING_TIMEOUT_UPDATE (see netty#211) to save one sqe.
- [ ] there may be a race in IOUringEventLoop between addTimeout and the IORING_OP_TIMEOUT handler. If the kernel fires a deadline cqe, then we send a deadline update sqe, and only then process the first cqe, prevDeadlineNanos ends up as NONE even though we've submitted a new deadline. I'm not sure this can actually happen, since deadline changes should only adjust the deadline downwards, not upwards? Not sure.
@normanmaurer normanmaurer added this to the 0.0.25.Final milestone Feb 19, 2024