drivers: nrfx: fix USB in endpoint data race #71306
Rename USB output endpoint buffer member of the endpoint context structure in preparation for the addition of input endpoint buffers. Signed-off-by: Keeley Hoek <[email protected]>
The function `usb_dc_ep_write()` races against its caller because it does not copy the passed data buffer as expected by its contract, and as the other drivers do. Thus the TX data may change from underneath the driver while it is pending.

Alongside the output endpoint RX buffers which already exist, we define an input endpoint TX buffer for each endpoint and copy the TX data into it before transmitting. The new buffer is protected by the same lock which prevents a write being issued while an existing write is in progress.

This bug was discovered on a Kinesis Adv360 keyboard running ZMK, and was observed to very reliably cause keys with a sufficiently high keycode (hence last to be transmitted) to be dropped. With Wireshark, two TX messages were recorded on each keypress (corresponding to key press and key release), but both messages contained the same contents (no keys pressed). Only a key press-release combination generated by a macro-like mode of ZMK is fast enough to trigger the bug. My proposed fix to ZMK, the PR zmkfirmware/zmk#2257, simply copies the data into a temporary buffer before the call and immediately fixed the problem.

This commit also fixes the bug now using a vanilla copy of ZMK, and has been tested to work on real hardware when backported to ZMK's Zephyr fork.

Closes zephyrproject-rtos#71299.

Signed-off-by: Keeley Hoek <[email protected]>
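The race described in the commit message can be reproduced in miniature without any hardware. The following is a minimal sketch in plain C, not the nrfx driver code: `struct fake_ep`, `write_nocopy()`, `write_copy()`, `transmit_first_byte()`, and `TX_BUF_SZ` are hypothetical names invented for illustration. It models a write call that returns before the data is actually read, and a caller that reuses its report buffer right away.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_BUF_SZ 8 /* hypothetical size, for illustration only */

/* Hypothetical stand-in for an endpoint context; not the driver's struct. */
struct fake_ep {
	const uint8_t *pending;    /* data the "hardware" will read later */
	uint8_t tx_buf[TX_BUF_SZ]; /* driver-owned staging buffer */
	size_t len;
};

/* Queue a write without copying: only the caller's pointer is remembered. */
static void write_nocopy(struct fake_ep *ep, const uint8_t *data, size_t len)
{
	ep->pending = data;
	ep->len = len;
}

/* Queue a write with a copy: later caller changes cannot reach the transfer. */
static void write_copy(struct fake_ep *ep, const uint8_t *data, size_t len)
{
	if (len > TX_BUF_SZ) {
		len = TX_BUF_SZ;
	}
	memcpy(ep->tx_buf, data, len);
	ep->pending = ep->tx_buf;
	ep->len = len;
}

/* Models the moment the data is actually read, after the write call returned. */
static uint8_t transmit_first_byte(const struct fake_ep *ep)
{
	return ep->pending[0];
}

/* Returns the byte that ends up "on the wire" for a press followed at once by a release. */
static uint8_t press_then_release(int copy)
{
	struct fake_ep ep = {0};
	uint8_t report[TX_BUF_SZ] = {0};

	report[0] = 0x04; /* key pressed */
	if (copy) {
		write_copy(&ep, report, sizeof(report));
	} else {
		write_nocopy(&ep, report, sizeof(report));
	}
	report[0] = 0x00; /* caller reuses the buffer for the release report */

	return transmit_first_byte(&ep);
}
```

With the pointer-only variant, `press_then_release(0)` yields the release byte even though the press was queued, which matches the "both messages contained no keys pressed" symptom. With the copying variant, `press_then_release(1)` yields the press byte.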
Hello @khoek, and thank you very much for your first pull request to the Zephyr project!
Thank you for the excellent PR and issue report. I do agree that a copy would absolutely eliminate the race. Unfortunately, the significant additional RAM usage will likely break many applications, and therefore I am hesitant about this solution.
The API does not really claim that it will copy the data (please point me to the exact place where it says so if I am wrong), but at the same time it does not explicitly require the caller to keep the buffer intact (which is what I would have inferred).
Would it be possible to change the fix to only affect `hid_int_ep_write()`, where you observed the issue? I think other callers are fine with the requirement for the buffer to stay intact. Doing it only for HID endpoints would significantly reduce the memory usage.
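To make the suggestion concrete, here is a rough sketch of what a copy confined to the HID write path could look like. It is not Zephyr code: `hid_write_copied()`, `fake_ep_write()`, `hid_tx_buf`, and `HID_REPORT_MAX` are hypothetical names, and the 64-byte size is an assumption. The point is only that a single staging buffer is needed for the HID IN endpoint, rather than one per IN endpoint in the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HID_REPORT_MAX 64 /* assumed maximum report size, illustration only */

/* One staging buffer for the HID IN endpoint, instead of one per IN endpoint. */
static uint8_t hid_tx_buf[HID_REPORT_MAX];

/* Hypothetical stand-in for a zero-copy driver write: it only keeps the pointer. */
static const uint8_t *queued_data;
static size_t queued_len;

static int fake_ep_write(const uint8_t *data, size_t len)
{
	queued_data = data;
	queued_len = len;
	return 0;
}

/*
 * Hypothetical class-level wrapper: copy the report, then queue the copy.
 * Assumes the previous report has already been transmitted; a real version
 * would need the same kind of in-progress protection the PR description mentions.
 */
static int hid_write_copied(const uint8_t *report, size_t len)
{
	if (len > sizeof(hid_tx_buf)) {
		return -1; /* reject rather than truncate */
	}
	memcpy(hid_tx_buf, report, len);
	return fake_ep_write(hid_tx_buf, len);
}

/* Example: queue a "key pressed" report, then let the caller clear its buffer. */
static uint8_t queued_after_caller_reuse(void)
{
	uint8_t report[8] = {0x04};

	hid_write_copied(report, sizeof(report));
	report[0] = 0x00; /* caller reuses its buffer right away */
	return queued_data[0];
}
```

In this sketch the queued data still holds the press byte after the caller has overwritten its own buffer, at the cost of `HID_REPORT_MAX` bytes of RAM for that one endpoint.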
if (data_len > EP_IN_BUF_MAX_SZ) {
	data_len = EP_IN_BUF_MAX_SZ;
}
memcpy(ep_ctx->in_buf, data, data_len);
This is problematic, because larger writes that previously worked (were transmitted in full) will be truncated without a clear indication to the user (while there is `ret_bytes`, the API allows passing NULL if the caller expects all bytes to be written). For this reason, this is a breaking (backwards-incompatible) change.
I see. That sounds like a serious problem.
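The truncation concern in this thread can be illustrated with a small stand-alone sketch. Everything here is hypothetical (`write_clamped()`, `write_strict()`, `reported_len()`, `IN_BUF_SZ`, and the 64-byte size are invented for the example); it only mirrors the shape of the clamping hunk and of a `ret_bytes` out-parameter that may be NULL.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IN_BUF_SZ 64 /* hypothetical staging-buffer size */

static uint8_t in_buf[IN_BUF_SZ];
static const uint8_t payload[128]; /* example data, larger than the buffer */

/* Clamping copy in the style of the reviewed hunk. */
static int write_clamped(const uint8_t *data, uint32_t len, uint32_t *ret_bytes)
{
	if (len > IN_BUF_SZ) {
		len = IN_BUF_SZ;
	}
	memcpy(in_buf, data, len);
	if (ret_bytes != NULL) {
		*ret_bytes = len; /* the only place a short write becomes visible */
	}
	return 0;
}

/*
 * One possible alternative (an assumption, not the PR's code): fail loudly
 * when a write would be shortened and the caller gave no way to report it.
 */
static int write_strict(const uint8_t *data, uint32_t len, uint32_t *ret_bytes)
{
	if (len > IN_BUF_SZ && ret_bytes == NULL) {
		return -1;
	}
	return write_clamped(data, len, ret_bytes);
}

/* Helper for the example: the length a clamped write reports for a request. */
static uint32_t reported_len(uint32_t len)
{
	uint32_t n = 0;

	write_clamped(payload, len, &n);
	return n;
}
```

A 100-byte `write_clamped()` call with a NULL `ret_bytes` returns success although only 64 bytes were staged, which is the silent truncation described above. The strict variant turns that case into an error, but it does not restore the old behaviour: writes larger than the buffer would still need chunking by someone.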
static uint8_t ep_in_bufs[CFG_EPIN_CNT][EP_IN_BUF_MAX_SZ]
	__aligned(sizeof(uint32_t));
This increases RAM usage quite significantly and will make many applications fail at build time due to RAM overflow (link failure due to too small RAM region).
Pending your other comment which sounds like a showstopper, could we perhaps allocate the buffers for each EP on first use?
That would require the use of dynamic memory allocation, which is another can of worms. I'd consider using the system heap to be even worse.
Thank you so much for your amazingly fast review. Modifying zephyr/subsys/usb/device/class/msc.c Lines 445 to 450 in 731f8d4
On the other hand: zephyr/subsys/usb/device/usb_transfer.c Line 239 in 731f8d4
usb_transfer_callback cb is invoked, given its approach to chunking.
Also, from my attempt at reading all of the other drivers, I believe all others copy the buffer, e.g. zephyr/drivers/usb/device/usb_dc_rpi_pico.c Lines 112 to 114 in 9c05618
zephyr/drivers/usb/device/usb_dc_stm32.c Lines 846 to 860 in 9c05618
Of course, if it is technically infeasible to make a change along these lines due to breaking existing users I understand. But there certainly is a problem to solve somewhere. (Another solution with no memory footprint would of course be to make the call synchronous, but I would imagine that could be a significant breaking change as well.)
Wouldn't just copying the data in
Note that the USB stack you modify has multiple design issues and patching it is going to be really hard. The drivers are of widely varying quality, and because the USB DC API is not perfectly clear, driver behavior varies (e.g. with regard to copy or no-copy). For that reason there is an effort to write a new USB stack (currently marked as experimental; HID is not yet merged to Zephyr main).
Yes it would at least fix my problem, and if you would accept such a patch I would happily supply it. Thanks for the lesson about the history and situation---I make no demands, and appreciate all of your help! :)
It is up to the driver (maybe also depending on the controller and HAL madness) to copy or not. At least the API says that "The supplied usb_ep_callback function will be called when data is transmitted out", which can be interpreted to mean that the caller may not modify the data until e.g. `in_ready_cb()` is called. Also, modifying `hid_int_ep_write()` does not make sense, because if the application is designed correctly, it would add an unnecessary buffer.
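Under that reading of the contract, the application itself keeps the report buffer untouched until the transfer-complete callback fires. A minimal sketch of that pattern follows, with hypothetical names throughout: `app_in_ready()` stands in for the application's IN-ready callback with a simplified signature, `fake_write()` stands in for a zero-copy endpoint write, and `send_key()` is an invented helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application state. */
static uint8_t report[8];
static bool in_flight;        /* true while the driver may still read `report` */
static const uint8_t *queued; /* what the fake driver was handed */

/* Hypothetical stand-in for a zero-copy endpoint write. */
static int fake_write(const uint8_t *data, size_t len)
{
	(void)len;
	queued = data;
	return 0;
}

/*
 * Stand-in for the application's IN-ready callback (simplified signature).
 * In real firmware this may run in another context, so the flag would need
 * an atomic variable or a semaphore rather than a plain bool.
 */
static void app_in_ready(void)
{
	in_flight = false;
}

/* Returns 0 if the report was queued, -1 if the previous one is still pending. */
static int send_key(uint8_t keycode)
{
	if (in_flight) {
		return -1; /* do not touch `report` yet */
	}
	report[0] = keycode;
	in_flight = true;
	return fake_write(report, sizeof(report));
}
```

Here a release report issued immediately after a press is refused until `app_in_ready()` has run, so the pressed keycode cannot be overwritten while it is pending. The caller then has to retry or queue the release itself, which is the extra application logic this interpretation implies.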
See #71306 (comment)
Then we should at least have proper samples in Zephyr. Having bad samples in Zephyr does not really support the "if the application is designed correctly".
Fully agreed on this. The samples are where many users (myself included) of Zephyr first look at how to do things with the APIs. They certainly shouldn't be complex to the point of scope creep, but should at a minimum show the correct way to properly use the API for real world use.
This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note that you can always re-open a closed pull request at any time.