Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers: nrfx: fix USB in endpoint data race #71306

Closed
wants to merge 2 commits into from

Conversation

khoek
Copy link

@khoek khoek commented Apr 10, 2024

The function usb_dc_ep_write() races against its caller because it does not copy the passed data buffer as expected by its contract, and as the other drivers do. Thus the TX data may change from underneath the driver while it is pending.

Alongside the output endpoint RX buffers which already exist, we define an input endpoint TX buffer for each endpoint and copy into TX data into it before transmitting. The new buffer is protected by the same lock which prevents a write being issued while an existing write is in progress.

This bug was discovered on a Kinesis Adv360 keyboard running ZMK, and was observed to very reliably cause keys with a sufficiently high keycode (hence last to be transmitted) to be dropped. With Wireshark two TX messages were recorded on each keypress (corresponding to key press and key release), but both messages contained the same contents (no keys pressed). Only a key press-release combination generated by a macro-like mode of ZMK is fast enough to trigger the bug. My proposed fix to ZMK, the PR zmkfirmware/zmk#2257, simply copies the data into a temporary buffer before the call and immediately fixed the problem.

This commit also fixes the bug now using a vanilla copy of ZMK, and has been tested to work on real hardware when backported to ZMK's Zephyr fork.

Closes #71299.

Signed-off-by: Keeley Hoek [email protected]

khoek added 2 commits April 10, 2024 03:13
Rename USB output endpoint buffer member of the endpoint context structure
in preparation for the addition of input endpoint buffers.

Signed-off-by: Keeley Hoek <[email protected]>
The function `usb_dc_ep_write()` races against its caller because it does
not copy the passed data buffer as expected by its contract, and as the
other drivers do. Thus the TX data may change from underneath the driver
while it is pending.

Alongside the output endpoint RX buffers which already exist, we define
an input endpoint TX buffer for each endpoint and copy into TX data into
it before transmitting. The new buffer is protected by the same lock
which prevents a write being issued while an existing write is in
progress.

This bug was discovered on a Kinesis Adv360 keyboard running ZMK, and
was observed to very reliably cause keys with a sufficiently high
keycode (hence last to be transmitted) to be dropped. With Wireshark two
TX messages were recorded on each keypress (corresponding to key press
and key release), but both messages contained the same contents (no keys
pressed). Only a key press-release combination generated by a macro-like
mode of ZMK is fast enough to trigger the bug. My proposed fix to ZMK,
the PR zmkfirmware/zmk#2257, simply copies the
data into a temporary buffer before the call and immediately fixed the
problem.

This commit also fixes the bug now using a vanilla copy of ZMK, and has
been tested to work on real hardware when backported to ZMK's Zephyr
fork.

Closes zephyrproject-rtos#71299.

Signed-off-by: Keeley Hoek <[email protected]>
@zephyrbot zephyrbot added the area: USB Universal Serial Bus label Apr 10, 2024
Copy link

Hello @khoek, and thank you very much for your first pull request to the Zephyr project!
Our Continuous Integration pipeline will execute a series of checks on your Pull Request commit messages and code, and you are expected to address any failures by updating the PR. Please take a look at our commit message guidelines to find out how to format your commit messages, and at our contribution workflow to understand how to update your Pull Request. If you haven't already, please make sure to review the project's Contributor Expectations and update (by amending and force-pushing the commits) your pull request if necessary.
If you are stuck or need help please join us on Discord and ask your question there. Additionally, you can escalate the review when applicable. 😊

Copy link
Contributor

@tmon-nordic tmon-nordic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent PR and issue report. I do agree that copy would absolutely eliminate the race. Unfortunately the significant additional RAM usage will likely break many applications and therefore I am hesistant about this solution.

The API is not really claiming that it will copy the data (please provide the exact place where it says so if I am wrong) but at the same time does not restrain caller to keep the buffer intact (this is what I would have implied).

Would it be possible to change the fix to only affect hid_int_ep_write() where you observed the issue? I think other callers are fine with the requirements for buffer to stay intact. Doing it only for HID endpoints would significantly reduce the memory usage.

Comment on lines +1704 to +1707
if (data_len > EP_IN_BUF_MAX_SZ) {
data_len = EP_IN_BUF_MAX_SZ;
}
memcpy(ep_ctx->in_buf, data, data_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic, because larger writes that previously worked (were transmitted in full) will be truncated without clear indication to the user (while there is ret_bytes, it allows passing NULL if the caller expects all bytes to be written). Due to this reason this is a breaking change (backwards incompatible).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That sounds like a serious problem.

Comment on lines +215 to +216
static uint8_t ep_in_bufs[CFG_EPIN_CNT][EP_IN_BUF_MAX_SZ]
__aligned(sizeof(uint32_t));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This increases RAM usage quite significantly and will make many applications fail at build time due to RAM overflow (link failure due to too small RAM region).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending your other comment which sounds like a showstopper, could we perhaps allocate the buffers for each EP on first use?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require use of dynamic memory allocation, which is another can of worms. I'd consider using system heap as even worse.

@khoek
Copy link
Author

khoek commented Apr 10, 2024

Thank you so much for your amazingly fast review. Modifying hid_int_ep_write() instead was exactly my first idea too, but the fact that all of the USB samples usb/hid-mouse, usb/hid-cdc, and sensor/fxos8700-hid would still be broken steered me away from this path (all do not protect the data buffer after calling hid_int_ep_write()---note that the semaphore in sensor/fxos8700-hid is not protecting the data buffer.) The only sample which does not have this issue (usb/hid) is because it is too simple and uses a constant buffer. In addition to these and usb_dc_ep_write() in usb/hid/core.c, I also found other files such as

uint8_t capacity[8];
sys_put_be32(block_count - 1, &capacity[0]);
sys_put_be32(BLOCK_SIZE, &capacity[4]);
return write(capacity, sizeof(capacity));
which build a bunch of buffers on the stack and would also be broken.

On the other hand:

trans->buffer = data;
appears to me to also expect that the buffer is not touched until the application finds out it is done when the passed usb_transfer_callback cb is invoked, given its approach to chunking.

Also, from my attempt at reading all of the other drivers, I believe all others copy the buffer, e.g.

if (data) {
memcpy(ep_state->buf, data, len);
}
or are synchronous, e.g.
status = HAL_PCD_EP_Transmit(&usb_dc_stm32_state.pcd, ep,
(void *)data, len);
if (status != HAL_OK) {
LOG_ERR("HAL_PCD_EP_Transmit failed(0x%02x), %d", ep,
(int)status);
k_sem_give(&ep_state->write_sem);
ret = -EIO;
}
if (!ret && ep == EP0_IN && len > 0) {
/* Wait for an empty package as from the host.
* This also flushes the TX FIFO to the host.
*/
usb_dc_ep_start_read(ep, NULL, 0);
}

Of course, if it is technically infeasible to make a change along these lines due to breaking existing users I understand. But there certainly is a problem to solve somewhere. (Another solution with no memory footprint would of course be to make the call synchronous, but I would imagine that could be a significant breaking change as well.)

@tmon-nordic
Copy link
Contributor

Wouldn't just copying the data in hid_int_ep_write() to internal HID instance report buffer be sufficient to fix it?

Note that the USB stack you modify has multiple design issues and patching it is going to be really hard. The drivers are of much varying quality and due to USB DC API not being perfectly clear driver behavior varies (e.g. with regard to copy or no-copy). For that reason there is an effort to write new USB stack (currently marked as experimental; HID is not yet merged to Zephyr main).

@khoek
Copy link
Author

khoek commented Apr 10, 2024

Yes it would at least fix my problem, and if you would accept such a patch I would happily supply it.

Thanks for the lesson about the history and situation---I make no demands, and appreciate all of your help! :)

@jfischer-no
Copy link
Collaborator

The function usb_dc_ep_write() races against its caller because it does not copy the passed data buffer as expected by its contract, and as the other drivers do. Thus the TX data may change from underneath the driver while it is pending.

It is up to the driver (maybe also depending on the controller and HAL madness) to copy or not, at least the API says that "The supplied usb_ep_callback function will be called when data is transmitted out", which can be interpreted as the caller may not modify the data until e.g. in_ready_cb() is called. Also, modifying hid_int_ep_write() does not make sense, because if the application is designed correctly, it would add an unnecessary buffer.

Copy link
Collaborator

@jfischer-no jfischer-no left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmon-nordic
Copy link
Contributor

Also, modifying hid_int_ep_write() does not make sense, because if the application is designed correctly, it would add an unnecessary buffer.

Then we should at least have proper samples in Zephyr. Having bad samples in Zephyr does not really support the "if the application is designed correctly".

@petejohanson
Copy link
Contributor

Also, modifying hid_int_ep_write() does not make sense, because if the application is designed correctly, it would add an unnecessary buffer.

Then we should at least have proper samples in Zephyr. Having bad samples in Zephyr does not really support the "if the application is designed correctly".

Fully agreed on this. The samples are where many users (myself included) of Zephyr first look at how to do things with the APIs. They certainly shouldn't be complex to the point of scope creep, but should at a minimum show the correct way to properly use the API for real world use.

Copy link

This pull request has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 14 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Aug 10, 2024
@github-actions github-actions bot closed this Aug 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: USB Universal Serial Bus Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

USB core has a data race during write on many platforms
5 participants