Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(hid): Eliminate data race in USB pathway causing dropped keys #2257

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

khoek
Copy link
Contributor

@khoek khoek commented Apr 7, 2024

As I found out diagnosing #2253, there is a data race in the USB HID core on many platforms; the call to hid_int_ep_write(), a Zephyr function, doesn't always copy the passed buffer before beginning the USB transaction. I've only read a few of the HAL drivers, but e.g. nrfx and stm32 don't, while rpi_pico, native_posix, and kinetis do.

So the USB keycode report is changing underneath the HAL driver while, for example, a tapped &lt quickly activate-deactivates the key. I originally thought this had something to do with extended NKRO, but actually noone ever noticed for smaller keycode bits just because they were first and thus sent more quickly!

Amazing to me that this hadn't come up before! I guess there are many more bluetooth users, where it seems in practice that there is no such problem (though I haven't checked the code).

Tested working on a real Adv360.

Closes #2253

Comment on lines 23 to 37
#if IS_ENABLED(CONFIG_ZMK_USB_BOOT)
#define HID_BOOT_REPORT_SIZE sizeof(zmk_hid_boot_report_t)
#else
#define HID_BOOT_REPORT_SIZE 0
#endif

#if IS_ENABLED(CONFIG_ZMK_MOUSE)
#define HID_MOUSE_REPORT_SIZE sizeof(struct zmk_hid_mouse_report)
#else
#define HID_MOUSE_REPORT_SIZE 0
#endif

#define HID_EP_WRITE_BUF_SIZE \
MAX(MAX(HID_BOOT_REPORT_SIZE, sizeof(struct zmk_hid_keyboard_report)), \
MAX(sizeof(struct zmk_hid_consumer_report), HID_MOUSE_REPORT_SIZE))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative here is define the buffer size to be a certain fixed constant and then add compile time checks that the buffer is large enough to handle any possible response.

@petejohanson
Copy link
Contributor

Good catch! I am on the fence on your fix. In practice, the semaphore gating should properly prevent the issue with the buffer changing out from under the USB device driver, but I also share the concerns about the buffer sizing macros being... painful, but perhaps necessary.

Any chance you found any upstream guidance on whether the "copy early" or not approach in the drivers is expected? Accidental? Suggested "best practice" for handling?

@khoek
Copy link
Contributor Author

khoek commented Apr 10, 2024

@petejohanson Thanks so much for reviewing this so quickly, given how much activity the project has!

I'm likely saying something you totally already understand, but while the semaphore completely prevents another call to hid_int_ep_write() while a write is in progress, i.e. until Zephyr invokes the in_ready_cb() callback to tell us it's safe to call hid_int_ep_write() again, changes to the underlying buffer during the write are still not protected against. (i.e. the passed buffer pointer is literally the live bitmask of actively pressed keys, as the keyboard processes state changes the buffer will immediately update, and this is the cause of the problem.)

That said, deferring to your confidence that the current implementation is supposed to work, I just had a look at all of the Zephyr samples/ which invoke hid_int_ep_write(), and all but one have the same race condition (the exception not having it because it is too simple). So I now agree that this a bug in Zephyr.

The question is whether it is a bug in hid_int_ep_write() or usb_write(); since usb_write() is described as part of the "Low level API" in the documentation (https://docs.zephyrproject.org/latest/connectivity/usb/device/api/usb_device.html) and the HID code wraps things to make them considerably simpler and is more stateful, I'm inclined to believe it is hid_int_ep_write()'s fault. I'll to make a PR to Zephyr right away to see what the maintainers think.

In the meantime, since this bug means that really wired-USB key reports are getting through by luck, would you at all consider nonetheless implementing a fix along the lines described here temporarily while we depend on a buggy version of Zephyr?

An alternative less-intrusive workaround is that we could arrange that for now the semaphore just blocks the entire function zmk_usb_hid_send_report() until the send is complete, roughly speaking just by moving the semaphore. I don't think there would really be a latency penalty in practice since successive fast calls to zmk_usb_hid_send_report() (e.g. that caused by &lt here) still currently block anyway, just due to waiting on the hid_int_ep_write() callback.

Another option which solves the problem and also potentially improves the current implementation would be for usb_hid.c to maintain a message queue which it processes at its own pace. But this would be a bit more complex and a few more lines of code compared to e.g. the handful of the previous suggestion. And you would know better than I whether the additional complexity was totally unnecessary (there would certainly be a latency improvement, but would this mean much given the speed of current MCUs?).

Please let me know what you think. I'm just sad my F24 key doesn't work! :)

khoek added a commit to khoek/zephyr that referenced this pull request Apr 10, 2024
The function `usb_dc_ep_write()` races against its caller because it does
not copy the passed data buffer as expected by its contract, and as the
other drivers do. Thus the TX data may change from underneath the driver
while it is pending.

Alongside the output endpoint RX buffers which already exist, we define
an input endpoint TX buffer for each endpoint and copy into TX data into
it before transmitting. The new buffer is protected by the same lock
which prevents a write being issued while an existing write is in
progress.

This bug was discovered on a Kinesis Adv360 keyboard running ZMK, and
was observed to very reliably cause keys with a sufficiently high
keycode (hence last to be transmitted) to be dropped. With Wireshark two
TX messages were recorded on each keypress (corresponding to key press
and key release), but both messages contained the same contents (no keys
pressed). Only a key press-release combination generated by a macro-like
mode of ZMK is fast enough to trigger the bug. My proposed fix to ZMK,
the PR zmkfirmware/zmk#2257, simply copies the
data into a temporary buffer before the call and immediately fixed the
problem.

This commit also fixes the bug now using a vanilla copy of ZMK, and has
been tested to work on real hardware when backported to ZMK's Zephyr
fork.

Closes zephyrproject-rtos#71299.

Signed-off-by: Keeley Hoek <[email protected]>
@khoek
Copy link
Contributor Author

khoek commented Apr 10, 2024

FYI, after trying to carefully read all of the Zephyr USB drivers I believe the problem is localized to the nrfx family and have filed zephyrproject-rtos/zephyr#71306.

@khoek
Copy link
Contributor Author

khoek commented Apr 10, 2024

It seems that the current view is that Zephyr is unable to make the guarantee that hid_int_ep_write() copies, and that the examples should be changed to reflect that. In the background a new USB device stack API is being developed (including updated HID API) for the future.

So it appears to me that something along the lines of this PR will be necessary for the time being, at least for nrfx boards.

@petejohanson petejohanson self-assigned this Apr 11, 2024
@caksoylar caksoylar added core Core functionality/behavior of ZMK usb labels Apr 25, 2024
@khoek khoek force-pushed the usb_data_race branch 2 times, most recently from 999f443 to 0a435a1 Compare July 5, 2024 13:24
@khoek khoek closed this Jul 5, 2024
@khoek khoek reopened this Jul 5, 2024
@khoek khoek force-pushed the usb_data_race branch 3 times, most recently from 1e1ff22 to 43636ee Compare July 5, 2024 16:04
@khoek
Copy link
Contributor Author

khoek commented Jul 5, 2024

There's now a Kconfig flag guarding the workaround.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core functionality/behavior of ZMK usb
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NKRO breaks layer-tap with "extended" keycodes
3 participants