Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth: Fix deadlock issues when running low on buffers #17137

Merged
merged 2 commits into from
Jul 1, 2019

Conversation

jhedberg
Copy link
Member

The first patch is from #16870. The second patch is what I think is a better approach to handling lots of discardable events hogging up buffers that may be needed for other more important events. By allocating these from a separate pool we make sure not to starve the more important buffer pools.

Johan Hedberg (2):
      Bluetooth: Add dedicated pool for HCI_Num_Completed_Packets HCI event
      Bluetooth: Introduce separate pool for discardable events

 drivers/bluetooth/hci/h4.c                         | 11 ++---
 drivers/bluetooth/hci/h5.c                         | 11 +----
 drivers/bluetooth/hci/ipm_stm32wb.c                |  6 +--
 drivers/bluetooth/hci/spi.c                        |  7 +--
 drivers/bluetooth/hci/userchan.c                   | 11 ++---
 include/bluetooth/buf.h                            | 13 ++++++
 include/bluetooth/hci.h                            |  1 +
 samples/bluetooth/mesh/prj.conf                    |  1 -
 samples/bluetooth/mesh/prj_bbc_microbit.conf       |  2 +
 samples/bluetooth/mesh_demo/prj.conf               |  1 -
 samples/boards/nrf52/mesh/onoff-app/prj.conf       |  1 -
 .../mesh/onoff_level_lighting_vnd_app/prj.conf     |  1 -
 samples/boards/reel_board/mesh_badge/prj.conf      |  1 -
 subsys/bluetooth/controller/hci/hci.c              |  4 +-
 subsys/bluetooth/controller/hci/hci_driver.c       |  6 ++-
 subsys/bluetooth/host/Kconfig                      | 21 +++++++++
 subsys/bluetooth/host/hci_core.c                   | 54 ++++++++++++++++++++++
 subsys/bluetooth/host/hci_ecc.c                    |  2 +-
 subsys/bluetooth/host/hci_raw.c                    | 12 +++++
 tests/bluetooth/hci_prop_evt/src/main.c            |  2 +-
 tests/bluetooth/mesh/dbg.conf                      |  1 -
 tests/bluetooth/mesh/friend.conf                   |  1 -
 tests/bluetooth/mesh/gatt.conf                     |  1 -
 tests/bluetooth/mesh/lpn.conf                      |  1 -
 tests/bluetooth/mesh/pb_gatt.conf                  |  1 -
 tests/bluetooth/mesh/prj.conf                      |  1 -
 tests/bluetooth/mesh/proxy.conf                    |  1 -
 tests/bluetooth/mesh_shell/prj.conf                |  1 -
 tests/bluetooth/shell/mesh.conf                    |  1 -
 29 files changed, 121 insertions(+), 56 deletions(-)

Fixes #16864

@jhedberg jhedberg added bug The issue is a bug, or the PR is fixing a bug area: Bluetooth DNM This PR should not be merged (Do Not Merge) backport v1.14-branch labels Jun 28, 2019
@jhedberg jhedberg requested a review from carlescufi June 28, 2019 10:38
@zephyrbot zephyrbot added area: API Changes to public APIs area: Samples Samples area: Tests Issues related to a particular existing or missing test labels Jun 28, 2019
@jhedberg jhedberg force-pushed the evt_pools branch 2 times, most recently from 2ed3edb to 777b08c Compare June 28, 2019 10:41
@zephyrproject-rtos zephyrproject-rtos deleted a comment from zephyrbot Jun 28, 2019
@zephyrbot

This comment has been minimized.

* @return A new buffer.
*/
struct net_buf *bt_buf_get_evt(u8_t evt, s32_t timeout);
struct net_buf *bt_buf_get_evt(u8_t evt, bool discardable, s32_t timeout);
Copy link
Contributor

@Vudentz Vudentz Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the K_NO_WAIT would discard anyway so Im not sure why we need another variable besides that is completely ignored when CONFIG_BT_DISCARDABLE_POOL is not set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vudentz those are not quite the same. A HCI driver might be calling this from an ISR for a non-discardable event, in which case K_NO_WAIT is the right timeout to pass. E.g. the h4.c driver does this, and then if that fails it defers the further allocation attempts to the RX thread.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I though would use the new pool as well but it seems they are exclusive to each type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm all for eliminating redundancy in parameters, but in this case they're a bit orthogonal and we can't make too many assumptions about getting a K_NO_WAIT.

net_buf_reserve(buf, CONFIG_BT_HCI_RESERVE);
bt_buf_set_type(buf, BT_BUF_EVT);
}

Copy link
Contributor

@Vudentz Vudentz Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should log something here if we cannot allocate a buffer and thus would discard the event.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HCI drivers (at least the H4 one) have the logs for when they discard events.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the LOG domain that may not be printed though and since this is behind a config option it may be difficult to analize traces so I don't think it would hurt to say we are dropping here, it doesn't need to be an error/warning though since since it been flagged as discardable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll add a log, it's not quite dropping that's happening here, rather just a failed allocation. It's then up to the driver to deal with this failure, either by dropping the incoming packet or by deferring to wait in a thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, can we forget about this for now. I'm trying to keep the changes minimal since this is to be backported, and adding such a log would mean that we should add it to bt_buf_get_cmd_complete() and bt_buf_get_rx() as well for consistency. In fact, the place you're asking to add it is the least important since discardable buffers are designed to be able to get allocation failures where as for the other pools it's a more severe thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, np.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, np.

@Vudentz any chance you could withdraw your "changes requested' then? :)

@@ -47,6 +48,26 @@ config BT_RX_BUF_LEN
an L2CAP MTU of 65 bytes. On top of this there's the L2CAP header
(4 bytes) and the ACL header (also 4 bytes) which yields 73 bytes.

config BT_DISCARDABLE_POOL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use BT_DISCARDABLE_BUF_COUNT != 0 instead of adding this Kconfig option?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at the very least make this hidden and set it based on that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use BT_DISCARDABLE_BUF_COUNT != 0 instead of adding this Kconfig option?

Mainly because it's more natural to have a boolean instead of indirect > 0 in some places, but also because there are two levels of dependencies and it's (IMO) more natural to reference them with two options: first you have the drivers which are able to support the discardable pool (with this PR it's the native controller and H:4), so these are used to enable the pool in general. Next you have the second level, i.e. if the pool has been enabled what the default should be (e.g. a higher value if BT_MESH is enable). I find it easier to understand the dependencies with the split.

That said, if we forget about the slightly better readability of having a boolean the way to represent this with a single DISCARDABLE_BUF_COUNT option is to transform the DISCARDAB_POOL default y conditions into the DISCARDABLE_BUF_COUNT dependencies. That way it's not != 0 we'll use, rather #if defined(CONFIG_BT_DISCARDABLE_BUF_COUNT) which gets us back to a boolean again. Makes sense? It's slightly worse since we're then using an int type both as an int and as a general boolean (i.e. does the option exist at all or not).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carlescufi so if based on my explanation you're still in favour of a single option I'd do what I described above. One concrete difference is that you'll no-longer be able to force the option on unless the dependencies are fulfilled, but I guess that's desirable anyway, since it makes no sense to enable this unless we have a driver that can provide sufficiently nuanced information to the buffer allocators to take advantage of this feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carlescufi so the option definition would end up looking like this:

config BT_DISCARDABLE_BUF_COUNT
        int "Number of discardable event buffers"
        range 1 255
        default 20 if BT_MESH
        default 3
        depends on BT_H4 || BT_CTLR
        help
          Number of buffers in a separate buffer pool for events which
          the HCI driver considers discardable. Examples of such events
          could be e.g. Advertising Reports. The benefit of having such
          a pool means that the if there is a heavy inflow of such events
          it will not cause the allocation for other critical events to
          block and may even eliminate deadlocks in some cases.

I've not stretched the minimum to 0 since that makes no sense if we can simply either have this defined to something meaningful or not defined at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carlescufi I kind of like this single-option approach now that I tested implementing it - I've updated the PR accordingly.


config BT_DISCARDABLE_BUF_COUNT
int "Number of discardable event buffers"
range 1 255
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the range here to 0 255

@jhedberg
Copy link
Member Author

@xiaoliang314 would you be able to run some tests on this PR? It's a second attempt at solving the deadlocks. The first patch is the same from the original PR for the num_completed_pkts pool, but the second patch is new.

@xiaoliang314
Copy link

@jhedberg Of course, I will test it next Monday.

@jhedberg jhedberg force-pushed the evt_pools branch 2 times, most recently from 04c10a5 to 01e33bd Compare June 29, 2019 16:36
Copy link
Member

@carlescufi carlescufi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks! Much cleaner than the previous approach

@jhedberg jhedberg removed the DNM This PR should not be merged (Do Not Merge) label Jul 1, 2019
@jhedberg
Copy link
Member Author

jhedberg commented Jul 1, 2019

https://github.com/zephyrproject-rtos/zephyr/blob/4cd675c59ce75372d258c07a73b3c103407ffd6b/subsys/bluetooth/host/hci_core.c#L3126-L3132
@jhedberg My advertising can't be used. It reports a warning message here. Is this a miss?

@xiaoliang314 that warning comes when the HCI driver does not use bt_buf_get_cmd_complete() to allocated a buffer for a command complete/status event, or if bt_buf_get_cmd_complete() is used but the host doesn't have any pending command to match it with.

@jhedberg
Copy link
Member Author

jhedberg commented Jul 1, 2019

@xiaoliang314 since you're not using any upstream HCI driver, rather some modified h4.c (if I understood right), have you looked the changes I've done to the upstream h4.c in this PR?

@jhedberg
Copy link
Member Author

jhedberg commented Jul 1, 2019

@xiaoliang314 that warning comes when the HCI driver does not use bt_buf_get_cmd_complete() to allocated a buffer for a command complete/status event, or if bt_buf_get_cmd_complete() is used but the host doesn't have any pending command to match it with.

Sorry, correction: bt_buf_get_cmd_complete() is (after this PR) effectively internal to the host stack. You should just be using bt_buf_get_evt() and it should take care of the rest. What your description sounds like is that you're passing true as the value for the discardable parameter of bt_buf_get_evt() even in the case of a command complete/status event, which is wrong - you should only pass true for advertising reports, and anything else should be false.

@xiaoliang314
Copy link

@jhedberg Yes, only adversting is marked as true, I have referenced the changes in your PR h4.c

@xiaoliang314
Copy link

I will revisit your changes to make sure I have not missed

@xiaoliang314
Copy link

@jhedberg I still can't find the reason, this is my log, for your reference.

bt_hci_core: opcode 0x0c03 pool id 0 pool 0x20009904 != &hci_cmd_pool 0x200099a4
bt_hci_core: opcode 0x1003 pool id 0 pool 0x20009904 != &hci_cmd_pool 0x200099a4

@xiaoliang314
Copy link

xiaoliang314 commented Jul 1, 2019

@jhedberg I am very sorry, I am writing wrong.

@jhedberg
Copy link
Member Author

jhedberg commented Jul 1, 2019

@xiaoliang314 any further update? I haven't found anything wrong myself when re-reviewing the PR, however I did make one change to bt_buf_get_evt(): it now only allocates from the discardable pool when the event code has first been checked and it isn't one of the critical ones the host stack cares about (num complete, cmd status, cmd complete). This shouldn't make much difference though if the driver is correctly implemented. So if you're still investigating, please try also with the latest version of this PR.

@xiaoliang314
Copy link

@jhedberg Thank you, it has been resolved. Your PR is working properly

Johan Hedberg added 2 commits July 1, 2019 15:38
This event is a priority one, so it's not safe to have it use the RX
buffer pool which may be depleted due to non-priority events (e.g.
advertising events). Since the event is consumed synchronously it's
safe to have a single-buffer pool for it. Also introduce a new
bt_buf_get_evt() API for HCI drivers to simplify the driver-side code,
this effectively also deprecates bt_buf_get_cmd_complete() which now
has no in-tree HCI driver users anymore.

Fixes zephyrproject-rtos#16864

Signed-off-by: Johan Hedberg <[email protected]>
Introduce a separate buffer pool for events which the HCI driver
considers discardable. Examples of such events could be e.g.
Advertising Reports. The benefit of having such a pool means that the
if there is a heavy inflow of such events it will not cause the
allocation for other critical events to block and may even eliminate
deadlocks in some cases.

Also update all mesh samples not to specify explicit RX buffer counts
anymore. Instead, create appropriate defaults in Kconfig so that we
only need to override this in the app for cases like the bbc:microbit
with limited memory.

Signed-off-by: Johan Hedberg <[email protected]>
@jhedberg jhedberg merged commit 0d9dab3 into zephyrproject-rtos:master Jul 1, 2019
@jhedberg jhedberg deleted the evt_pools branch July 1, 2019 13:36
@backporting
Copy link

backporting bot commented Jul 1, 2019

The backport to v1.14-branch failed:

Commits ["61f80376e130098570e2b097124e8f4ccc00e7e9","3684c82b5ddde0d3aa25749229ec1043f17db5a7"] could not be cherry-picked on top of v1.14-branch

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub.
git fetch
# Create new working tree.
git worktree add .worktrees/backport v1.14-branch
# Navigate to the new directory.
cd .worktrees/backport
# Cherry-pick all the commits of this pull request and resolve the likely conflicts.
git cherry-pick 61f80376e130098570e2b097124e8f4ccc00e7e9 3684c82b5ddde0d3aa25749229ec1043f17db5a7
# Create a new branch with these backported commits.
git checkout -b backport-17137-to-v1.14-branch
# Push it to GitHub.
git push --set-upstream origin backport-17137-to-v1.14-branch
# Go back to the original working tree.
cd ../..
# Delete the working tree.
git worktree remove .worktrees/backport

Then, create a pull request where the base branch is v1.14-branch and the compare/head branch is backport-17137-to-v1.14-branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Bluetooth area: Samples Samples area: Tests Issues related to a particular existing or missing test bug The issue is a bug, or the PR is fixing a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bluetooth: Mesh: Rx buffer exhaustion causes deadlock
5 participants