-
Notifications
You must be signed in to change notification settings - Fork 6.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bluetooth: Fix deadlock issues when running low on buffers #17137
Conversation
2ed3edb
to
777b08c
Compare
This comment has been minimized.
This comment has been minimized.
* @return A new buffer. | ||
*/ | ||
struct net_buf *bt_buf_get_evt(u8_t evt, s32_t timeout); | ||
struct net_buf *bt_buf_get_evt(u8_t evt, bool discardable, s32_t timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the K_NO_WAIT would discard anyway so Im not sure why we need another variable besides that is completely ignored when CONFIG_BT_DISCARDABLE_POOL is not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Vudentz those are not quite the same. A HCI driver might be calling this from an ISR for a non-discardable event, in which case K_NO_WAIT is the right timeout to pass. E.g. the h4.c driver does this, and then if that fails it defers the further allocation attempts to the RX thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I though would use the new pool as well but it seems they are exclusive to each type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm all for eliminating redundancy in parameters, but in this case they're a bit orthogonal and we can't make too many assumptions about getting a K_NO_WAIT.
subsys/bluetooth/host/hci_core.c
Outdated
net_buf_reserve(buf, CONFIG_BT_HCI_RESERVE); | ||
bt_buf_set_type(buf, BT_BUF_EVT); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should log something here if we cannot allocate a buffer and thus would discard the event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HCI drivers (at least the H4 one) have the logs for when they discard events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the LOG domain that may not be printed though and since this is behind a config option it may be difficult to analize traces so I don't think it would hurt to say we are dropping here, it doesn't need to be an error/warning though since since it been flagged as discardable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add a log, it's not quite dropping that's happening here, rather just a failed allocation. It's then up to the driver to deal with this failure, either by dropping the incoming packet or by deferring to wait in a thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can we forget about this for now. I'm trying to keep the changes minimal since this is to be backported, and adding such a log would mean that we should add it to bt_buf_get_cmd_complete()
and bt_buf_get_rx()
as well for consistency. In fact, the place you're asking to add it is the least important since discardable buffers are designed to be able to get allocation failures where as for the other pools it's a more severe thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, np.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, np.
@Vudentz any chance you could withdraw your "changes requested' then? :)
subsys/bluetooth/host/Kconfig
Outdated
@@ -47,6 +48,26 @@ config BT_RX_BUF_LEN | |||
an L2CAP MTU of 65 bytes. On top of this there's the L2CAP header | |||
(4 bytes) and the ACL header (also 4 bytes) which yields 73 bytes. | |||
|
|||
config BT_DISCARDABLE_POOL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use BT_DISCARDABLE_BUF_COUNT != 0
instead of adding this Kconfig option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or at the very least make this hidden and set it based on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use BT_DISCARDABLE_BUF_COUNT != 0 instead of adding this Kconfig option?
Mainly because it's more natural to have a boolean instead of indirect > 0
in some places, but also because there are two levels of dependencies and it's (IMO) more natural to reference them with two options: first you have the drivers which are able to support the discardable pool (with this PR it's the native controller and H:4), so these are used to enable the pool in general. Next you have the second level, i.e. if the pool has been enabled what the default should be (e.g. a higher value if BT_MESH is enable). I find it easier to understand the dependencies with the split.
That said, if we forget about the slightly better readability of having a boolean the way to represent this with a single DISCARDABLE_BUF_COUNT option is to transform the DISCARDAB_POOL default y
conditions into the DISCARDABLE_BUF_COUNT dependencies. That way it's not != 0
we'll use, rather #if defined(CONFIG_BT_DISCARDABLE_BUF_COUNT)
which gets us back to a boolean again. Makes sense? It's slightly worse since we're then using an int
type both as an int and as a general boolean (i.e. does the option exist at all or not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlescufi so if based on my explanation you're still in favour of a single option I'd do what I described above. One concrete difference is that you'll no-longer be able to force the option on unless the dependencies are fulfilled, but I guess that's desirable anyway, since it makes no sense to enable this unless we have a driver that can provide sufficiently nuanced information to the buffer allocators to take advantage of this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlescufi so the option definition would end up looking like this:
config BT_DISCARDABLE_BUF_COUNT
int "Number of discardable event buffers"
range 1 255
default 20 if BT_MESH
default 3
depends on BT_H4 || BT_CTLR
help
Number of buffers in a separate buffer pool for events which
the HCI driver considers discardable. Examples of such events
could be e.g. Advertising Reports. The benefit of having such
a pool means that the if there is a heavy inflow of such events
it will not cause the allocation for other critical events to
block and may even eliminate deadlocks in some cases.
I've not stretched the minimum to 0 since that makes no sense if we can simply either have this defined to something meaningful or not defined at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlescufi I kind of like this single-option approach now that I tested implementing it - I've updated the PR accordingly.
|
||
config BT_DISCARDABLE_BUF_COUNT | ||
int "Number of discardable event buffers" | ||
range 1 255 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the range here to 0 255
@xiaoliang314 would you be able to run some tests on this PR? It's a second attempt at solving the deadlocks. The first patch is the same from the original PR for the num_completed_pkts pool, but the second patch is new. |
@jhedberg Of course, I will test it next Monday. |
04c10a5
to
01e33bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good thanks! Much cleaner than the previous approach
@xiaoliang314 that warning comes when the HCI driver does not use |
@xiaoliang314 since you're not using any upstream HCI driver, rather some modified h4.c (if I understood right), have you looked the changes I've done to the upstream h4.c in this PR? |
Sorry, correction: |
@jhedberg Yes, only adversting is marked as true, I have referenced the changes in your PR h4.c |
I will revisit your changes to make sure I have not missed |
@jhedberg I still can't find the reason, this is my log, for your reference.
|
@jhedberg I am very sorry, I am writing wrong. |
@xiaoliang314 any further update? I haven't found anything wrong myself when re-reviewing the PR, however I did make one change to |
@jhedberg Thank you, it has been resolved. Your PR is working properly |
This event is a priority one, so it's not safe to have it use the RX buffer pool which may be depleted due to non-priority events (e.g. advertising events). Since the event is consumed synchronously it's safe to have a single-buffer pool for it. Also introduce a new bt_buf_get_evt() API for HCI drivers to simplify the driver-side code, this effectively also deprecates bt_buf_get_cmd_complete() which now has no in-tree HCI driver users anymore. Fixes zephyrproject-rtos#16864 Signed-off-by: Johan Hedberg <[email protected]>
Introduce a separate buffer pool for events which the HCI driver considers discardable. Examples of such events could be e.g. Advertising Reports. The benefit of having such a pool means that the if there is a heavy inflow of such events it will not cause the allocation for other critical events to block and may even eliminate deadlocks in some cases. Also update all mesh samples not to specify explicit RX buffer counts anymore. Instead, create appropriate defaults in Kconfig so that we only need to override this in the app for cases like the bbc:microbit with limited memory. Signed-off-by: Johan Hedberg <[email protected]>
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub.
git fetch
# Create new working tree.
git worktree add .worktrees/backport v1.14-branch
# Navigate to the new directory.
cd .worktrees/backport
# Cherry-pick all the commits of this pull request and resolve the likely conflicts.
git cherry-pick 61f80376e130098570e2b097124e8f4ccc00e7e9 3684c82b5ddde0d3aa25749229ec1043f17db5a7
# Create a new branch with these backported commits.
git checkout -b backport-17137-to-v1.14-branch
# Push it to GitHub.
git push --set-upstream origin backport-17137-to-v1.14-branch
# Go back to the original working tree.
cd ../..
# Delete the working tree.
git worktree remove .worktrees/backport Then, create a pull request where the |
The first patch is from #16870. The second patch is what I think is a better approach to handling lots of discardable events hogging up buffers that may be needed for other more important events. By allocating these from a separate pool we make sure not to starve the more important buffer pools.
Fixes #16864