Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth manager saves and restores stale adverts in bluetooth.remote_scanners #130432

Closed
agittins opened this issue Nov 12, 2024 · 14 comments · Fixed by #130877
Closed

Bluetooth manager saves and restores stale adverts in bluetooth.remote_scanners #130432

agittins opened this issue Nov 12, 2024 · 14 comments · Fixed by #130877

Comments

@agittins
Copy link

agittins commented Nov 12, 2024

The problem

I maintain Bermuda and I've had a few cases where folks have had enormous /config/.storage/bluetooth.remote_scanners files (9MB or so), containing several months worth of advertisement data from a non-core integration, even after that integration has been uninstalled.

I haven't yet been able to confirm which integration is causing the issue, but I do have a lead I haven't yet tested, so I don't feel comfortable publicly naming it if it's not the one at fault.

The issue for me is that Bermuda tries to process all known device advertisements, but when it hits this (in-memory) cache of over 10,000 adverts systems are bogging down and becoming non-responsive. To be clear - this is my own fault because I am accessing private members of the bluetooth.manager class rather than a proper API, so naturally I get to keep the smouldering consequences of my own questionable choices. I'll be changing Bermuda to harden itself against this issue in either case.

The issue for HA is that some systems are carrying this extra in-memory cache of very stale adverts, presumably indefinitely. I don't know if it only applies to adverts gathered between a certain range of HA versions, or if it's specific to a particuar custom integration's behaviour.

Manually solving the "issue" involves shutting down HA, renaming/removing the bluetooth.remote_scanners file and starting HA again - which is not ideal for folks not familiar with ssh etc.

I think the expire adverts functions are not working in these cases because the adverts belong to a no-longer-present scanner, so they are sitting in the manager's _all_histories but not assigned to any extant scanner. I think this is also why the data persists, because HA might be modifying the saved data only for present scanners, and leaves the not-present scanner's cache data intact - but I haven't dug too deeply into the bluetooth_adaptors/storage.py to be sure.

I have an example file of about 9MB, but as it comes from a user I'd rather not share it publicly, but I have their permission to email or provide a link via dm to my nextcloud - I'm on the discord so can be contacted there, or my email is [email protected] (I gave up on my own privacy decades ago!)

Some redacted excerpts of the bluetooth.remote_scanners file:

image

Timestamps:
image

Adverts:
image

Running the file through jq '.data."52e3fd636f144605f7665a0ae49aca33".discovered_device_advertisement_datas.[].device.address' gives 12585 lines.

What version of Home Assistant Core has the issue?

core-2024.11.1

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

bluetooth

Link to integration documentation on our website

https://www.home-assistant.io/integrations/bluetooth/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2024-11-11 17:45:07.854 INFO (MainThread) [homeassistant.setup] Setup of domain dhcp took 0.00 seconds
2024-11-11 17:45:07.871 INFO (MainThread) [homeassistant.setup] Setting up usb
2024-11-11 17:45:07.872 INFO (MainThread) [homeassistant.setup] Setup of domain usb took 0.00 seconds
2024-11-11 17:45:07.913 INFO (MainThread) [homeassistant.setup] Setting up zeroconf
2024-11-11 17:45:07.932 INFO (MainThread) [homeassistant.components.zeroconf] Starting Zeroconf broadcast
2024-11-11 17:45:07.936 INFO (MainThread) [homeassistant.setup] Setup of domain zeroconf took 0.02 seconds
2024-11-11 17:45:08.035 INFO (MainThread) [homeassistant.setup] Setting up bluetooth
2024-11-11 17:45:10.800 INFO (MainThread) [homeassistant.setup] Setup of domain bluetooth took 2.76 seconds
2024-11-11 17:45:10.801 WARNING (MainThread) [asyncio] Executing <Task finished name='setup component bluetooth' coro=<async_setup_component() done, defined at /workspaces/homeassistant-core/homeassistant/setup.py:145> result=True created at /workspaces/homeassistant-core/homeassistant/util/async_.py:45> took 2.726 seconds
2024-11-11 17:45:10.812 INFO (MainThread) [homeassistant.bootstrap] Setting up stage 2: {'device_automation', 'search', 'backup', 'timer', 'bermuda', 'stt', 'input_button', 'assist_pipeline', 'blueprint', 'onboarding', 'radio_browser', 'media_source', 'mobile_app', 'input_datetime', 'schedule', 'input_select', 'logbook', 'tag', 'scene', 'homeassistant_alerts', 'tts', 'sun', 'ffmpeg', 'go2rtc', 'file_upload', 'zone', 'met', 'kegtron', 'diagnostics', 'config', 'analytics', 'camera', 'inkbird', 'application_credentials', 'input_number', 'xiaomi_ble', 'hardware', 'trace', 'private_ble_device', 'wake_word', 'stream', 'image_upload', 'my', 'system_health', 'counter', 'energy', 'shopping_list', 'input_text', 'lovelace', 'script', 'bluetooth_adapters', 'conversation', 'default_config', 'esphome', 'intent', 'automation', 'input_boolean', 'google_translate', 'device_tracker', 'history', 'person'}
2024-11-11 17:45:10.814 INFO (MainThread) [homeassistant.setup] Setting up wake_word
2024-11-11 17:45:10.815 INFO (MainThread) [homeassistant.setup] Setup of domain wake_word took 0.00 seconds
2024-11-11 17:45:10.815 INFO (MainThread) [homeassistant.setup] Setting up stt
2024-11-11 17:45:10.817 INFO (MainThread) [homeassistant.setup] Setup of domain stt took 0.00 seconds

Additional information

Note the bluetooth integration taking 2.76 seconds to load. This is on my dev machine with SSD, 32GB RAM, running 2024.11.0 in a devcontainer in vscode.

My apologies for not being able to provide complete diagnostics due to it being client data (and bluetooth diags not being redacted for MACs), but happy to provide them privately.

I have experimented with a "repair" locally, and it does clean the in-memory cache by identifying any unclaimed scanner's adverts, but doesn't affect the on-disk cache after a restart, plus this is probably not suitable for general use since a user might reboot multiple times while a scanner is not present, but still want the history restored later. The timestamps from the sample data appear to be unix epoch timestamps (vs monotonic).

        deleteme=[]
        for devkey, device in self.manager._all_history.items():  # noqa: SLF001
            if device.source not in self.manager._connectable_scanners | self.manager._non_connectable_scanners:
                deleteme.append(devkey)
        for key in deleteme:
            del self.manager._all_history[key]
@home-assistant
Copy link

Hey there @bdraco, mind taking a look at this issue as it has been labeled with an integration (bluetooth) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of bluetooth can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign bluetooth Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


bluetooth documentation
bluetooth source
(message by IssueLinks)

@bdraco
Copy link
Member

bdraco commented Nov 12, 2024

Since monotonic time is only guaranteed to be the same per run we have to convert to epoch when saving

It looks like we don't have good guards for the host system experiencing time travel (time suddenly moving forwards or suddenly moving backwards), however it's unclear if that's the issue until I see the data

You can share the data to my Dropbox [email protected] or my gdrive [email protected]

@bdraco
Copy link
Member

bdraco commented Nov 12, 2024

I'm also bdraco on discord if that works better but I may be slow to respond there as it's a busy day for me

@agittins
Copy link
Author

agittins commented Nov 13, 2024

Great, I have shared it to your dropbox (my account there is [email protected]). Also mentioned in a comment on the file the integration I think might be doing the thing.

@bdraco
Copy link
Member

bdraco commented Nov 13, 2024

I'll try to look at it soon, unfortunately today is also turning out to be another very busy day

@bdraco
Copy link
Member

bdraco commented Nov 14, 2024

Was hoping to be able to investigate this week but sadly I'm going to run out of time since I'm working on the aiohttp 3.11 release this week.

Hopefully I'll be able to give this a look next week.

@bdraco
Copy link
Member

bdraco commented Nov 18, 2024

The rouge scanner is storing the timestamps in milliseconds instead of seconds so they never expire

@bdraco
Copy link
Member

bdraco commented Nov 18, 2024

>>> datetime.datetime.fromtimestamp(1729905162218.4646)
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    datetime.datetime.fromtimestamp(1729905162218.4646)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
ValueError: year 56788 is out of range

@bdraco
Copy link
Member

bdraco commented Nov 18, 2024

I haven't yet been able to confirm which integration is causing the issue, but I do have a lead I haven't yet tested, so I don't feel comfortable publicly naming it if it's not the one at fault.

@agittins Please name it here so it can get fixed.

@bdraco
Copy link
Member

bdraco commented Nov 18, 2024

Bluetooth-Devices/bluetooth-adapters#177 will prevent restoring them so at least they will stop building up but it won't handle the case where the integration is injecting incorrect timestamps at run time which will never expire and pollute the whole stack

@agittins
Copy link
Author

Fantastic, thanks!

Please name it here so it can get fixed.

The custom component is https://github.com/kvj/hass_Bluetooth_Proxy which receives adverts from an Android companion app, https://github.com/kvj/hass_Bluetooth_Proxy_Companion

I've raised an issue there at kvj/hass_Bluetooth_Proxy#3

@agittins
Copy link
Author

agittins commented Nov 21, 2024

I'm tempted to include a repair option in Bermuda for systems that have an unexpectedly large number of adverts, or adverts with impossible timestamps. I'm just not sure if it's a good thing to do or not.

I guess my options are:

  • Leave it alone, the problem will go away in time.
  • Raise a repair and cease processing (to save system stability), with link to this issue or mine.
  • Raise a repair, cease processing and offer to rename the existing bluetooth.remote_scanners cache file, on the expectation that on shutdown HA will write out the current in-memory cache of only active HaBaseScanners. Or will HA also re-write the stale ones?

It feels a bit dirty to be messing with HA's internal storage like that though, but it also feels dirty to leave a system inoperable (either in part or in full) when a fix is possible.

Obviously this isn't something that HA could "bless" in any way, but I'd be interested in your personal opinion on the options and whether this would be an egregious over-step or not. In some ways it feels safer to rename the file directly via the integration than it does to talk a user through doing it via ssh, so 🤷🏼‍♀️

@bdraco
Copy link
Member

bdraco commented Nov 21, 2024

Until the integration is fixed, it's going to keep polluting things at runtime anyway, and leaking memory, so I'd probably just wait for it to be fixed

Additionally, every restart will clean it up now so it's probably not worth worrying about, as the above is probably worse

@agittins
Copy link
Author

Thanks for your thoughts. I think that "user installs my integration, system locks up" is actually worse than all the above! 😅

But yeah, I think I'll just let it "go away" when your fix gets to release.
Thanks for your prompt fix and follow-up, much appreciated!

@github-actions github-actions bot locked and limited conversation to collaborators Dec 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants