Make LIFX update handle transient communication failures #90891
Conversation
Hey there @bdraco, mind taking a look at this pull request as it has been labeled with an integration (lifx) you are listed as a code owner for?
So the delay I noticed is failing the tests. I'll look into it deeper after the long weekend here in Australia, unless someone else fixes it first. 😉
Converting to draft because I may have stumbled upon a significant performance improvement refactor that needs more testing/tests.
Force-pushed from 41e8f66 to 88af53c
Force-pushed from 4c89314 to 2fcf49a
Force-pushed from 2fcf49a to aca6817
(I was waiting for the tests to complete before clicking that button... 😛 )
What about this option:

```python
async with asyncio_timeout(MESSAGE_TIMEOUT):
    while len(self.device.message) > 0:
        await asyncio.sleep(0)
```
It's still using 100% of the CPU/available event loop run time until the timeout fires or the while condition returns False, since every time the event loop has free time it's going to do the len check and then return control to the loop runner via the await.
And yet it's recommended by the Python docs (https://docs.python.org/3/library/asyncio-task.html#asyncio.sleep): "Setting the delay to 0 provides an optimized path to allow other tasks to run. This can be used by long-running functions to avoid blocking the event loop for the full duration of the function call." The nature of the fail/success cycle is that it appears to rectify itself immediately after the
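For context, here is a tiny self-contained sketch (not integration code; the names are illustrative) of what sleep(0) polling does: the polling task yields to the event loop on every pass, so other tasks still run, but it re-checks its condition as often as the loop can schedule it.

```python
import asyncio

async def spinner(counter: dict) -> None:
    # Busy-polls a condition, yielding to the event loop on every pass.
    while counter["n"] < 3:
        await asyncio.sleep(0)  # optimized path: let other ready tasks run

async def worker(counter: dict) -> None:
    # Simulates real I/O completing three times.
    for _ in range(3):
        await asyncio.sleep(0.1)
        counter["n"] += 1

async def main() -> None:
    counter = {"n": 0}
    await asyncio.gather(spinner(counter), worker(counter))
    print("spinner exited after", counter["n"], "events")

asyncio.run(main())
```

Both tasks make progress, which matches the docs quote; the trade-off debated below is how many times the polling task gets scheduled while it waits.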
These devices are sometimes flaky and generate a lot of noise from dropouts, since communication is UDP best-effort. We should only mark them unavailable if it's not a momentary blip. Fixes home-assistant#78876
…ing device Signed-off-by: Avi Miller <[email protected]>
Signed-off-by: Avi Miller <[email protected]>
Includes a test that should get coverage of the new code too. Signed-off-by: Avi Miller <[email protected]>
There is nothing wrong with running the sleep once or twice. The problem is running it thousands of times or more.
This makes no sense to me: why would relinquishing the event loop consume a CPU? It should have the exact opposite effect of not using any CPU until it actually has stuff to do. I'll try to get the Python profiler working again, but until then, my subjective experience is that it uses way less CPU because it takes exponentially less time overall to finish. I'm talking orders of magnitude faster this way: from 2-3 seconds per bulb per update to ~0.1-0.3 seconds per bulb.
Signed-off-by: Avi Miller <[email protected]>
Force-pushed from c53bf11 to 88ddc83
If there is nothing else going on and it's waiting for the dict to be empty, you get:
- Task runs and gets to the len check
- Len check returns True
- Return control to the event loop via the await
- Loop resumes the task, len check returns True again
- Return control to the event loop via the await
- ... (repeated as fast as the loop can run) ...
- Len check finally returns False
- Task continues on
I'm not sure if I'm getting your point or if you're making mine, but that's what I want to happen. I need "loop tasks run" to happen more often to get the queue to empty in the first place. Overall, more stuff should happen more often because the loop has more opportunity to run more things while we wait for both the bulb and ...

Are you sure it's not 100% because Home Assistant is just able to do lots of other stuff while it's waiting? :)
It will do what you want, but you could end up with thousands of loop runs until the condition returns False, which it will do as fast as the system can perform. All of those loop runs will never block, because the task will always be ready to check the condition again, and it will consume all available CPU time while it's looping since the len check isn't an asyncio wait primitive.
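For illustration only, here is a minimal sketch of what waiting on an asyncio primitive (rather than polling a dict length) could look like. The ResponseTracker class, its method names, and the idea that the UDP response handler calls ack() are assumptions invented for this example; this is not the aiolifx or integration API.

```python
import asyncio

class ResponseTracker:
    """Hypothetical helper that signals when all outstanding messages are answered."""

    def __init__(self) -> None:
        self.pending: dict[int, object] = {}
        self._empty = asyncio.Event()
        self._empty.set()  # nothing is pending yet

    def add(self, seq: int, msg: object) -> None:
        # Called when a request is sent to the bulb.
        self.pending[seq] = msg
        self._empty.clear()

    def ack(self, seq: int) -> None:
        # Called from the UDP response handler for each reply.
        self.pending.pop(seq, None)
        if not self.pending:
            self._empty.set()

    async def wait_empty(self, timeout: float) -> None:
        # The waiting task sleeps here until ack() empties the dict; no busy polling.
        async with asyncio.timeout(timeout):
            await self._empty.wait()
```

With an Event (or Future), the waiting task is only scheduled when a reply actually arrives or the timeout fires, instead of on every free slice of the event loop. (asyncio.timeout requires Python 3.11+; async_timeout.timeout is the older equivalent.)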
So what? If there is literally nothing to do except wait for a LIFX bulb to respond, who cares if a CPU is being pegged at 100% while it waits? I'd be willing to bet that most folks are running on low-power, low-core-count devices, so the overall impact will still be a subjective (and objective) improvement in overall performance.
There are always other things to do.
Which is exactly why I want to release the event loop: so those things can be done.
There are also other processes and threads on the system, as well as heat concerns with pegging the CPU in one thread. You/we will need to find some other method of resolving the issue.
Except it doesn't happen: I've been running this way for days and my CPUs are not pegged. In fact, the Home Assistant container appears to be using less CPU overall than before. Either way, this is now your problem, not mine. I'll just keep using my working alternative implementation.
Proposed change
Replaces #90872 from @bdraco with a more thorough refactoring of the update process to remove the unnecessary lock and to allow for up to three timeouts before actually offlining a device.
It also disables polling on all the entities except the Light entity as that update grabs all the required data for all the entities anyway. This makes things significantly faster and doesn't overwhelm the bulbs.
Note that I consider this a code quality improvement rather than a bug fix, though the result is the same.
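For a rough sense of the "up to three timeouts" behaviour described above, here is a sketch; the class, method names, and threshold constant are placeholders for this example, not the actual integration code.

```python
import asyncio

MAX_UPDATE_ATTEMPTS = 3  # placeholder for the "three timeouts" threshold

class LIFXUpdateCoordinatorSketch:
    """Sketch only: tolerate transient UDP timeouts before letting a bulb go unavailable."""

    def __init__(self) -> None:
        self._failures = 0

    async def _async_update_data(self) -> None:
        try:
            # Hypothetical helper doing the UDP round trip for all entity data.
            await self._async_poll_device()
        except asyncio.TimeoutError:
            self._failures += 1
            if self._failures >= MAX_UPDATE_ATTEMPTS:
                raise  # only now let the device be marked unavailable
            # Otherwise treat it as a momentary blip and keep the last known state.
        else:
            self._failures = 0  # any successful update resets the counter

    async def _async_poll_device(self) -> None:
        ...
```

Any single successful update resets the counter, so only several consecutive timeouts take the device offline.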
Type of change
Additional information
Checklist
The code has been formatted using Black (black --fast homeassistant tests)

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:
Updated and included derived files by running: python3 -m script.hassfest.
New or updated dependencies have been added to requirements_all.txt. Updated by running: python3 -m script.gen_requirements_all.
Untested files have been added to .coveragerc.

To help with the load of incoming pull requests: