Make LIFX update handle transient communication failures #90891

Closed
wants to merge 10 commits

Conversation

@Djelibeybi (Contributor) commented Apr 6, 2023

Proposed change

Replaces #90872 from @bdraco with a more thorough refactoring of the update process to remove the unnecessary lock and to allow for up to three timeouts before actually offlining a device.
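
To make the "up to three timeouts" idea concrete, here is a minimal sketch; the MAX_UPDATE_ATTEMPTS constant, the counter, and the _async_populate_device_state() helper are illustrative assumptions, not the code in this PR.

import asyncio

from homeassistant.helpers.update_coordinator import UpdateFailed

MAX_UPDATE_ATTEMPTS = 3  # assumed threshold matching "up to three timeouts"


class TransientFailureSketch:
    """Sketch only: tolerate a few consecutive timeouts before offlining."""

    def __init__(self) -> None:
        self._consecutive_timeouts = 0

    async def _async_populate_device_state(self) -> None:
        """Hypothetical helper that fetches color/power/firmware from the bulb."""
        raise NotImplementedError

    async def _async_update_data(self) -> None:
        try:
            await self._async_populate_device_state()
        except asyncio.TimeoutError as ex:
            self._consecutive_timeouts += 1
            if self._consecutive_timeouts >= MAX_UPDATE_ATTEMPTS:
                raise UpdateFailed("Device failed three consecutive updates") from ex
            return  # transient blip: keep the last known state and retry next cycle
        self._consecutive_timeouts = 0  # any successful update resets the counter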

It also disables polling on all the entities except the Light entity as that update grabs all the required data for all the entities anyway. This makes things significantly faster and doesn't overwhelm the bulbs.
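
As a rough illustration of disabling per-entity polling in Home Assistant (the entity class here is hypothetical, and whether this PR uses exactly this mechanism is not shown in this thread), the entity-level should_poll flag looks like this:

from homeassistant.helpers.entity import Entity


class LIFXSecondaryEntitySketch(Entity):
    # Hypothetical entity: opt out of per-entity polling because the light
    # entity's update already fetches the shared state for every entity.
    _attr_should_poll = False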

Note that I consider this a code quality improvement rather than a bug fix, though the result is the same.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Black (black --fast homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
  • Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

@home-assistant bot commented Apr 6, 2023

Hey there @bdraco, mind taking a look at this pull request as it has been labeled with an integration (lifx) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of lifx can trigger bot actions by commenting:

  • @home-assistant close: Closes the pull request.
  • @home-assistant rename Awesome new title: Renames the pull request.
  • @home-assistant reopen: Reopens the pull request.
  • @home-assistant unassign lifx: Removes the current integration label and assignees on the pull request (add the integration domain after the command).

@Djelibeybi (Contributor, Author)

So the delay I noticed is failing the tests. I'll look into it deeper after the long weekend here in Australia, unless someone else fixes it first. 😉

@Djelibeybi Djelibeybi marked this pull request as draft April 7, 2023 07:38
@Djelibeybi (Contributor, Author)

Converting to draft because I may have stumbled upon a significant performance improvement refactor that needs more testing/tests.

@Djelibeybi Djelibeybi changed the title Refactor LIFX data update to allow for up to three timeouts before offlining a device Refactor LIFX integration to remove redunant update requests Apr 8, 2023
@Djelibeybi Djelibeybi changed the title Refactor LIFX integration to remove redunant update requests Refactor LIFX integration to remove redundant update requests Apr 8, 2023
@Djelibeybi Djelibeybi marked this pull request as ready for review April 8, 2023 00:37
@bdraco bdraco self-requested a review April 8, 2023 01:17
@Djelibeybi (Contributor, Author)

(I was waiting for the tests to complete before clicking that button... 😛 )

@bdraco bdraco marked this pull request as draft April 9, 2023 19:56
@Djelibeybi (Contributor, Author) commented Apr 9, 2023

The new event.wait() is causing my LIFX Beams and GU10s to time out. I've had this happen over 20 times in the past hour, which is unacceptable IMO when the await asyncio.sleep(0) option results in zero timeouts. I get that it's not ideal, but given the state of aiolifx, we really should go with the more effective option.

The problem isn't that it's not ideal; it could end up using up all the CPU time if the dict never gets emptied, and there are definitely paths in aiolifx where that can happen.

What about this option:

async with asyncio_timeout(MESSAGE_TIMEOUT):
    while len(self.device.message) > 0:
        await asyncio.sleep(0)

@bdraco (Member) commented Apr 9, 2023

The new event.wait() is causing my LIFX Beams and GU10s to time out. I've had this happen over 20 times in the past hour, which is unacceptable IMO when the await asyncio.sleep(0) option results in zero timeouts. I get that it's not ideal, but given the state of aiolifx, we really should go with the more effective option.

The problem isn't that it's not ideal; it could end up using up all the CPU time if the dict never gets emptied, and there are definitely paths in aiolifx where that can happen.

What about this option:

async with asyncio_timeout(MESSAGE_TIMEOUT):
    while len(self.device.message) > 0:
        await asyncio.sleep(0)

It's still using 100% of the CPU/available event loop run time until the timeout fires or the while condition returns False, since every time the event loop has free time it's going to do the len check and then return control to the loop runner via the await.

@Djelibeybi (Contributor, Author) commented Apr 9, 2023

It's still using 100% of the CPU/available event loop run time until the timeout fires or the while condition returns False

And yet it's recommended by the Python docs (https://docs.python.org/3/library/asyncio-task.html#asyncio.sleep): "Setting the delay to 0 provides an optimized path to allow other tasks to run. This can be used by long-running functions to avoid blocking the event loop for the full duration of the function call."

The nature of the fail/success cycle is that it appears to rectify itself immediately after the _async_update_data() method finishes, which is why I went down the path of trying to find ways of getting the method to play nicer with others.

bdraco and others added 9 commits April 10, 2023 09:26
These devices are sometimes flaky and generate a lot of noise from drop-outs since communication is UDP best-effort. We should only mark them unavailable if it's not a momentary blip

fixes home-assistant#78876
Signed-off-by: Avi Miller <[email protected]>
Includes test that should get coverage of the new code too

Signed-off-by: Avi Miller <[email protected]>
@bdraco (Member) commented Apr 9, 2023

It's still using 100% of the CPU/available event loop run time until the timeout fires or the while condition returns False

And yet it's recommended by the Python docs (https://docs.python.org/3/library/asyncio-task.html#asyncio.sleep): "Setting the delay to 0 provides an optimized path to allow other tasks to run. This can be used by long-running functions to avoid blocking the event loop for the full duration of the function call."

The nature of the fail/success cycle is that it appears to rectify itself immediately after the _async_update_data() method finishes, which is why I went down the path of trying to find ways of getting the method to play nicer with others.

There is nothing wrong with running the sleep once or twice. The problem is running it thousands of times or more

@Djelibeybi (Contributor, Author) commented Apr 9, 2023

There is nothing wrong with running the sleep once or twice. The problem is running it thousands of times or more

This makes no sense to me: why would relinquishing the event loop consume a CPU? It should have the exact opposite effect of not using any CPU until it actually has stuff to do.

I'll try to get the Python profiler working again, but until then, my subjective experience is that it uses way less CPU because it takes exponentially less time overall to finish. I'm talking orders of magnitude faster this way: from 2-3 seconds per bulb per update to ~0.1-0.3 seconds per bulb.

@bdraco (Member) commented Apr 10, 2023

If there is nothing else going on and it's waiting for the dict to be empty, you get:

  • Task runs, the len check returns True; control returns to the event loop via the await; loop tasks run
  • Task resumes, the len check returns True; control returns to the event loop via the await; loop tasks run
  • Task resumes, the len check returns True; control returns to the event loop via the await; loop tasks run
  • ... (and so on, as fast as the loop can spin) ...
  • Bulb responds and aiolifx removes the entry from the dict
  • Task resumes, the len check returns False
  • Task continues on
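
To put a rough number on that spinning, here is a small standalone demonstration (not code from this PR) that counts how many times a sleep(0) busy-wait iterates while a simulated reply arrives after 50 ms:

import asyncio


async def main() -> None:
    pending = {"ack": object()}  # stand-in for aiolifx's device.message dict

    async def respond_later() -> None:
        await asyncio.sleep(0.05)  # simulated bulb reply after 50 ms
        pending.clear()

    asyncio.create_task(respond_later())

    spins = 0
    while len(pending) > 0:
        spins += 1
        await asyncio.sleep(0)  # yields, but the task is immediately runnable again

    print(f"busy-wait loop ran {spins} times while waiting ~50 ms")


asyncio.run(main())

On typical hardware this prints a count in the thousands or more, which is the scale being described above.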

@Djelibeybi (Contributor, Author) commented Apr 10, 2023

I'm not sure I'm getting your point or if you're making mine, but that's what I want to happen. I need "loop tasks run" to happen more often to get the queue to empty in the first place. Overall more stuff should happen more often because the loop has more opportunity to run more things while we wait for both bulb and aiolifx to do their thing.

Are you sure it's not 100% because Home Assistant is just able to do lots of other stuff while it's waiting? :)

@bdraco (Member) commented Apr 10, 2023

It will do what you want

... but you could end up with thousands of loop runs until the condition returns false which it will do as fast as the system can perform

All of those loop runs will never block, because the task will always be ready to check the condition again, and it will consume all available CPU time while it's looping since the len check isn't an asyncio wait primitive
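
For contrast, a minimal sketch of waiting on an actual asyncio primitive instead of polling; the queue_empty event and the MESSAGE_TIMEOUT value are illustrative assumptions, and wiring the event to aiolifx's response handling is left out:

import asyncio

MESSAGE_TIMEOUT = 8  # illustrative value, not necessarily the integration's constant


async def wait_for_replies(queue_empty: asyncio.Event) -> bool:
    """Park the task until queue_empty is set, or give up after MESSAGE_TIMEOUT.

    Whatever code drains the device's message dict would call queue_empty.set();
    until then this task consumes no event-loop time at all.
    """
    try:
        await asyncio.wait_for(queue_empty.wait(), timeout=MESSAGE_TIMEOUT)
    except asyncio.TimeoutError:
        return False  # treat as a transient failure rather than spinning
    return True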

@Djelibeybi (Contributor, Author)

... but you could end up with thousands of loop runs until the condition returns false which it will do as fast as the system can perform

So what? If there is literally nothing to do except wait for a LIFX bulb to respond, who cares if a CPU is being pegged at 100% while it waits? I'd be willing to bet that most folks are running on low power, low core count devices, so the overall impact will still be a subjective (and objective) improvement in overall performance.

@bdraco bdraco added the second-opinion-wanted Add this label when a reviewer needs a second opinion from another member. label Apr 10, 2023
@balloob balloob closed this Apr 10, 2023
@balloob (Member) commented Apr 10, 2023

There are always other things to do.

@Djelibeybi (Contributor, Author)

There are always other things to do.

Which is exactly why I want to release the event loop. So those things can be done.

@jjlawren jjlawren marked this pull request as ready for review April 10, 2023 01:20
@elupus (Contributor) commented Apr 10, 2023

There are also other processes and threads on the system, as well as heat concerns with pegging cpu in one thread. You/we will need to find some other method of resolving the issue.

@Djelibeybi (Contributor, Author)

There are also other processes and threads on the system, as well as heat concerns with pegging cpu in one thread. You/we will need to find some other method of resolving the issue.

Except it doesn't happen: I've been running this way for days and my CPUs are not pegged. In fact, the Home Assistant container appears to be using less CPU overall than before.

Either way, this is now your problem not mine. I'll just keep using my working alternative implementation.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
@Djelibeybi Djelibeybi deleted the lifx_raise_fallback branch November 25, 2023 11:39

Successfully merging this pull request may close these issues.

Lifx integration with many devices frequently goes unavailable
4 participants