Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(robot-server): fix robot-server blinker task startup causing hw init failure on the Flex. #16483

Merged
merged 2 commits into from
Oct 16, 2024

Conversation

vegano1
Copy link
Contributor

@vegano1 vegano1 commented Oct 14, 2024

Overview

The robot server crashes during initialization after doing firmware updates, this happens because we attempt to use the FrontButtonBlinker.start_blinker task in the Flex. The Opentrons/opentrons#6286 introduced this issue, it added the modern lifetime concept to the FastAPI robot-server startup routine to perform better setup and teardowns. However, it also removed fbl_init which checked if the robot was a Flex before instantiating the FrontButtonBlinkerLight class. This pull request fixes this issue by adding a conditional that checks if we are on a Flex when instantiating the FrontButtonBlinkerLight task.

Closes: RQA-3302

Test Plan and Hands on Testing

  • Make sure the robot server starts up after doing firmware updates on the Flex
  • Make sure the blinker LED on the OT-2 still works

Changelog

  • Dont run the FrontButtonBlinkerLight task on the Flex

Review requests

  • Makes sense?

Risk assessment

low

@vegano1 vegano1 requested a review from a team as a code owner October 14, 2024 19:26
@vegano1 vegano1 changed the title fix(robot-server): fix robot-server light task startup procedure fix(robot-server): fix robot-server light task startup causing hw init failure. Oct 14, 2024
Copy link
Contributor

@SyntaxColoring SyntaxColoring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks.

I think we could do this a bit less intrusively and preserve more of the non-buggy parts of #16286.

From your description and from the stack trace in the ticket, it seems like my mistake was specifically this chunk:

async def blink_forever() -> None:
while True:
# This should no-op on a Flex.
await hardware.set_lights(button=True)
await asyncio.sleep(0.5)
await hardware.set_lights(button=False)
await asyncio.sleep(0.5)

The bug is that, contrary to the comment, it only mostly no-ops—if the Flex is undergoing a firmware update, it raises.

So could we solve this by any of the following:

  • Making hardware.set_lights(button=...) actually always no-op on a Flex
  • Having FrontButtonLightBlinker.start_blinking() check the robot type and avoid calling hardware.set_lights(button=...) if it's a Flex
  • Having FrontButtonLightBlinker.start_blinking() check the firmware update status and avoid calling hardware.set_lights(button=...) if there's an update in progress
  • Having the calling code in app_setup.py avoid calling blinker.start_blinking() if it's a Flex
  • Adding a commented contextlib.suppress or try/except anywhere

My thinking is that I would really prefer to avoid going back a bunch of free functions that interact with each other through app_state, if we can help it. And I think that scope and lifetime stuff is orthogonal to the missing-conditional problem that you've identified.

@vegano1
Copy link
Contributor Author

vegano1 commented Oct 15, 2024

@SyntaxColoring

Valid points, although

  • The app_setup owns the FrontButtonLightBlinker class instead of the hardware module, which feels odd.
  • The app_setup now also needs to handle the lifecycle of this class, which was done in the hardware_teardown previously.
  • The hardware initialization callback pattern uses this app_setup floating function mechanism, do we not want that moving forward?
  • Having the hardware tasks centralized in the hardware module is much cleaner, even if we are using free-floating functions. The alternative of using lambda functions to achieve the same effect does not feel as clean when we think about where FrontButtonLightBlinker is defined.

@SyntaxColoring
Copy link
Contributor

  • The hardware initialization callback pattern uses this app_setup floating function mechanism, do we not want that moving forward?

Correct, I would like to move away from that.

  • Where it makes sense, I would like these resources to define themselves, first and foremost, as context managers—because that's normal, testable, composable, library-independent, good-practices Python.

  • Then, as a layer above that, I would like app_setup's job to mostly be configuring our FastAPI app with those resources. That basically means entering each resource's context manager, getting its value (e.g. a HardwareAPI object or a ProtocolStore object), and storing it on app.state where our FastAPI endpoint functions can retrieve it later.

    I think this has a bunch of benefits, but here's one specific one. Some of the resources depend on each other. If we define the resources as context managers and merely compose them in app_setup, that dependency wire-up can happen via normal parameter passing explicitly. Compare with the get_task_runner() lines that you have in hardware.py now, which are an implicit ordering dependency.

  • The app_setup owns the FrontButtonLightBlinker class instead of the hardware module, which feels odd.
  • The app_setup now also needs to handle the lifecycle of this class, which was done in the hardware_teardown previously.
  • Having the hardware tasks centralized in the hardware module is much cleaner, even if we are using free-floating functions. The alternative of using lambda functions to achieve the same effect does not feel as clean when we think about where FrontButtonLightBlinker is defined.

Not totally following these points—let's talk about them in-person.

@vegano1 vegano1 force-pushed the RQA-3302-fix-robot-server=startup branch from 1121fa7 to a874ebe Compare October 15, 2024 19:38
@vegano1 vegano1 changed the title fix(robot-server): fix robot-server light task startup causing hw init failure. fix(robot-server): fix robot-server blinker task startup causing hw init failure on the Flex. Oct 15, 2024
@vegano1
Copy link
Contributor Author

vegano1 commented Oct 15, 2024

@SyntaxColoring

We spoke IRL, I have de-coupled the bug fix from the re-factor work I was trying to do so we can move this forward. I will address the reactors another time in a separate pull request. Thanks for the valid points, I understand the direction a lot better now!

Copy link
Contributor

@SyntaxColoring SyntaxColoring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@@ -153,6 +153,10 @@ async def start_blinking(self, hardware: HardwareControlAPI) -> None:
Blinking will continue until `mark_hardware_init_complete()` and
`mark_persistence_init_complete()` have both been called.
"""
if should_use_ot3():
# Dont run this task on the Flex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth elaborating in this comment why we want to take pains to avoid running this task on the Flex. And we probably want to correct my old comment below that wrongly says that it will no-op on the Flex. (It will raise an exception, not no-op, if there’s a firmware update in progress.)

@vegano1 vegano1 merged commit e3dc8d7 into edge Oct 16, 2024
7 checks passed
@vegano1 vegano1 deleted the RQA-3302-fix-robot-server=startup branch October 16, 2024 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants