You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Very often, the self-hosted runners fail with this message:
The self-hosted runner: Airflow Runner 32 lost communication with the server.
Verify the machine is running and has a healthy network connection.
Anything in your workflow that terminates the runner process, starves it for CPU/Memory,
or blocks its network access can cause this error. |
I have been working on this slowly - my hypothesis is it's a race condition: when the runner is busy it is protected from scale in, it finishes, gets un-protected from scale in, AWS starts terminating it, but before the instance terminates it picks up a new job. Right in time to get hard killed.
My in progress fix is to use a lifecycle hook to not get killed instantly.
Very often, the self-hosted runners fail with this message:
Example failure: https://github.com/apache/airflow/actions/runs/584691417
It happened basically every time (and in many cases more than once) over the last few pushes I've done.
I think we need to get to the root cause of it - I suspect this might have something to do with scaling in/out the runners.
Happy to help solving it - I just need to have access to logs @ashb :).
The text was updated successfully, but these errors were encountered: