Dealing with jobs failing with "lost communication with the server" errors #466
Comments
This adds support for two upcoming enhancements on the GitHub side of self-hosted runners: ephemeral runners and `workflow_job` events. You can't use these yet. **These features are not yet generally available to all GitHub users.** Please take this pull request as preparation to make them available to actions-runner-controller users as soon as possible after GitHub releases the necessary features on their end.

**Ephemeral runners**: The former, ephemeral runners, is essentially the reliable alternative to `--once`, which we've been using whenever you enable `ephemeral: true` (the default in actions-runner-controller). `--once` has been suffering from a race issue (#466); `--ephemeral` fixes that. To enable ephemeral runners with `actions/runner`, you pass `--ephemeral` to `config.sh`. This updated version of actions-runner-controller does that for you, using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`. Please read the `Ephemeral Runners` section in the updated README for more information. Note that ephemeral runners are not released on GitHub yet, and `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature is released on GitHub. Stay tuned for an announcement from GitHub!

**`workflow_job` events**: `workflow_job` is an additional webhook event that corresponds to each GitHub Actions workflow job run. It gives actions-runner-controller a solid foundation to improve our webhook-based autoscaling. Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events included `labels`, you had to configure an HRA to only match relevant `check_run` events, which wasn't trivial. In contrast, a `workflow_job` event payload contains the `labels` of the requested runners, so actions-runner-controller can automatically decide which HRA to scale by matching the corresponding RunnerDeployment against the `labels` in the webhook payload. All you need for webhook-based autoscaling is to enable `workflow_job` events on GitHub and expose actions-runner-controller's webhook server to the internet.

Note that the current implementation of `workflow_job` support works in two directions: increment and decrement. An increment happens when the webhook server receives a `workflow_job` event with `queued` status; a decrement happens when it receives one with `completed` status. The latter makes scaling down faster so you waste less money than before. You still don't suffer from flapping, as a scale-down remains subject to `scaleDownDelaySecondsAfterScaleOut`. Please read the section "Example 3: Scale on each `workflow_job` event" in the updated README for more information on its usage.
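For readers following along, here is a rough sketch of what the two features are expected to look like in actions-runner-controller manifests. This is illustrative only: the resource names are placeholders, and the fields shown (`RUNNER_FEATURE_FLAG_EPHEMERAL`, `scaleUpTriggers[].githubEvent.workflowJob`) follow the README sections referenced above but may change before the GitHub-side features ship.

```yaml
# Sketch only; field names follow the README examples referenced above and may
# differ in your actions-runner-controller version.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example/myrepo        # placeholder repository
      env:
        # Opt in to --ephemeral (instead of --once) once GitHub ships the feature.
        - name: RUNNER_FEATURE_FLAG_EPHEMERAL
          value: "true"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runnerdeploy-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
    # Scale up on workflow_job "queued"; "completed" events scale back down,
    # still bounded by scaleDownDelaySecondsAfterScaleOut.
    - githubEvent:
        workflowJob: {}
      amount: 1
      duration: "30m"
```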
Is this issue resolved now? We are still getting this for self-hosted runners on a Linux box.
@shreyasGit Hey. First of all, this can't be fixed 100% by ARC alone. For example, if you use EC2 spot instances for hosting self-hosted runners, it's unavoidable (as we can't block spot termination). ARC has addressed all the issues related to this, so you'd better check your deployment first and consider whether it's supposed to work without the communication-lost error. Also, fundamentally this can be considered an issue in GitHub Actions itself, as it doesn't have any facility to auto-restart jobs that disappeared prematurely. Would you mind considering submitting a feature request to GitHub too?
Are there ways that …
@DerekTBrown Hey. I think that's fundamentally impossible. Even if we were able to hook into the host's system log to find out which process got OOM-killed, we have no easy way to correlate it with a process inside a container in a pod, and even worse, GitHub Actions doesn't provide an API to "externally" set a workflow job status. However, I think you would still see the job time out after 10 minutes or so, and the job that was running when the runner disappeared (for whatever reason, like OOM) is eventually marked as failed (although without any explicit error message). Would that be enough?
Really appreciate this! ❤️ We currently see this occasionally: from our stats it's ~2.5% of builds of our main CI/CD pipeline. Our setup is that we're using Spot VMs. Even outside of Spot VMs, there are all sorts of other imaginable reasons that are hard or impossible to mitigate, e.g. node upgrades, OOM kills and suchlike. The dream for us would be for jobs affected by this to automatically restart from the beginning, provisioning a fresh runner and going again.
@alyssa-glean For re-running, I'd recommend a workflow that runs on completion of the specific workflows you want (you can use a cloud-hosted or self-hosted runner for this) and re-triggers the job. We've been doing that at my place of work and it works great; a sketch of that idea is shown below.

As for …, as for …, what didn't work was setting the pod label ….

As for other general termination behavior, we've noticed our longer-running jobs that use docker-in-docker (dind) as a sidecar do not terminate gracefully. The system logs and metrics do show that a termination signal is sent and that the runner pod successfully waits. However, the default behavior of the container runtime/Kubernetes is to send the termination signal to the main process of every container in the pod, including docker-in-docker. This kills the Docker daemon and causes the tests to fail whenever the autoscaler decides the pod should be terminated (and we haven't really figured out why that's happening in the first place). For us this amounts to failures on ~13% of our job runs for these tests. It'd be nice for the sidecar to only be terminated once the runner itself has also exited, similar to the idea here. For now, my best idea (without adding more custom stuff to the image) would be to include dind in the runner container rather than as a sidecar.

EDIT: The dind container DOES have that termination mechanism, but the docs don't suggest how to set it properly. I peeked in the CRD and found the right way to set it (…).
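To make the re-run recommendation above concrete, here is a hypothetical helper workflow. The watched workflow name ("CI") and the single-retry guard are assumptions; it uses the `gh` CLI's `run rerun --failed` so only the jobs that actually failed are retried.

```yaml
# Hypothetical helper: retry failed jobs of a watched workflow, e.g. after a
# runner was lost mid-job. Adjust the watched workflow name to your setup.
name: rerun-lost-runner-jobs
on:
  workflow_run:
    workflows: ["CI"]                      # assumption: the workflow to watch
    types: [completed]
permissions:
  actions: write                           # required for "gh run rerun"
jobs:
  rerun:
    # Only retry the first attempt so a genuinely broken build doesn't loop forever.
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      github.event.workflow_run.run_attempt == 1
    runs-on: ubuntu-latest
    steps:
      - run: gh run rerun ${{ github.event.workflow_run.id }} --failed
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}   # tells gh which repo to act on
```

Because of `--failed`, jobs that already succeeded in the run are not repeated.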
I downgraded Ubuntu from 22.04 LTS to 20.04 LTS and the workflow is no longer exhausting anything. |
I think I have not yet encountered this myself, but I believe any job on self-hosted GitHub runners is subject to this error due to the race condition between the runner agent and GitHub.
This isn't specific to actions-runner-controller and I believe it's an upstream issue. But I'd still like to gather voices and knowledge around it and hopefully find a work-around.
Please see the related issues for more information.
This issue is mainly to gather experiences from whoever has been affected by the error. I'd appreciate it if you could share your stories, workarounds, fixes, etc. around the issue so that it can ideally be fixed upstream or in actions-runner-controller.
Verifying if you're affected by this problem
Note that the error can also happen when your runner pod or node runs out of resources (for example CPU or memory exhaustion, or an OOM kill). If you encounter the error even after tweaking your pod and node resources, it is likely due to the race between the runner agent and GitHub.
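If you suspect resource pressure rather than the race, explicit requests and limits on the runner pod are a quick way to rule it out. A minimal sketch, assuming your actions-runner-controller version exposes the standard Kubernetes `resources` field on the runner template; the values are placeholders to tune for your workloads.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example/myrepo   # placeholder
      # Enough headroom that the runner isn't OOM-killed or evicted mid-job,
      # which also surfaces as "lost communication with the server".
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```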
Information

… and `--once` are the go-to solutions, but I believe both are subject to this race condition issue.

Possible workarounds

Removing the `--once` flag from `run.sh` may "alleviate" this issue, but not completely.
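In actions-runner-controller terms, the closest equivalent of dropping `--once` appears to be marking the runner as non-ephemeral, since (per the pull request comment above) `ephemeral: true` is what makes the controller pass `--once`. A sketch, assuming the `ephemeral` field is available in your version; note this trades the race for long-lived, reused runners.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-persistent-runnerdeploy
spec:
  replicas: 2
  template:
    spec:
      repository: example/myrepo   # placeholder
      # Persistent runner: the controller should no longer start run.sh with --once,
      # avoiding the race at the cost of reusing the runner between jobs.
      ephemeral: false
```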