Dealing with jobs failing with "lost communication with the server" errors #466
Comments
This adds support for two upcoming enhancements on the GitHub side of self-hosted runners: ephemeral runners and `workflow_job` events. You can't use these yet. **These features are not yet generally available to all GitHub users.** Please take this pull request as preparation to make them available to actions-runner-controller users as soon as possible after GitHub releases the necessary features on their end.

**Ephemeral runners**: The former, ephemeral runners, is essentially the reliable alternative to `--once`, which we've been using whenever you enable `ephemeral: true` (the default in actions-runner-controller). `--once` has been suffering from a race issue (#466); `--ephemeral` fixes that. To enable ephemeral runners with `actions/runner`, you pass `--ephemeral` to `config.sh`. This updated version of actions-runner-controller does that for you, using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`. Please read the `Ephemeral Runners` section in the updated README for more information. Note that ephemeral runners are not released on GitHub yet, and `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature is released on GitHub. Stay tuned for an announcement from GitHub!

**`workflow_job` events**: `workflow_job` is an additional webhook event that corresponds to each GitHub Actions workflow job run. It gives actions-runner-controller a solid foundation to improve our webhook-based autoscaling. Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events included `labels`, you had to configure an HRA to only match relevant `check_run` events, which wasn't trivial. In contrast, a `workflow_job` event payload contains the `labels` of the requested runners, so actions-runner-controller can automatically decide which HRA to scale by matching the corresponding RunnerDeployment against the `labels` in the webhook payload. All you need for webhook-based autoscaling is to enable `workflow_job` events on GitHub and expose actions-runner-controller's webhook server to the internet.

Note that the current implementation of `workflow_job` support works in two directions: increment and decrement. An increment happens when the webhook server receives a `workflow_job` event with `queued` status; a decrement happens when it receives one with `completed` status. The latter makes scaling down faster so you waste less money than before. You still don't suffer from flapping, as a scale-down remains subject to `scaleDownDelaySecondsAfterScaleOut`. Please read the section "Example 3: Scale on each `workflow_job` event" in the updated README for more information on its usage.
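For readers following along, here is a rough sketch of what the two features are expected to look like in actions-runner-controller manifests. This is illustrative only: the resource names are placeholders, and the fields shown (`RUNNER_FEATURE_FLAG_EPHEMERAL`, `scaleUpTriggers[].githubEvent.workflowJob`) follow the README sections referenced above but may change before the GitHub-side features ship.

```yaml
# Sketch only; field names follow the README examples referenced above and may
# differ in your actions-runner-controller version.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example/myrepo        # placeholder repository
      env:
        # Opt in to --ephemeral (instead of --once) once GitHub ships the feature.
        - name: RUNNER_FEATURE_FLAG_EPHEMERAL
          value: "true"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runnerdeploy-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  minReplicas: 1
  maxReplicas: 10
  scaleUpTriggers:
    # Scale up on workflow_job "queued"; "completed" events scale back down,
    # still bounded by scaleDownDelaySecondsAfterScaleOut.
    - githubEvent:
        workflowJob: {}
      amount: 1
      duration: "30m"
```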
Is this issue resolved now? We are still getting this for self-hosted runners on a Linux box.
@shreyasGit Hey. First of all, this can't be fixed 100% by ARC alone. For example, if you use EC2 spot instances for hosting self-hosted runners, it's unavoidable (as we can't block spot termination). ARC has addressed all the issues related to this, so you'd better check your deployment first and consider whether it's supposed to work without the communication-lost error. Also, fundamentally this can be considered an issue in GitHub Actions itself, as it doesn't have any facility to auto-restart jobs that disappeared prematurely. Would you mind considering submitting a feature request to GitHub too?
Are there ways that …
@DerekTBrown Hey. I think that's fundamentally impossible. Even if we were able to hook into the host's system log to find out which process got OOM-killed, we have no easy way to correlate it with a process inside a container in a pod, and even worse, GitHub Actions doesn't provide an API to "externally" set a workflow job status. However, I think you would still see the job time out after 10 minutes or so, and the job that was running when the runner disappeared (for whatever reason, like OOM) is eventually marked as failed (although without any explicit error message). Would that be enough?
Really appreciate this! ❤️ We currently see this occasionally: from our stats it's ~2.5% of builds of our main CI/CD pipeline. Our setup is that we're using Spot VMs. Even outside of Spot VMs, there are all sorts of other imaginable reasons that are hard or impossible to mitigate, e.g. node upgrades, OOM kills and suchlike. The dream for us would be for jobs affected by this to automatically restart from the beginning, provisioning a fresh runner and going again.
@alyssa-glean For re-running, I'd recommend a workflow that runs on completion of the specific workflows you want (you can use a cloud-hosted or self-hosted runner for this) and re-triggers the job. We've been doing that at my place of work and it works great; a sketch of that idea is shown below.

As for …, as for …, what didn't work was setting the pod label ….

As for other general termination behavior, we've noticed our longer-running jobs that use docker-in-docker (dind) as a sidecar do not terminate gracefully. The system logs and metrics do show that a termination signal is sent and that the runner pod successfully waits. However, the default behavior of the container runtime/Kubernetes is to send the termination signal to the main process of every container in the pod, including docker-in-docker. This kills the Docker daemon and causes the tests to fail whenever the autoscaler decides the pod should be terminated (and we haven't really figured out why that's happening in the first place). For us this amounts to failures on ~13% of our job runs for these tests. It'd be nice for the sidecar to only be terminated once the runner itself has also exited, similar to the idea here. For now, my best idea (without adding more custom stuff to the image) would be to include dind in the runner container rather than as a sidecar.

EDIT: The dind container DOES have that termination mechanism, but the docs don't suggest how to set it properly. I peeked in the CRD and found the right way to set it (…).
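To make the re-run recommendation above concrete, here is a hypothetical helper workflow. The watched workflow name ("CI") and the single-retry guard are assumptions; it uses the `gh` CLI's `run rerun --failed` so only the jobs that actually failed are retried.

```yaml
# Hypothetical helper: retry failed jobs of a watched workflow, e.g. after a
# runner was lost mid-job. Adjust the watched workflow name to your setup.
name: rerun-lost-runner-jobs
on:
  workflow_run:
    workflows: ["CI"]                      # assumption: the workflow to watch
    types: [completed]
permissions:
  actions: write                           # required for "gh run rerun"
jobs:
  rerun:
    # Only retry the first attempt so a genuinely broken build doesn't loop forever.
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      github.event.workflow_run.run_attempt == 1
    runs-on: ubuntu-latest
    steps:
      - run: gh run rerun ${{ github.event.workflow_run.id }} --failed
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}   # tells gh which repo to act on
```

Because of `--failed`, jobs that already succeeded in the run are not repeated.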
I downgraded Ubuntu from 22.04 LTS to 20.04 LTS and the workflow is no longer exhausting anything. |
I think I have not yet encountered this myself, but I believe any job on self-hosted GitHub runners is subject to this error due to the race condition between the runner agent and GitHub.
This isn't specific to actions-runner-controller and I believe it's an upstream issue. But I'd still like to gather voices and knowledge around it and hopefully find a work-around.
Please see the related issues for more information.
This issue is mainly to gather experiences from whoever has been affected by the error. I'd appreciate it if you could share your stories, workarounds, fixes, etc. around the issue so that it can ideally be fixed upstream or in actions-runner-controller.
Verifying if you're affected by this problem
Note that the error can also happen when your runner pod or node runs out of resources (for example CPU or memory exhaustion, or an OOM kill). If you encounter the error even after tweaking your pod and node resources, it is likely due to the race between the runner agent and GitHub.
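If you suspect resource pressure rather than the race, explicit requests and limits on the runner pod are a quick way to rule it out. A minimal sketch, assuming your actions-runner-controller version exposes the standard Kubernetes `resources` field on the runner template; the values are placeholders to tune for your workloads.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example/myrepo   # placeholder
      # Enough headroom that the runner isn't OOM-killed or evicted mid-job,
      # which also surfaces as "lost communication with the server".
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```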
Information

… and `--once` are the go-to solutions, but I believe both are subject to this race condition issue.

Possible workarounds

Removing the `--once` flag from `run.sh` may "alleviate" this issue, but not completely.
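In actions-runner-controller terms, the closest equivalent of dropping `--once` appears to be marking the runner as non-ephemeral, since (per the pull request comment above) `ephemeral: true` is what makes the controller pass `--once`. A sketch, assuming the `ephemeral` field is available in your version; note this trades the race for long-lived, reused runners.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-persistent-runnerdeploy
spec:
  replicas: 2
  template:
    spec:
      repository: example/myrepo   # placeholder
      # Persistent runner: the controller should no longer start run.sh with --once,
      # avoiding the race at the cost of reusing the runner between jobs.
      ephemeral: false
```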