Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to unconfig runner before deleting the pod to recreate #1125

Merged
merged 2 commits into from
Feb 19, 2022

Conversation

TingluoHuang
Copy link
Member

@TingluoHuang TingluoHuang commented Feb 16, 2022

There is a race condition between ARC and GitHub service about deleting runner pod.

  • The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod.
  • A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod.
  • The ARC delete the runner pod which cause the in-progress job to end up canceled.

To avoid this race condition, I am calling r.unregisterRunner() before deleting the pod.

  • r.unregisterRunner() will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod.
  • r.unregisterRunner will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is.

TODO: I need to do some E2E tests to force the race condition to happen.

Ref #911

@mumoshu
Copy link
Collaborator

mumoshu commented Feb 17, 2022

@TingluoHuang Hey! Thanks a lot for your contribution.

I completely forgot the fact that the deletion of a runner on scale down is usually handled by the processRunnerDeletion function in the same Reconcile process, rather than where you patched.

processRunnerDeletion works in concert with Kubernetes's built-in functionality that cascade-deletes resources. It firstly unregisters the runner, then removes the k8s resource finalizer(which is a marker that prevents K8s from cascade-deletion), so that after the reconciliation success, K8s detects the runner has no finalizer and commit cascade-deleting the runner and the runner pod.

Your patch would instead fix a race issue when the runner pod needs to be recreated (due to token update, possible registration timeout, recreating the ephemeral runner's pod that stopped by itself). It's still a nice thing to have so I'm definitely going to merge this soon!

But note that this fix isn't perfect can be incomplete, as we apparently saw the usual runner pod termination process(=processRunnerDeletion), which is already handled the way you do in this PR, doesn't seem like race-free.

That said, after this PR gets merged, we might need to implement a much longer grace period between the time ARC unregisters the runner and it deletes the runner pod, as we've been discussing at #1085 (comment)

@mumoshu mumoshu changed the title Try to unconfig runner before delete the pod. Try to unconfig runner before deleting the pod to recreate Feb 17, 2022
@mumoshu
Copy link
Collaborator

mumoshu commented Feb 19, 2022

But note that this fix isn't perfect can be incomplete, as we apparently saw the usual runner pod termination process(=processRunnerDeletion), which is already handled the way you do in this PR, doesn't seem like race-free.

I noticed that this can be my mistake. If we can safely assume that GitHub Actions can guarantee that no runner will get a new job assigned after a successful RemoveRunner call, this can be the enough and sufficient fix for the race issue.

cc/ @jbkc85

@mumoshu mumoshu merged commit 0b9bef2 into actions:master Feb 19, 2022
@TingluoHuang TingluoHuang deleted the users/tihuang/racefix branch February 20, 2022 03:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants