Fix runner controller to not recreate restarting ephemeral runner #1085
Conversation
@mumoshu this looks good - and is close to changes I made on my local branch... however I don't believe it's the only place where this can happen. Honestly it looks like there is duplicate logic in all three of these controllers, where we are blasting the GitHub API to constantly check for registration. This is where the caching of API calls comes into play, but also note that the registration timeout is actually set to a different time on the pod_controller vs the others (10 vs 15m, I believe).
Furthermore, one issue to point out is the possibility of registration breaking due to not refreshing the token. I originally had a change in place that helped with the restarts - though it didn't necessarily fix it:
However, this prevented the pods from properly restarting with new tokens, and eventually some were stuck in a CrashLoop due to the registration tokens being invalid/expired. I am not sure yet where this condition came from... just something I noticed and had to work around.
@jbkc85 Hey! Thank you so much for your detailed feedback. I'm still carefully comparing what's implemented in the code against your points, but as a starter-
That said, I think they shouldn't affect your scenario. But I'll definitely keep looking deeper so that we won't miss anything. Please keep posting your feedback/concerns/etc. I highly appreciate it. Thank you!
@jbkc85 This can be a bug that is unrelated to your change! The controller is intended to refresh the runner registration token used by the runner agent and then compare the pod template hash stored in the pod annotation to see if it needs to recreate the pod (to reflect any updates to the pod). The reason I think this can be a bug is that it doesn't take the runner token into account when calculating the pod template hash, even though the hash needs to change when the token is updated so that the update gets reflected! We need to add a new unit test to verify this bug first, but I'm almost confident it is one.
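To illustrate the idea, here is a minimal sketch of the kind of fix being described, assuming a hypothetical `podTemplateHash` helper; the function name and hashing details are placeholders, not ARC's actual implementation:

```go
package controllers

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// podTemplateHash is a hypothetical helper, not ARC's actual code. It mixes
// the registration token into the hashed input so that a refreshed token
// changes the hash and therefore triggers pod recreation.
func podTemplateHash(template corev1.PodTemplateSpec, registrationToken string) (string, error) {
	data, err := json.Marshal(template)
	if err != nil {
		return "", err
	}
	h := sha256.New()
	h.Write(data)
	h.Write([]byte(registrationToken))
	return hex.EncodeToString(h.Sum(nil)), nil
}
```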
I'm going to try to merge this into my existing hotfix code and try it out over the weekend! Currently things are working a little better - the only issue is that upon reconciliation we still have runners that are busy being marked for deletion. I am starting to think this is the process:
Now, on #3, the runner_controller STILL checks to see if it's busy (which is another issue with GitHub API calls being spammed, btw), and has some potential to undo the damage... but I have seen the controllers log something as busy and then turn around and delete it nonetheless, which I am still trying to sort out. But as for this code, I think it's a definite step in the right direction - it just also needs to be included in runnerreplicaset_controller.go!
Thing is, after the pod was actually deleted, the runner was STILL busy... I am not entirely sure why the code would think otherwise.
@jbkc85 Hey! I greatly appreciate your detailed feedback.
That did result in some workflow jobs being canceled prematurely, right? After rethinking all the problems we've discussed recently, I've started to wonder whether we've spotted a fundamental issue in either ephemeral runners and/or how we handle ephemeral runners 🤔
ARC assumes an ephemeral runner's pod can be terminated "at any time": if it's running a job, it should gracefully stop. But what you observed seems to be the opposite - ARC deleted a runner pod whose runner process within the runner container was about to restart; the runner agent process did restart and reported its runner status, and the pod was deleted anyway.
So there clearly seems to be a race condition between the runner pod deletion and the ephemeral runner's restarting process. Does that mean we can't expect the actions/runner process to gracefully stop at any time? Probably.
That said, there might be two ways to move forward:
WDYT?
Talking to customers, most of the ephemeral users that I'm hearing from don't want auto-updates… I'd like to start shipping the runner in a container for the k8s users so that you can update the image instead of needing to do it at runtime, which I think would help with this situation. But we haven't done that yet.
I think that this is not possible? If you try to unregister a runner that has started running a job…
Also - just for another data point - I think that we've seen this even when there is no self-update involved.
Thanks for sharing! I was wondering how we can justify not supporting auto-update. Perhaps it's OK if, e.g., only 20% or fewer of users want auto-updates? 😄
Good to know! I'll try to reread the API reference and give it a try when possible. Another concern for the third option is that adding a delete-runner API call on scale down might result in ARC reaching the API rate limit earlier. I think we've already been suffering from the limit, so making it worse isn't the best?
Thanks again for sharing! Yeah, your observation aligns with my theory. The scenario @jbkc85 has shared doesn't seem to involve self-update. The race issue might be in the…
I actually don't mind this. I like the following:
I like this scenario personally, but it also depends on our ability to cache GitHub API responses for ListRunners, given how overwhelming those calls already are (105 runners is the most we can run, and we still hit GitHub API rate limits).
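As an illustration of the caching being discussed, here is a minimal sketch of a TTL cache in front of a list-runners call; the cache type, its fields, and the `Runner` stand-in type are all hypothetical, not ARC's implementation:

```go
package runnercache

import (
	"sync"
	"time"
)

// Runner is a stand-in for whatever runner type the real GitHub client returns.
type Runner struct {
	Name string
	Busy bool
}

// runnersCache memoizes the result of an expensive "list runners" call so that
// repeated reconciliations within the TTL window don't hit the GitHub API again.
type runnersCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	fetchedAt time.Time
	runners   []Runner
}

// List returns the cached runners if they are still fresh, otherwise it calls
// fetch (e.g. the real ListRunners API call) and stores the result.
func (c *runnersCache) List(fetch func() ([]Runner, error)) ([]Runner, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.runners != nil && time.Since(c.fetchedAt) < c.ttl {
		return c.runners, nil
	}

	runners, err := fetch()
	if err != nil {
		return nil, err
	}
	c.runners = runners
	c.fetchedAt = time.Now()
	return runners, nil
}
```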
Correct. My scenario is 100% a race condition, it seems: the horizontal autoscaler automatically marks a pod as deleted, and then... though it sticks around for a little bit, it's eventually cleaned up before it should be. I think this can be solved by the 'counter' or 'timer' approach mentioned above.
@jbkc85 Hey! Thanks a lot for your confirmation and feedback. I don't fully understand step 2 of the ideal process you've proposed, as you'd need to remove all the runner labels (not pod labels) to prevent GitHub from scheduling any workflow job onto it. Also, the default…
That said, I think I got the gist of your idea. I suppose what you're trying to do with step 2 is basically "stop scheduling any jobs onto the runner first, before trying to delete the runner", right? Then we can probably spam…
Steps 1 and 3 are interesting. Your idea for those two steps is to delete the runner only after some "grace period", right? In combination with step 2, it would give us something like "stop scheduling any jobs onto the runner, wait for some grace period, and delete the runner only after that". My understanding is that we might need either step 2 only, or steps 1+3, not both. But I'll definitely put more thought into your idea. Thanks again for your feedback!
@jbkc85 In relation to my take of "atomic deletion of a runner" #1087 (comment), I believe either
can be a good workaround to simulate atomic deletion of the runner. We might need to poll the list-runners and delete-runner APIs very often, which is why I'd call the former option a workaround. That said, I still think my 1st option #1085 (comment) is the ideal solution. But if it turns out not to be feasible, the only way forward would be one of these two.
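For reference, a rough sketch of that workaround, assuming go-github's Actions service is used directly (method signatures here follow go-github v39 and may differ in other versions; the function itself is hypothetical, not ARC's code):

```go
package controllers

import (
	"context"
	"fmt"

	"github.com/google/go-github/v39/github"
)

// unregisterIfIdle lists the repository's runners and, if the named runner
// exists and is not busy, removes its registration before the pod is deleted.
// This approximates "atomic deletion" by refusing to act on a busy runner.
func unregisterIfIdle(ctx context.Context, client *github.Client, owner, repo, runnerName string) error {
	runners, _, err := client.Actions.ListRunners(ctx, owner, repo, &github.ListOptions{PerPage: 100})
	if err != nil {
		return err
	}
	for _, r := range runners.Runners {
		if r.GetName() != runnerName {
			continue
		}
		if r.GetBusy() {
			return fmt.Errorf("runner %s is busy; retry later", runnerName)
		}
		_, err := client.Actions.RemoveRunner(ctx, owner, repo, r.GetID())
		return err
	}
	// Runner is not registered (anymore); nothing to do.
	return nil
}
```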
https://docs.github.com/en/rest/reference/actions#remove-a-custom-label-from-a-self-hosted-runner-for-an-organization - all we have to do is add an option to allow for the removal of a specific label and document the ability to do so. I am not sure it's the ideal solution, but I do like the steps mentioned before... put a runner in a grace period for deletion, remove the runner from being a schedulable resource, and then delete it after the grace period.
Yes, a grace period almost exactly like the period of time ARC uses for the RegistrationDidTimeout check. Once again, I think it would work really well combined with removing a label.
So I originally submitted a PR from my own personal repo, but I am using ARC at my current employer and have started a forked repository there. In that PR I actually tried to abstract all of the GitHub runner calls out of the controllers by using a…
Hey @mumoshu - curious what you're thinking about how we should solve this moving forward. 🤔
@mumoshu - a few updates:
Both of these have worked decently well, though they have not resolved all of our issues just yet. I am continuing to monitor, but with no. 1 we have eliminated a lot of our GitHub API limitations, and no. 2 has increased the stability of our runners - and I plan on testing it out with a much smaller scale-down period this week!
Right. I do wonder how this is going to interact with the GitHub runner though - I think it will work as intended, but it will need a lot of testing to see if it's truly not going to interrupt anything going on! Also, on another note: https://github.com/INNOSOLProject/actions-runner-controller/pull/2 doesn't work as well as intended. The reconciliation process for runnerreplicaset_controller.go happens more sporadically than expected, so I swapped the code to use a deletion timestamp, which works much like the registrationDidTimeout timestamp:

```go
for i := 0; i < n; i++ {
	if deletionCandidate, found := r.DeletionCache[deletionCandidates[i].Name]; found {
		deletionTimer := 20 * time.Minute
		if time.Now().After(deletionCandidate.FirstAdded.Add(deletionTimer)) {
			if err := r.Client.Delete(ctx, &deletionCandidates[i]); client.IgnoreNotFound(err) != nil {
				log.Error(err, "Failed to delete runner resource")
				return ctrl.Result{}, err
			}
			r.Recorder.Event(&rs, corev1.EventTypeNormal, "RunnerDeleted", fmt.Sprintf("Deleted runner '%s'", deletionCandidates[i].Name))
			log.Info(fmt.Sprintf("[DELETION_CANDIDATE] Deleted runner '%s'", deletionCandidates[i].Name))
			delete(r.DeletionCache, deletionCandidates[i].Name)
		} else {
			log.Info(fmt.Sprintf("[DELETION_CANDIDATE] Runner '%s' is scheduled for deletion at %s", deletionCandidates[i].Name, deletionCandidate.FirstAdded.Add(deletionTimer)))
		}
	} else {
		r.DeletionCache[deletionCandidates[i].Name] = RRSReconcilerDeletionCache{
			FirstAdded: metav1.Now(),
			Time:       metav1.Now(),
		}
	}
}
```

I expect to report a few results here in a bit once I have this code pushed up and running!
PS: I do realize this is duplicated code - I was just trying to get a test candidate out there rather than make things look pretty... and here we are, haha. It's out in 'production' right now, so I am still waiting on the results.
@mumoshu current GitHub client code:

```go
type RunnerLabelResponse struct {
	Count  int                   `json:"total_count"`
	Labels []github.RunnerLabels `json:"labels"`
}

// RemoveRunnerLabel removes an existing label from a GitHub runner.
func (c *Client) RemoveRunnerLabel(ctx context.Context, repo string, runnerID int64, label string) error {
	// Upstream helper being used below:
	// func (c *Client) NewRequest(method, urlStr string, body interface{}) (*http.Request, error)
	owner, repository, err := splitOwnerAndRepo(repo)
	if err != nil {
		return fmt.Errorf("unable to split owner/repo from %s: %w", repo, err)
	}

	uri := fmt.Sprintf(
		"/repos/%s/%s/actions/runners/%d/labels/%s",
		owner, repository, runnerID, label,
	)

	req, err := c.NewRequest(http.MethodDelete, uri, nil)
	if err != nil {
		return err
	}

	// Pass a pointer so the response body can actually be decoded into it.
	response := &RunnerLabelResponse{}
	if _, err := c.Do(ctx, req, response); err != nil {
		return err
	}

	return nil
}
```

They don't have a specific call for this against GitHub runners, so the above is what I am trying. Right now, in testing, I am defaulting to a hard-coded label value. Any thoughts so far?
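For context, a hypothetical call site for the helper above, assumed to live in the same package as the `Client`; the label name, repo string, and runner ID handling are placeholder assumptions, not ARC's actual wiring:

```go
package github

import "context"

// markRunnerUnschedulable is a sketch of how the helper above might be used:
// removing a label first should stop GitHub from scheduling new jobs onto the
// runner while it waits out its deletion grace period. The "self-hosted-extra"
// label is a placeholder for whatever label the runner was registered with.
func markRunnerUnschedulable(ctx context.Context, c *Client, repo string, runnerID int64) error {
	return c.RemoveRunnerLabel(ctx, repo, runnerID, "self-hosted-extra")
}
```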
Ah, yes. The runnerreplicaset controller triggers a reconcile on a runnerreplicaset resource whenever it or one of its child resources has any change (this is how a K8s controller generally works), or whenever the sync period has passed (you can configure this once via…). So counting the number of reconciliations doesn't reliably delay an operation. The way you used…
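A minimal sketch of the controller-runtime idiom for delaying an operation reliably across reconciliations, rather than counting reconcile invocations (the function, its parameters, and the grace-period handling are illustrative placeholders, not ARC's code):

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileDeletionCandidate records when a runner first became a deletion
// candidate and uses RequeueAfter so controller-runtime calls Reconcile again
// once the grace period has elapsed, instead of relying on unrelated events
// or sync-period ticks.
func reconcileDeletionCandidate(ctx context.Context, c client.Client, obj client.Object, firstAdded time.Time, gracePeriod time.Duration) (ctrl.Result, error) {
	if remaining := gracePeriod - time.Since(firstAdded); remaining > 0 {
		// Not yet: ask for another reconcile after the remaining time.
		return ctrl.Result{RequeueAfter: remaining}, nil
	}

	// Grace period is over; it is now safe to delete the runner resource.
	if err := c.Delete(ctx, obj); client.IgnoreNotFound(err) != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```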
Looks good so far! To be extra sure - where are you going to hook that RemoveRunnerLabel function into? The runnerreplicaset-controller, the runner-controller, or both?
My theory was that the earlier you remove labels from a non-busy runner (or unregister it, whatever works is ok), the less likely the runner is to race for accepting workflow jobs before the pod is finally terminated, which means the runner pod can be terminated safely and earlier. So hooking your code into the runnerreplicaset_controller seems right at a glance. But there may be some folks who use…
I hope we can implement it in a way where the runnerreplicaset controller just notifies the runner controller through a runner resource annotation or field. Once notified, the runner controller immediately checks the runner status, and if and only if it's not busy, it removes labels / unregisters the runner and starts a timer for the pod deletion. This way the runnerreplicaset controller doesn't need to know how to check runner status, remove labels, etc. Thoughts?
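To make that hand-off concrete, here is a rough sketch of the annotation-based notification, with a made-up annotation key and a generic object standing in for the Runner resource (the real design might use a status field instead):

```go
package controllers

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// unregisterRequestedAtAnnotation is a hypothetical annotation key used only
// in this sketch.
const unregisterRequestedAtAnnotation = "example.com/unregister-requested-at"

// requestUnregistration sketches the runnerreplicaset-controller side of the
// hand-off: it only annotates the Runner resource. Checking busy-ness,
// removing labels or unregistering, and starting the deletion timer would all
// stay in the runner controller, which reacts to this annotation.
func requestUnregistration(ctx context.Context, c client.Client, runner client.Object) error {
	annotations := runner.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	if _, alreadyRequested := annotations[unregisterRequestedAtAnnotation]; alreadyRequested {
		return nil
	}
	annotations[unregisterRequestedAtAnnotation] = time.Now().UTC().Format(time.RFC3339)
	runner.SetAnnotations(annotations)
	return c.Update(ctx, runner)
}
```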
100% - that's where I copied it from! I figured it was a good use case here, and so far it has proven to be a valuable addition. There are still a few cancellations and race conditions, but I am trying to work through them.
I think that's a better path, actually. Having a 'deletionTimestamp' field that the runner_controller looks at would be perfect for avoiding duplicated code and centralizing the 'deletion' of an asset in one controller. I started in the runnerreplicaset controller simply because that's what we are using and that's what I knew, but I wouldn't mind the code moving to a more proper place. My original thought was to add a lot of this to the runner resource - including the GitHub status (IsBusy, IsOffline, etc.) - simply because it shouldn't need to be checked by other controllers; they should just pull in the runner resource and verify there. I just haven't gotten to that because I wasn't 100% familiar with how the k8s operator framework works, nor how to add annotations appropriately to the models.
My thoughts for…
deletionScheduled can be an…
EDIT: deletionScheduled might be a two-parter - do we want the status to include which controller is prompting the deletion of the runner?
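To illustrate the "two-parter" idea, here is a hypothetical status sub-struct; the field names, types, and placement are placeholders rather than ARC's actual API:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// RunnerDeletionSchedule is a hypothetical status sub-struct capturing both
// parts of the idea: when the runner may actually be deleted, and which
// controller requested the deletion.
type RunnerDeletionSchedule struct {
	// DeleteAt is the time after which the runner may be deleted
	// (i.e. the end of the grace period).
	DeleteAt metav1.Time `json:"deleteAt,omitempty"`
	// RequestedBy records which controller scheduled the deletion, e.g.
	// "runnerreplicaset-controller" or "horizontalrunnerautoscaler".
	RequestedBy string `json:"requestedBy,omitempty"`
}

// RunnerStatusSketch shows where such a field could hang off a Runner's
// status; the real RunnerStatus fields are omitted here.
type RunnerStatusSketch struct {
	DeletionScheduled *RunnerDeletionSchedule `json:"deletionScheduled,omitempty"`
}
```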
Apparently, we've missed taking an updated registration token into account when generating the pod template hash, which is used to detect whether the runner pod needs to be recreated. This shouldn't have been the end of the world, since the runner pod is recreated on the next reconciliation loop anyway, but this change makes the pod recreation happen one reconciliation loop earlier, so you're less likely to get runner pods with outdated registration tokens. Ref #1085 (comment)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is a potential fix for #911 based on my interpretation of the problem observed by @jbkc85.
In case you've seen ARC abruptly terminate an ephemeral runner that just got restarted, this might be the fix.