
Fix runner controller to not recreate restarting ephemeral runner #1085

Closed
wants to merge 3 commits

Conversation

@mumoshu (Collaborator) commented Feb 1, 2022

This is a potential fix for #911 based on my interpretation of the problem observed by @jbkc85.

In case you've seen ARC abruptly terminate an ephemeral runner that just got restarted, this might be the fix.

@jbkc85 commented Feb 1, 2022

@mumoshu this looks good - and is close to the changes I made on my local branch. However, I don't believe it's the only place where this can happen.

https://github.com/actions-runner-controller/actions-runner-controller/blob/01301d3ce808b00422f2e78014584929d85470b2/controllers/runnerreplicaset_controller.go#L195

https://github.com/actions-runner-controller/actions-runner-controller/blob/c64000e11c95163585db9651354da0f97b06ea80/controllers/runner_pod_controller.go#L287

Honestly it looks like there is duplicate logic in all three of these controllers - where we are blasting the GitHub API to constantly check for registration. This is where caching of API calls comes into play, but also note that the registration timeout is actually set to a different value in the pod controller vs. the others (10 vs. 15 minutes, I believe).
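For illustration only, here is a minimal sketch of what a single shared helper could look like, so that all three controllers agree on one timeout value; the names (DefaultRegistrationTimeout, registrationTimedOut) are hypothetical, not ARC's actual code.

package controllers

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DefaultRegistrationTimeout keeps the timeout in one place so the runner,
// runnerreplicaset, and runner pod controllers cannot drift apart.
const DefaultRegistrationTimeout = 15 * time.Minute

// registrationTimedOut reports whether a runner created at creationTimestamp has
// exceeded the registration timeout without registering itself to GitHub.
func registrationTimedOut(creationTimestamp metav1.Time, timeout time.Duration, now time.Time) bool {
	if timeout <= 0 {
		timeout = DefaultRegistrationTimeout
	}
	return now.After(creationTimestamp.Add(timeout))
}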

@jbkc85 commented Feb 1, 2022

Furthermore, one issue to point out is the possibility of registration breaking due to not refreshing the token. I originally had this in, which helped with the restarts, though it didn't necessarily fix them:

func runnerRegistrationTimedOut(runner *gogithub.Runner, creationTimeStamp metav1.Time) (bool, string) {
	registrationTimeout := 15 * time.Minute
	currentTime := time.Now()
	registrationDidTimeout := currentTime.Sub(creationTimeStamp.Add(registrationTimeout)) > 0
	if registrationDidTimeout {
		reason := fmt.Sprintf(`Runner %s failed to register itself to GitHub in timely manner. Marking the runner for scale down.
			CAUTION: If you see this a lot, you should investigate the root cause. See https://github.com/actions-runner-controller/actions-runner-controller/issues/288
			runnerCreationTimestamp %s | currentTime %s | configuredRegistrationTimeout %s`,
			*runner.Name, creationTimeStamp, currentTime, registrationTimeout,
		)
		// not registered and registration time out - runner should be marked for deletion
		return true, reason
	} else {
		// not registered but no registration timeout - do not mark for deletion
		return false, ""
	}
}
	if registrationDidTimeout {
		if runner.Status.Phase == "Running" && time.Now().Before(runner.Status.Registration.ExpiresAt.Time) {
			log.Info("Runner " + runner.Name + " has not been found and triggered a registration timeout, however pod appears to still be running - not restarting.")
		} else {
			log.Info(reason)
			log.Info("Runner " + runner.Name + " has not been found and triggered a registration timeout - restarting.")
			restart = true
		}
	}
However, this prevented the pods from properly restarting with new tokens, and eventually some were stuck in a CrashLoop due to the registration tokens being invalid/expired. I am not sure yet where this condition came from - just something I noticed, and I had to use for POD in $(kubectl get po | grep Crash | awk -F' ' '{print $1}'); do kubectl delete po $POD; done to stop it. It was worth it compared to the constant crashing of active pods, though!

@mumoshu (Collaborator, Author) commented Feb 2, 2022

@jbkc85 Hey! Thank you so much for your detailed feedback. I'm still carefully comparing what's implemented in the code against your points, but as a starter:

Honestly it looks like there is duplicate logic in all three of these controllers

runnerreplicaset_controller.go's registration timeout comes into play when it prioritizes seemingly timed-out runners to be deleted first on scaling down the replica set. (But the runner deletion/runner agent stop is deferred until the in-progress workflow job completes anyway.)

runner_pod_controller.go's registration timeout logic is used only for RunnerSets (our StatefulSet variant of RunnerDeployment).

That said, I think they shouldn't affect your scenario. But I'll definitely keep looking deeper so that we won't miss anything.

Please keep posting your feedback/concerns/etc. I highly appreciate it. Thank you!

@mumoshu (Collaborator, Author) commented Feb 2, 2022

the possibility of registration breaking due to not refreshing the token

@jbkc85 This can be a bug that is unrelated to your change!

It's intended to refresh the runner registration token used by the runner agent's config.sh every 24h (but we don't control the exact TTL; I just remember that GitHub returned us a token valid for 24h when I last checked), by comparing and updating the registration token stored in runner.status and propagating it to the runner pod's env var:

https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/runner_controller.go#L752-L753

and finally comparing the pod template hash stored in the pod annotation to see if it needs to recreate the pod (to reflect any updates to the pod):

https://github.com/actions-runner-controller/actions-runner-controller/blob/b652a8f9ae4d40d8ea525414c56ef00efe6474a2/controllers/runner_controller.go#L287-L294

The reason I think this can be a bug is that it doesn't take the runner token into account when calculating the pod template hash, even though the hash needs to change when the token is updated, for the update to be reflected!

https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/runner_controller.go#L624-L629

We need to add a new unit test to verify this bug first, but I'm almost certain it is one.
Thank you for pointing it out!
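For illustration only, a rough sketch of how the registration token could be folded into the hash input so that a token refresh forces pod recreation; hashPodTemplate and its FNV-based hashing are illustrative assumptions, not ARC's actual hash implementation.

package controllers

import (
	"fmt"
	"hash/fnv"

	corev1 "k8s.io/api/core/v1"
)

// hashPodTemplate folds both the pod template spec and the current registration
// token into a single hash, so that a refreshed token changes the hash and the
// runner pod gets recreated on the next reconciliation.
func hashPodTemplate(template corev1.PodTemplateSpec, registrationToken string) string {
	h := fnv.New32a()
	// Without the token in the input, a token refresh leaves the hash unchanged
	// and the stale pod is kept until something else changes.
	fmt.Fprintf(h, "%v", template.Spec)
	fmt.Fprintf(h, "%s", registrationToken)
	return fmt.Sprintf("%x", h.Sum32())
}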

@jbkc85 commented Feb 4, 2022

I'm going to try to merge this into my existing hotfix code and try it out over the weekend! Currently things are working a little better - the only issue is that upon reconciliation we still have runners that are busy being marked for deletion. I am starting to think this is the process:

  1. runnerreplicaset_controller.go looks for scale down opportunities
  2. runnerreplicaset_controller.go finds ephemeral pod and marks it for deletion, even though it just hasn't had a chance to re-register
  3. runner_controller.go finds the deletion metadata and scrubs it from the system

Now, in step 3, the runner_controller STILL checks to see if it's busy (which is another issue, with GitHub API calls being spammed, btw) and has some potential to prevent the damage... but I have seen the controllers log something as busy, then turn around and delete it nonetheless, which I am still trying to sort out. For this code, I think it's a definite step in the right direction - but it also needs to be included in runnerreplicaset_controller.go!

1.6440073831827211e+09	INFO	actions-runner-controller.runner	Pod runners-cwtsk-9z9bf is not scheduled to restart.  Status - offline: false, registrationTimeout: true, hash: false, busy: false	{"runner": "default/runners-cwtsk-9z9bf"}
1.6440073927061121e+09	INFO	actions-runner-controller.runnerreplicaset	Runner runners-cwtsk-9z9bf is not busy (status not returned) - MARKED FOR DELETION	{"runnerreplicaset": "default/runners-cwtsk"}
1.6440073930604913e+09	DEBUG	events	Normal	{"object": {"kind":"RunnerReplicaSet","namespace":"default","name":"runners-cwtsk","uid":"5ab59880-f3ab-4bc6-9534-753653efa3eb","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"6142129"}, "reason": "RunnerDeleted", "message": "Deleted runner 'runners-cwtsk-9z9bf'"}
1.6440074120252006e+09	ERROR	controller.runner-controller	Reconciler error	{"reconciler group": "actions.summerwind.dev", "reconciler kind": "Runner", "name": "runners-cwtsk-9z9bf", "namespace": "default", "error": "runner is busy"}
1.6440074126694224e+09	INFO	actions-runner-controller.runnerreplicaset	Runner runners-cwtsk-9z9bf is not busy (status not returned) - MARKED FOR DELETION	{"runnerreplicaset": "default/runners-cwtsk"}
1.6440074131829991e+09	DEBUG	events	Normal	{"object": {"kind":"RunnerReplicaSet","namespace":"default","name":"runners-cwtsk","uid":"5ab59880-f3ab-4bc6-9534-753653efa3eb","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"6142129"}, "reason": "RunnerDeleted", "message": "Deleted runner 'runners-cwtsk-9z9bf'"}
1.6440074219523563e+09	ERROR	controller.runner-controller	Reconciler error	{"reconciler group": "actions.summerwind.dev", "reconciler kind": "Runner", "name": "runners-cwtsk-9z9bf", "namespace": "default", "error": "runner is busy"}

"is not busy (status not returned)" is a log message I added at https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/runnerreplicaset_controller.go#L217

Thing is, after the pod was actually deleted - the runner was STILL busy...I am not entirely sure why the code would think otherwise.

@mumoshu (Collaborator, Author) commented Feb 5, 2022

@jbkc85 Hey! I greatly appreciate your detailed feedback ☺️

Thing is, after the pod was actually deleted - the runner was STILL busy

That did result in some workflow jobs being canceled prematurely, right?

After rethinking all the problems we've discussed recently, I've come to wonder whether we might have spotted a fundamental issue in either ephemeral runners and/or how we handle ephemeral runners 🤔

ARC assumes that an ephemeral runner's pod can be terminated "at any time": if it's running a job, it should gracefully stop.

But what you observed seems to be the opposite: ARC deleted a runner pod whose runner process within the runner container was about to restart; the runner agent process did restart, reporting a runner status of busy, but the process, container, and runner pod were terminated without waiting for the runner agent process to gracefully stop.

So there clearly seems to be a race condition between the runner pod deletion and the ephemeral runner's restart process. Does that mean we can't expect the actions/runner process to gracefully stop at any time? Perhaps actions/runner doesn't support graceful stop well when it's an ephemeral runner and/or it's about to restart?

That said, there are probably a few ways to move forward:

  1. Fix actions/runner so that it can gracefully stop while it's about to restart

  2. Work around it in ARC in two parts: (1) ARC should disable auto-update by passing --disable-autoupdate to actions/runner, so that (2) we can get rid of runsvc.sh, which in turn prevents an ephemeral runner from restarting. By preventing the ephemeral runner from restarting, we won't be affected by this race condition.

  3. (Imperfect) Modify ARC to unregister the runner "ASAP" before deleting the runner pod. But this might still be affected by a race condition with actions/runner about to restart (say if ARC unregistered the runner after the runner agent reported "busy" status, it might fail prematurely).

Option 2 looks good at a glance, but it is a no-go for users who want runner auto-update - removing runsvc.sh results in an outdated runner falling into an infinite loop of restarting due to auto-update, which is why we introduced runsvc.sh two years ago 😭

WDYT?

@ethomson commented Feb 5, 2022

Option 2 looks good at a glance, but it is a no-go for users who want runner auto-update

Talking to customers, most of the ephemeral users that I’m hearing from don’t want auto-updates… I’d like to start shipping the runner in a container for the k8s users so that you can update the image instead of needing to do it at runtime, which I think would help with this situation. But we haven’t done that yet.

if ARC unregistered the runner after the runner agent reported "busy" status

I think that this is not possible? If you try to unregister a runner that has started running a job (has busy set to true) then your DELETE api call will fail with an HTTP 422.
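For illustration, a sketch of what the "just try to delete it and treat a 422 as busy" behavior could look like from ARC's side; go-github v39 is assumed here only because ARC already depends on go-github, and the helper name is hypothetical.

package controllers

import (
	"context"
	"net/http"

	"github.com/google/go-github/v39/github"
)

// tryUnregisterRunner attempts to delete the runner registration and reports
// whether GitHub refused with a 422 because the runner is busy running a job.
func tryUnregisterRunner(ctx context.Context, client *github.Client, owner, repo string, runnerID int64) (busy bool, err error) {
	resp, err := client.Actions.RemoveRunner(ctx, owner, repo, runnerID)
	if err != nil {
		if resp != nil && resp.StatusCode == http.StatusUnprocessableEntity {
			// 422: the runner has picked up a job; leave it alone and retry later.
			return true, nil
		}
		return false, err
	}
	return false, nil
}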

@ethomson commented Feb 5, 2022

Also - just for another datapoint - I think that we’ve seen this even when there is no self update involved.

@mumoshu (Collaborator, Author) commented Feb 6, 2022

Talking to customers, most of the ephemeral users that I’m hearing from don’t want auto-updates…

Thanks for sharing! I was wondering how we can justify not supporting auto-update. Perhaps it's OK if, e.g., 20% or fewer users want auto-updates? 😄

I think that this is not possible? If you try to unregister a runner that has started running a job (has busy set to true) then your DELETE api call will fail with an HTTP 422.

Good to know! I'll try to reread the API reference and give it a try when possible.

Another concern with the third option is that adding a delete-runner API call on scale-down might result in ARC reaching the API rate limit earlier. I think we've already been suffering from the limit, so making it worse isn't the best?

Also - just for another datapoint - I think that we’ve seen this even when there is no self update involved.

Thanks again for sharing! Yeah, your observation aligns with my theory. The scenario @jbkc85 has shared doesn't seem to involve self update. The race issue might be in the actions/runner's termination process. Not yet sure if a termination, while a self-update process is in-progress, is affected or not.

@jbkc85 commented Feb 7, 2022

3. (Imperfect) Modify ARC to unregister the runner "ASAP" before deleting the runner pod. But this might still be affected by a race condition with actions/runner about to restart (say if ARC unregistered the runner after the runner agent reported "busy" status, it might fail prematurely).

I actually don't mind this. I like the following:

  1. Introduce a deletion 'period' in which the pod is not actually scheduled for deletion until either:
    -> A. The reconciler marks the runner as a deletion candidate and creates a counter on the pod. Depending on configuration, once the pod hits X counts, the pod is in fact marked for deletion.
    -> B. The reconciler marks the runner as a deletion candidate and records a deletion timestamp. Depending on configuration, once the pod hits X deletion duration, the pod is in fact marked for deletion.
  2. As the reconciler marks something for deletion, it can remove a label from the pod via the API. This label can be something like 'ready' - also dependent on configuration. If documented appropriately, this label can prevent further jobs from being deployed.
  3. At any time the runner is found to be 'running' and 'online' again, the counter/timer is reset. There is another counter/timer that puts the runner back into action with its tags.

I like this scenario personally, but it also depends on our ability to cache GitHub API responses for ListRunners, given how overwhelming those calls are right now (105 runners is the most we can run, and we still hit GitHub API rate limits).
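As a rough sketch of the counter/timestamp bookkeeping described in the steps above; the deletionTracker type, its field names, and the grace period handling are hypothetical, not code from ARC or the fork.

package controllers

import (
	"sync"
	"time"
)

// deletionTracker remembers when each runner was first marked as a deletion
// candidate so the actual delete only happens after a grace period.
type deletionTracker struct {
	mu          sync.Mutex
	gracePeriod time.Duration
	firstMarked map[string]time.Time
}

func newDeletionTracker(gracePeriod time.Duration) *deletionTracker {
	return &deletionTracker{gracePeriod: gracePeriod, firstMarked: map[string]time.Time{}}
}

// shouldDelete marks the runner as a deletion candidate on first sight and only
// returns true once the grace period has elapsed without a reset (step 1).
func (t *deletionTracker) shouldDelete(runnerName string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	first, ok := t.firstMarked[runnerName]
	if !ok {
		t.firstMarked[runnerName] = now
		return false
	}
	return now.Sub(first) >= t.gracePeriod
}

// reset drops the candidate when the runner is found busy or online again (step 3).
func (t *deletionTracker) reset(runnerName string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.firstMarked, runnerName)
}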

@jbkc85 commented Feb 7, 2022

Thanks again for sharing! Yeah, your observation aligns with my theory. The scenario @jbkc85 has shared doesn't seem to involve self update. The race issue might be in the actions/runner's termination process. Not yet sure if a termination, while a self-update process is in-progress, is affected or not.

Correct. My scenario seems to be 100% a race condition: the horizontal autoscaler automatically marks a pod as deleted, and though it sticks around for a little bit, it's eventually cleaned up before it should be. I think this can be solved by the 'counter' or 'timer' approach mentioned above.

@mumoshu (Collaborator, Author) commented Feb 8, 2022

@jbkc85 Hey! Thanks a lot for your confirmation and feedback.

I don't fully understand step 2 of the ideal process you've proposed, as you'd need to remove all the runner labels (not pod labels) to prevent GitHub from scheduling any workflow job onto it. Also, the default self-hosted runner label can't be removed, AFAIK, which implies we can't fully prevent jobs from scheduling onto a runner by altering labels. In addition to that, I thought there wasn't a GitHub API to update runner labels.

That said, I think I got the gist of your idea. I suppose what you're trying to do with step 2 is basically "stop scheduling any jobs onto the runner first, before trying to delete the runner", right? Then we could probably spam delete-runner API calls instead, assuming the API call succeeds if and only if the runner is not running any job. I believe that's @ethomson's idea in #1085 (comment), too.

Steps 1 and 3 are interesting. Your idea for those two steps is to delete the runner only after some "grace period", right? In combination with step 2, it would give us something like "stop scheduling any jobs onto the runner, wait for some grace period, and delete the runner only after that".

My understanding is that we might need either step 2 only, or steps 1+3, not both. But I'll definitely put more thought into your idea. Thanks again for your feedback!

@mumoshu (Collaborator, Author) commented Feb 8, 2022

@jbkc85 In relation to my take on "atomic deletion of a runner" in #1087 (comment), I believe either

can be a good workaround to simulate atomic deletion of the runner.

We might need to poll the list-runners and delete-runner APIs very often, and that's why I mark the former option as a workaround.
Similarly, the second option trades off another thing (auto-update), and that's why I mark the latter option as a workaround, too.

That said, I still think my first option #1085 (comment) is the ideal solution. But if it turns out not to be feasible, the only way forward would be one of those two.

@jbkc85 commented Feb 8, 2022

I don't fully understand step 2 of the ideal process you've proposed, as you'd need to remove all the runner labels (not pod labels) to prevent GitHub from scheduling any workflow job onto it. Also, the default self-hosted runner label can't be removed, AFAIK, which implies we can't fully prevent jobs from scheduling onto a runner by altering labels. In addition to that, I thought there wasn't a GitHub API to update runner labels.

https://docs.github.com/en/rest/reference/actions#remove-a-custom-label-from-a-self-hosted-runner-for-an-organization - all we have to do is add an option to allow for the removal of a specific label and document the ability to do so. I am not sure it's the ideal solution, but I do like the steps mentioned before: put a runner in a grace period for deletion, remove the runner from being a schedulable resource, and then delete it after the grace period.

Steps 1 and 3 are interesting. Your idea for those two steps is to delete the runner only after some "grace period", right? In combination with step 2, it would give us something like "stop scheduling any jobs onto the runner, wait for some grace period, and delete the runner only after that".

Yes, a grace period almost exactly like the period of time ARC uses for the registrationDidTimeout check. Once again, I think it would work really well with removing a label.

We might need to poll the list-runners and delete-runner APIs very often, and that's why I mark the former option as a workaround.

So I originally submitted a PR from my own personal repo, but I am using ARC at my current employer and started a forked repository there. In that PR I tried to abstract all the GitHub runner calls out of the controllers by using a runner_cache of sorts. We can eliminate a LOT of calls to the API by using this kind of concept, and it's a very simple addition from the looks of it: https://github.com/INNOSOLProject/actions-runner-controller/pull/1. I am still working through a few things, which is why I haven't re-submitted a PR to the project.
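For illustration, a minimal sketch of the caching idea, assuming a plain TTL cache in front of go-github's ListRunners; the cachedRunnerLister type, the go-github v39 import, and the TTL handling are illustrative, not the actual code in that PR.

package controllers

import (
	"context"
	"sync"
	"time"

	"github.com/google/go-github/v39/github"
)

type cachedRunnerLister struct {
	mu        sync.Mutex
	client    *github.Client
	ttl       time.Duration
	fetchedAt time.Time
	runners   []*github.Runner
}

// ListRunners returns the cached runner list while it is still fresh, and only
// hits the GitHub API once the TTL has expired, so every controller shares one
// upstream call per TTL window instead of issuing its own.
func (c *cachedRunnerLister) ListRunners(ctx context.Context, owner, repo string) ([]*github.Runner, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.fetchedAt) < c.ttl && c.runners != nil {
		return c.runners, nil
	}
	runners, _, err := c.client.Actions.ListRunners(ctx, owner, repo, &github.ListOptions{PerPage: 100})
	if err != nil {
		return nil, err
	}
	c.runners, c.fetchedAt = runners.Runners, time.Now()
	return c.runners, nil
}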

@ethomson commented

Hey @mumoshu - curious what you're thinking about how we should solve this moving forward. 🤔

@jbkc85 commented Feb 14, 2022

@mumoshu - a few updates:

  1. https://github.com/INNOSOLProject/actions-runner-controller/pull/1 was implemented where I work and it REALLY reduces the number of GitHub calls we make. As an added side effect, it actually gives runners a little bit of time as they are recycling (ephemeral runtimes) and can avoid the race condition where a runner is going through a restart when the reconciler fires away to try to delete it. If I missed anything in there, I apologize - I've been working with a few different branches of code trying to keep the PR clean!
  2. https://github.com/INNOSOLProject/actions-runner-controller/pull/2 was implemented (along with no. 1) and has worked out DECENTLY well. This gives all runners a counter of 5 (easily made configurable) prior to being deleted. At any time during reconciliation, if the deletion is no longer 'expected' due to IsBusy or other events from the GitHub API, the deletion candidate is removed from the cache.

Both of these have worked decently well, though they have not resolved all of our issues just yet. I am continuing to monitor, but with no. 1 we have eliminated a lot of our GitHub API rate-limit problems, and no. 2 has increased the stability of our runners - I plan on testing it with a much smaller scale-down period this week!

@mumoshu (Collaborator, Author) commented Feb 15, 2022

@jbkc85 Thank you so much for your efforts!

I have been busy testing #1062 and I'm planning to work on this next.

Thanks to all your awesome work, I think the only missing piece is:

an option to allow for the removal of a specific label and document the ability to do so

this one, right?

@jbkc85 commented Feb 15, 2022

an option to allow for the removal of a specific label and document the ability to do so

this one, right?

Right. I do wonder how this is going to interact with the GitHub runner, though - I think it will work as intended, but it will need a lot of testing to see if it's truly not going to interrupt anything going on!

Also on another note: https://github.com/INNOSOLProject/actions-runner-controller/pull/2 doesn't work as well as intended. The reconciliation process for the runnerreplicaset_controller.go happens more sporadically than expected, and I swapped the code to use a deletion timestamp which works much like the registrationDidTimeout timestamp.

		for i := 0; i < n; i++ {
			if deletionCandidate, found := r.DeletionCache[deletionCandidates[i].Name]; found {
				// Grace period before a deletion candidate is actually deleted.
				deletionTimer := 20 * time.Minute

				if time.Now().After(deletionCandidate.FirstAdded.Add(deletionTimer)) {
					// Grace period has elapsed - delete the runner resource for real.
					if err := r.Client.Delete(ctx, &deletionCandidates[i]); client.IgnoreNotFound(err) != nil {
						log.Error(err, "Failed to delete runner resource")

						return ctrl.Result{}, err
					}

					r.Recorder.Event(&rs, corev1.EventTypeNormal, "RunnerDeleted", fmt.Sprintf("Deleted runner '%s'", deletionCandidates[i].Name))
					log.Info(fmt.Sprintf("[DELETION_CANDIDATE] Deleted runner '%s'", deletionCandidates[i].Name))
					delete(r.DeletionCache, deletionCandidates[i].Name)
				} else {
					// Still within the grace period - just log when the deletion is due.
					log.Info(fmt.Sprintf("[DELETION_CANDIDATE] Runner '%s' is scheduled for deletion at %s", deletionCandidates[i].Name, deletionCandidate.FirstAdded.Add(deletionTimer)))
				}
			} else {
				// First time we've seen this candidate - start its grace-period clock.
				r.DeletionCache[deletionCandidates[i].Name] = RRSReconcilerDeletionCache{
					FirstAdded: metav1.Now(),
					Time:       metav1.Now(),
				}
			}
		}

I expect to report a few results here in a bit once I have this code pushed up and running!

@jbkc85 commented Feb 15, 2022

I expect to report a few results here in a bit once I have this code pushed up and running!

PS: I do realize this is duplication of code - I was trying to make things look pretty to get a test candidate out there... and here we are, haha. It's out in 'production' right now, so I am still waiting on the results.

@jbkc85 commented Feb 15, 2022

@mumoshu current GitHub code:

type RunnerLabelResponse struct {
	Count  int                   `json:"total_count"`
	Labels []github.RunnerLabels `json:"labels"`
}

// RemoveRunnerLabel removes an existing label from a GitHub repository runner via
// DELETE /repos/{owner}/{repo}/actions/runners/{runner_id}/labels/{name}.
func (c *Client) RemoveRunnerLabel(ctx context.Context, repo string, runnerID int64, label string) error {
	owner, repository, err := splitOwnerAndRepo(repo)
	if err != nil {
		return fmt.Errorf("unable to split owner/repo from %s", repo)
	}

	uri := fmt.Sprintf(
		"/repos/%s/%s/actions/runners/%d/labels/%s",
		owner, repository, runnerID, label,
	)
	req, err := c.NewRequest(http.MethodDelete, uri, nil)
	if err != nil {
		return err
	}

	response := RunnerLabelResponse{}
	// Do needs a pointer here so the remaining-labels response can be decoded.
	if _, err := c.Do(ctx, req, &response); err != nil {
		return err
	}
	return nil
}

They don't have a specific call for this on GitHub runners, so the above is what I am trying. Right now in testing I am defaulting to a hard-coded label value. Any thoughts so far?

@mumoshu (Collaborator, Author) commented Feb 16, 2022

Also on another note: https://github.com/INNOSOLProject/actions-runner-controller/pull/2 doesn't work as well as intended. The reconciliation process for the runnerreplicaset_controller.go happens more sporadically than expected, and I swapped the code to use a deletion timestamp which works much like the registrationDidTimeout timestamp.

Ah, yes. The runnerreplicaset controller triggers a reconcile on a runnerreplicaset resource whenever it or one of its child resources has any change (this is how a K8s controller generally works), or whenever the sync period has passed (you can configure this via the --sync-period flag of the ARC controller-manager).

So counting the number of reconciliations doesn't result in reliably delaying some operation. The way you used FirstAdded to implement a timeout does look familiar to me. I think I've used similar patterns everywhere throughout ARC's code-base.

@mumoshu (Collaborator, Author) commented Feb 16, 2022

They don't have a specific call for this on GitHub runners, so the above is what I am trying. Right now in testing I am defaulting to a hard-coded label value. Any thoughts so far?

Looks good so far!

To be extra sure - where are you going to hook that RemoveRunnerLabel function in? The runnerreplicaset controller, the runner controller, or both?

My theory is that the earlier you remove labels from a non-busy runner (or unregister it - whatever works is OK), the less likely the runner is to race to accept workflow jobs before the pod is finally terminated, which means the runner pod can be terminated safely and earlier. So hooking your code into runnerreplicaset_controller seems right at a glance.

But there may be some folks who use Runner directly, not via RunnerDeployment (and the RunnerReplicaSet managed by a deployment). So we still need to hook it into runner_controller for them. A naive way would be to duplicate it between the runner and runnerreplicaset controllers.

I hope we can implement it in such a way that the runnerreplicaset controller just notifies the runner controller through a runner resource annotation or field.

Once notified, the runner controller immediately checks the runner status, and if and only if it's not busy, it removes labels / unregisters the runner and starts a timer for the pod deletion. This way the runnerreplicaset controller doesn't need to know how to check runner status, remove labels, etc. Thoughts?
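A sketch of the "notify via annotation" idea, assuming ARC's v1alpha1 Runner type and a controller-runtime client; the annotation key and the requestUnregistration helper are hypothetical names, not ARC's actual API.

package controllers

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/actions-runner-controller/actions-runner-controller/api/v1alpha1"
)

// annotationKeyUnregistrationRequested is a hypothetical key; any stable key works.
const annotationKeyUnregistrationRequested = "actions-runner-controller/unregistration-requested-at"

// requestUnregistration is what the runnerreplicaset controller would call on
// scale-down: it only stamps the runner resource, leaving the busy check, label
// removal/unregistration, and the deletion timer to the runner controller.
func requestUnregistration(ctx context.Context, c client.Client, runner *v1alpha1.Runner) error {
	updated := runner.DeepCopy()
	if updated.Annotations == nil {
		updated.Annotations = map[string]string{}
	}
	if _, ok := updated.Annotations[annotationKeyUnregistrationRequested]; ok {
		// Already requested; keep the original timestamp.
		return nil
	}
	updated.Annotations[annotationKeyUnregistrationRequested] = time.Now().UTC().Format(time.RFC3339)
	return c.Patch(ctx, updated, client.MergeFrom(runner))
}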

@jbkc85 commented Feb 16, 2022

So counting the number of reconciliations doesn't result in reliably delaying some operation. The way you used FirstAdded to implement a timeout does look familiar to me. I think I've used similar patterns everywhere throughout ARC's code-base.

100% - that's where I copied it from! I figured it was a good use case here, and so far it has proven to be a valuable addition. There are still a few cancellations and race conditions, but I am trying to work through them.

Once notified, the runner controller immediately checks the runner status, and if and only if it's not busy, it removes labels / unregisters the runner and starts a timer for the pod deletion. This way the runnerreplicaset controller doesn't need to know how to check runner status, remove labels, etc. Thoughts?

I think that's a better path, actually. Having a 'deletionTimestamp' field that the runner_controller looks at would be perfect for avoiding duplication of code and centralizing the 'deletion' of an asset in one controller. I started in the runner replica set simply because that's what we are using and that's what I knew - but I wouldn't mind the code moving to a more proper place. My original thought was to add a lot of this to the runner resource - including the GitHub status (IsBusy, IsOffline, etc.) simply because it shouldn't need to be checked by other controllers; they should just pull in the runner resource and verify there. I just haven't gotten to that because I wasn't 100% familiar with how the k8s operator framework works, nor how to add annotations appropriately to the models.

@jbkc85 commented Feb 16, 2022

My thoughts for runner resource status additions:

status:
  deletionScheduled: "2022-02-16T15:00:00Z"
  github:
    busy: true
    offline: false
    lastCheckTime: "2022-02-16T12:45:36Z"
  lastRegistrationCheckTime: "2022-02-16T13:05:36Z"
  phase: Running
  registration:
    expiresAt: "2022-02-16T15:27:01Z"
    labels:
    - java
    - linux
    - eks
    - self-hosted
    organization: INNOSOLProject
    token:

deletionScheduled can be an IsZero() time value to designate a pod that is not ready for deletion. The github block should reflect all the status updates from the GitHub API that we rely on throughout ARC - meaning that only one controller updates these, and the other controllers simply inspect the runner's status, not that of GitHub. It would be EVEN better if we could derive our own metrics/attributes from the pod to cross-reference with GitHub over time (perhaps a log stream? Not sure...) just in case - but I am not sure that's possible.

EDIT: deletionScheduled might be a two-parter - do we want to include in the status which controller is prompting the deletion of the runner?
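For illustration, a rough sketch of what those additions could look like as Go status types; the type and field names are hypothetical, not ARC's existing RunnerStatus.

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// RunnerGitHubStatus mirrors the GitHub API view of the runner so that only one
// controller has to query GitHub, and every other controller reads the cached
// result from the resource itself.
type RunnerGitHubStatus struct {
	Busy          bool        `json:"busy"`
	Offline       bool        `json:"offline"`
	LastCheckTime metav1.Time `json:"lastCheckTime,omitempty"`
}

// RunnerStatusAdditions shows only the proposed new fields; a zero
// DeletionScheduled means the runner is not currently a deletion candidate.
type RunnerStatusAdditions struct {
	DeletionScheduled metav1.Time        `json:"deletionScheduled,omitempty"`
	GitHub            RunnerGitHubStatus `json:"github,omitempty"`
}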

mumoshu added a commit that referenced this pull request Feb 19, 2022
Apparently, we've been missing taking an updated registration token into account when generating the pod template hash, which is used to detect if the runner pod needs to be recreated.

This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens.

Ref #1085 (comment)
@mumoshu force-pushed the master branch 2 times, most recently from ac017f0 to 25570a0 on March 3, 2022 02:05
stale bot commented Apr 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Apr 2, 2022
github-actions bot closed this on Apr 13, 2022
@Link- deleted the fix-ephemeral-runner-unexpected-restart branch on March 13, 2023 15:07