proper graceful shutdown settings #381
Comments
After making this change for a shorter action_runner.graceful_shutdown, we still see actions getting stuck in a running state.
Let this be a lesson to everyone. Do NOT put inline comments in your config file.

Never comment anything ever. EVER.
Whelp, it turns out I STILL am not seeing graceful shutdowns. All executions immediately get abandoned.
I can confirm turning on the service registry fixed my graceful shutdown.
This looks like the right place to add anything in the chart: stackstorm-k8s/templates/configmaps_st2-conf.yaml, lines 18 to 21 at cc94bd0.
We could do this:

```diff
 {{- if index .Values "redis" "enabled" }}
 [coordination]
+service_registry = True
 url = redis://{{ template "stackstorm-ha.redis-password" $ }}{{ template "stackstorm-ha.redis-nodes" $ }}
 {{- end }}
```

I do not use the redis subchart, so that would not change the default for me (I pass this in via my own overrides). Or we could do this instead:

```diff
+[coordination]
+service_registry = True
 {{- if index .Values "redis" "enabled" }}
-[coordination]
 url = redis://{{ template "stackstorm-ha.redis-password" $ }}{{ template "stackstorm-ha.redis-nodes" $ }}
 {{- end }}
```

Which option would you prefer? Or something else entirely?
Second option is probably best.
We have all the proper graceful shutdown settings for actionrunner and workflow, but we are still seeing action executions get stuck in a running state. At a minimum they should be abandoned per the actionrunner code.
I do notice that we are performing a query:

`coordinator.get_members(service.encode("utf-8")).get()`

and then adding to a counter to determine if we are past the expiration.
What may be happening is that if the action_runner.graceful_shutdown config is set to, say, 100 seconds, then over 100 seconds need to actually pass before the logic to abandon executions is called (sketched below).
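For illustration only, here is a minimal sketch of that kind of counter-based wait. The names (`graceful_shutdown_timeout`, `sleep_interval`, `running_execution_ids`) are hypothetical and the real actionrunner code differs, but it shows why the full configured timeout must elapse before anything is abandoned:

```python
import time

def graceful_shutdown_wait(coordinator, service, running_execution_ids,
                           graceful_shutdown_timeout=100, sleep_interval=10):
    """Hypothetical counter-based wait, as described above."""
    elapsed = 0
    # Keep waiting while executions are still running and the counter
    # has not yet reached the configured timeout.
    while running_execution_ids and elapsed < graceful_shutdown_timeout:
        # Query the service registry for live members of this service
        # (the call quoted above from the issue).
        coordinator.get_members(service.encode("utf-8")).get()
        time.sleep(sleep_interval)
        elapsed += sleep_interval

    # The abandon logic is only reached after the loop exits, i.e. after
    # graceful_shutdown_timeout seconds of sleeps if executions never finish.
    return running_execution_ids  # these would then be abandoned
```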
This would mean that we could set the terminationGracePeriodSeconds to, say, 300 seconds longer (or some other generous margin) than the action_runner.graceful_shutdown seconds, to ensure that the pod stays alive long enough to let the abandon process finish. I have made this change so we can monitor our prod cluster.
A code fix would be to use the action_runner.graceful_shutdown config to calculate an end datetime and then check against that in the while loop.
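A sketch of that proposed fix, again with hypothetical names rather than the actual st2 source: compute the deadline once from the configured timeout and compare wall-clock time against it in the loop, so slow iterations (such as the blocking `get_members` call) cannot stretch the effective shutdown window:

```python
import time
from datetime import datetime, timedelta

def graceful_shutdown_wait_with_deadline(coordinator, service, running_execution_ids,
                                         graceful_shutdown_timeout=100, sleep_interval=10):
    """Hypothetical deadline-based variant of the wait loop."""
    # Calculate the end datetime up front from the configured timeout.
    deadline = datetime.utcnow() + timedelta(seconds=graceful_shutdown_timeout)

    # Check wall-clock time against the deadline, not an iteration counter.
    while running_execution_ids and datetime.utcnow() < deadline:
        coordinator.get_members(service.encode("utf-8")).get()
        time.sleep(sleep_interval)

    # Anything still running at the deadline would be abandoned here.
    return running_execution_ids
```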