Refactored TokenStore garbage collection to use cron workflow #3419
Conversation
Force-pushed f2b0b39 to 035f7db.
TaskList:                     cronConfig.TaskListName,
ExecutionStartToCloseTimeout: cronConfig.ExecutionStartToCloseTimeout,
CronSchedule:                 cronConfig.CronSchedule,
Memo: map[string]interface{}{ // Note: CronSchedule is not directly retrievable (version 0.13.4-0.15.0).
How hard would it be to migrate away from this once we upgrade to a version that supports retrieving the cron schedule?
The question is: should we hold off until we upgrade?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be somewhere between easy and trivial.
Once the cron schedule becomes retrievable, only the generic CronConfiguration.WorkflowState() part needs to be adjusted to use the "official" retrieval for obtaining the original cron schedule when deciding whether the schedule changed.
These implementation details sit behind the generic cron configuration abstraction layer and are currently used to determine whether (re)scheduling is required; nothing would change from the perspective of the cron workflow user/initiator code.
In theory the retrieval implementation should reside next to the generic Memo collection, so hopefully this only means throwing out the Memo handling code and referencing a different field of the same workflow execution description object.
Worst case, we need to call a different function of the Cadence client, if the cron schedule only appears there and not in the currently used Client.DescribeWorkflowExecution() result object.
However, there is no indication of this feature being implemented in the near future.
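For context, a minimal sketch of how the current Memo-based lookup could work against Client.DescribeWorkflowExecution(); the memo key, function name, and JSON decoding below are illustrative assumptions, not the actual Pipeline code:

```go
package cron

import (
	"context"
	"encoding/json"

	"go.uber.org/cadence/client"
)

// cronScheduleMemoKey is an illustrative memo key; the real key lives behind
// the generic cron configuration abstraction.
const cronScheduleMemoKey = "CronSchedule"

// storedCronSchedule reads the cron schedule previously recorded in the Memo
// of the running cron workflow, since the schedule itself is not directly
// retrievable in client versions 0.13.4-0.15.0.
func storedCronSchedule(ctx context.Context, c client.Client, workflowID string) (string, error) {
	resp, err := c.DescribeWorkflowExecution(ctx, workflowID, "")
	if err != nil {
		return "", err
	}

	raw, ok := resp.GetWorkflowExecutionInfo().GetMemo().GetFields()[cronScheduleMemoKey]
	if !ok {
		return "", nil // nothing recorded, treat as "needs (re)scheduling"
	}

	// Memo values are encoded by the client's data converter (JSON by default).
	var schedule string
	if err := json.Unmarshal(raw, &schedule); err != nil {
		return "", err
	}

	return schedule, nil
}
```

Once the cron schedule becomes part of the execution description, the memo lookup above could be replaced by reading that field from the same response object.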
I don't mind this approach since in this case there is actually no chance of a massive history buildup. I'm not sure this will work in a scenario where you need to iterate through the complete list of clusters every time though.
@@ -632,7 +632,6 @@ func main() {
base.GET("version", gin.WrapH(buildinfo.Handler(buildInfo)))
Starting a thread from this message
I don't mind this approach since in this case there is actually no chance of a massive history buildup. I'm not sure this will work in a scenario where you need to iterate through the complete list of clusters every time though.
I can derive two immediate concerns from your message and the example cluster list iteration case.
I'm going to attempt to address those, but if you see other possible issues with this approach, let me know as I might have misunderstood your original intent.
- Issue of building up too much state information during one workflow run, because of a lengthy cluster list and/or many activities in a cron workflow.
- Issue of recording the current state within a (partitioned) single cron iteration logic.
I see two methods for tackling these:
A, initiating child workflows from the cron workflow to break down the cluster list size and/or activity number.
As parent and child history are separated, this can prevent the issue of too large history in the cron workflow regardless of the cluster/activity count.
The main pros of this approach are
- the intuitive, easy to follow logical hierarchy of the workflows/activities,
- the simple implementation (child workflows can have a parametrized generic implementation),
- the support of "out-of-the-box" parallelism through child workflow futures, and
- the (cron) workflow requires little to no cron-run-specific logic (the degree depending on the state recording solution).
In this solution the necessary input state can be carried from the cron workflow through the input arguments of the child workflows. The current (completion) state of a single cron iteration can be determined from the state of the child workflows (which could be challenging without aggregation), so for observability purposes we can introduce additional recording measures in the parent cron workflow.
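A rough sketch of method A, assuming hypothetical workflow/activity names (ProcessClusterBatch, CollectGarbageActivity) and made-up timeouts, just to show the fan-out through child workflow futures:

```go
package gc

import (
	"time"

	"go.uber.org/cadence/workflow"
)

// ProcessClusterBatch is an illustrative child workflow handling a slice of
// cluster IDs; its history is separate from the parent cron workflow's.
func ProcessClusterBatch(ctx workflow.Context, clusterIDs []string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		ScheduleToStartTimeout: time.Minute,
		StartToCloseTimeout:    10 * time.Minute,
	})

	for _, id := range clusterIDs {
		// CollectGarbageActivity is a hypothetical activity name.
		if err := workflow.ExecuteActivity(ctx, "CollectGarbageActivity", id).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}

// CronWorkflow fans the cluster list out to child workflows in fixed-size
// batches, keeping the parent history small and getting parallelism through
// the child workflow futures.
func CronWorkflow(ctx workflow.Context, clusterIDs []string, batchSize int) error {
	ctx = workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
		ExecutionStartToCloseTimeout: 30 * time.Minute,
	})

	var futures []workflow.ChildWorkflowFuture
	for start := 0; start < len(clusterIDs); start += batchSize {
		end := start + batchSize
		if end > len(clusterIDs) {
			end = len(clusterIDs)
		}
		futures = append(futures, workflow.ExecuteChildWorkflow(ctx, ProcessClusterBatch, clusterIDs[start:end]))
	}

	// Wait for all children; their completion state is the iteration's state.
	for _, f := range futures {
		if err := f.Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}
```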
B, splitting up the single cron workflow run into multiple executions (runs) by returning ContinueAsNewError from the cron workflow at the partitioning points of the single iteration logic, in order to reduce the single-iteration history size.
This approach is a better fit for the too-many-clusters case, but would also be possible - although not recommended - for the too-many-activities case. At certain points of the single iteration logic - let's say every 100 clusters - we could "reset" the history using a manual workflow "reset", if we could carry over the state information.
The main pro of this approach is the reduced hierarchy and the fewer workflows/activities, thus an easier "overview" of all running workflows/activities in a system (one cron iteration is encapsulated in its single corresponding workflow).
IMO, aside from the custom logic maintenance cost, another drawback is the cumbersome nature of handling the "too many activities" case (the logical jumps become hard to follow for readers) and also the loss of parallelism (as only one instance of the cron workflow can be executed at any moment).
This method requires additional state recording measures in the cron workflow.
One solution is to use the workflow input arguments, which can be modified through the ContinueAsNewError returned from the cron workflow at the partitioning points.
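A rough sketch of what that could look like, under the same assumptions as above (hypothetical activity name, illustrative batch size), with the remaining work carried in the continue-as-new arguments:

```go
package gc

import (
	"time"

	"go.uber.org/cadence/workflow"
)

const clustersPerRun = 100 // illustrative partitioning point

// GCWorkflow processes at most clustersPerRun clusters per run and then
// returns ContinueAsNewError, carrying the remaining work in its own input
// arguments so each run's history stays small.
func GCWorkflow(ctx workflow.Context, remainingClusterIDs []string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		ScheduleToStartTimeout: time.Minute,
		StartToCloseTimeout:    10 * time.Minute,
	})

	batch := remainingClusterIDs
	if len(batch) > clustersPerRun {
		batch = batch[:clustersPerRun]
	}

	for _, id := range batch {
		// CollectGarbageActivity is a hypothetical activity name.
		if err := workflow.ExecuteActivity(ctx, "CollectGarbageActivity", id).Get(ctx, nil); err != nil {
			return err
		}
	}

	rest := remainingClusterIDs[len(batch):]
	if len(rest) == 0 {
		return nil // iteration finished
	}

	// Start a fresh run with a clean history and the remaining clusters as input.
	return workflow.NewContinueAsNewError(ctx, GCWorkflow, rest)
}
```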
I personally would prefer method A, because it relies more on the higher level Cadence mechanics and relegates the lower level implementation to Cadence, supports parallelism, and avoids additional custom workaround logic. I think solution A is intuitive, easier to modify or extend in the future if needed, and has much better separation of responsibilities compared to solution B, which also eases maintenance in the long run.
I agree, method A would be better, but it doesn't apply here, so there is nothing else to do in my opinion.
Force-pushed 035f7db to 8bc0027.
For similar reasons, I also wanted to use method A for a case where we implemented the integrated service operator upgrade for all clusters. After a long debate we gave up and implemented a slightly modified version of approach B (returning with ContinueAsNew after every iteration, not just at partitioning points).
This is just a note that we have now diverged in two different directions on similar problems.
Otherwise LGTM!
For reference: #3406
Force-pushed 61c71b2 to a4aa6e9.
Force-pushed a4aa6e9 to 9996cc2.
What's in this PR?
Refactored TokenStore garbage collection to use cron workflow.
Why?
Working towards HA-readiness for Pipeline.
Additional context
Due to technical reasons I had to change some of the previously discussed implementation details and chose to change some others for maintainability:
1. why use "activity + workflow + start with cron schedule" instead of "activity + workflow returning ContinueAsNewError + cron workflow + start with cron schedule" for history wipe and organizing?
Cadence cron workflows are currently implemented using an extra workflow option (cron schedule) at workflow start; the difference between a cron-scheduled and a non-cron-scheduled workflow start is that cron workflows always return a ContinueAsNewError result by default and thus get scheduled again for the next iteration.
Firstly, this eliminates the explicit ContinueAsNewError returns, as those would interfere with the theoretical wrapping cron workflow's schedule (the inner workflow would then be retried and the cron iteration would not finish).
Secondly, this in my opinion means that cron workflows can be regular workflows without modification, so wrapping a workflow into another (cron) workflow would be redundant, as the outer workflow would have no additional benefit (neither for child workflow running nor for cron scheduling).
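To illustrate the point, this is roughly what scheduling an otherwise regular workflow as a cron workflow looks like with the Go client; the workflow name, ID, task list, and interval below are placeholders, not the values used in this PR:

```go
package gc

import (
	"context"
	"time"

	"go.uber.org/cadence/client"
)

// scheduleTokenStoreGC starts the (otherwise regular) garbage collection
// workflow with a cron schedule; no explicit ContinueAsNewError is needed in
// the workflow body, the server reschedules it after each run.
func scheduleTokenStoreGC(ctx context.Context, c client.Client) error {
	options := client.StartWorkflowOptions{
		ID:                           "tokenstore-gc", // placeholder workflow ID
		TaskList:                     "pipeline",      // placeholder task list
		ExecutionStartToCloseTimeout: 30 * time.Minute,
		CronSchedule:                 "0 */12 * * *", // every 12 hours
	}

	// "TokenStoreGCWorkflow" is a placeholder for the registered workflow name.
	_, err := c.StartWorkflow(ctx, options, "TokenStoreGCWorkflow")
	return err
}
```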
2. why was the token store garbage collection moved from the Pipeline process to the worker process?
There are 2 requirements to start a cron workflow:
A, Register the workflow.
B, Schedule the cron workflow.
(Note: with the cron workflow state checking mechanism introduced here, from high availability and concurrency perspectives it is functionally equivalent to start the cron workflows either in the Pipeline process or the worker process, regardless of whether the environment contains single or multiple instances of either process.)
The registration part requires a Cadence worker, so it should happen in the worker process itself, because the worker processes initiate the Cadence workers (semantically a Cadence worker would not fit into the Pipeline process IMO).
The reason I decided to put the cron workflow start in the worker process as well is the coupling between the registration and the workflow starting.
If the registration happened in the worker and the workflow start happened in the Pipeline process, I would introduce a hidden dependency (Pipeline depending on the worker to have the workflow registered before trying to start it), which I find more difficult to satisfy and maintain.
(Note: if preferred, there could be a single parent cron workflow which starts all child cron workflows doing the actual work, but I found no substantial benefit behind this centralization and it incurs a significant cost, coupling together all the cron workflows, which are fairly independent from each other in the current decentralized cron workflow starting implementation.)
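A minimal sketch of how the two coupled steps can live together in the worker process; the Cadence service client construction and the state checking mechanism are omitted, and names, IDs, and options are placeholders as in the earlier sketch:

```go
package gc

import (
	"context"
	"time"

	"go.uber.org/cadence/client"
	"go.uber.org/cadence/worker"
	"go.uber.org/cadence/workflow"
)

// TokenStoreGCWorkflow is a stand-in for the actual garbage collection workflow.
func TokenStoreGCWorkflow(ctx workflow.Context) error {
	// ... execute the garbage collection activity ...
	return nil
}

// registerAndScheduleGC performs both steps in the worker process:
// A, register the workflow on the worker the process already owns, and
// B, start it with a cron schedule, so no hidden cross-process ordering
// dependency (registration before start) is introduced.
func registerAndScheduleGC(ctx context.Context, w worker.Worker, c client.Client) error {
	w.RegisterWorkflow(TokenStoreGCWorkflow) // A, registration

	_, err := c.StartWorkflow(ctx, client.StartWorkflowOptions{ // B, cron scheduling
		ID:                           "tokenstore-gc", // placeholder workflow ID
		TaskList:                     "pipeline",      // placeholder task list
		ExecutionStartToCloseTimeout: 30 * time.Minute,
		CronSchedule:                 "0 */12 * * *",
	}, TokenStoreGCWorkflow)
	return err
}
```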
Tests
Tested using a 12 minute cron interval instead of 12 hours, with not-yet-scheduled, already-scheduled, and schedule-changed starts; worked fine.
(Also, before implementing it in Pipeline, I tested the generic state checking and workflow manipulation logic in an isolated Cadence environment to ease development.)
Checklist
- OpenAPI and Postman files updated (if needed)
- User guide and development docs updated (if needed)
- Related Helm chart(s) updated (if needed)