
Refactored TokenStore garbage collection to use cron workflow #3419

Merged
merged 1 commit from 3365/refactor/token-store-gc-as-cron-workflow into master on Feb 28, 2021

Conversation

pregnor
Member

@pregnor pregnor commented Feb 3, 2021

Q A
Bug fix? no
New feature? no
API breaks? no
Deprecations? no
Related tickets resolves #3365
License Apache 2.0

What's in this PR?

Refactored TokenStore garbage collection to use cron workflow.

Why?

Working towards HA-readiness for Pipeline.

Additional context

Due to technical reasons I had to change some of the previously discussed implementation details, and chose to change some others for maintainability:

1. Why use "activity + workflow + start with cron schedule" instead of "activity + workflow returning ContinueAsNewError + cron workflow + start with cron schedule" for history wiping and organizing?

Cadence cron workflows are currently implemented using an extra workflow option (the cron schedule) at workflow start. The only difference between a cron-scheduled and a non-cron-scheduled workflow start is that cron workflows always return a ContinueAsNewError result by default and thus get scheduled again for the next iteration.

Firstly, this eliminates the explicit ContinueAsNewError returns, as those would interfere with the schedule of a theoretical wrapping cron workflow (the inner workflow would then be retried and the cron iteration would never finish).

Secondly, this in my opinion means that cron workflows can be regular workflows without modification, so wrapping a workflow into another (cron) workflow would be redundant: the outer workflow would provide no additional benefit (neither for running the child workflow nor for cron scheduling).
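For illustration, here is a minimal sketch (assumed names and values, not this PR's actual code) of the chosen "activity + workflow + start with cron schedule" shape with the Cadence Go client: the workflow body is a completely ordinary workflow, and the cron behaviour comes solely from the CronSchedule start option.

```go
package gcexample

import (
	"context"
	"time"

	"go.uber.org/cadence/client"
	"go.uber.org/cadence/workflow"
)

// tokenStoreGCWorkflow is an ordinary workflow; Cadence re-runs it on the cron
// schedule without any cron-specific code (e.g. ContinueAsNewError) inside it.
func tokenStoreGCWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		ScheduleToStartTimeout: time.Minute,
		StartToCloseTimeout:    10 * time.Minute,
	})

	// Hypothetical activity name; the real activity would clean up expired tokens.
	return workflow.ExecuteActivity(ctx, "tokenstore-gc-activity").Get(ctx, nil)
}

// startTokenStoreGCCron schedules the workflow as a cron workflow.
func startTokenStoreGCCron(ctx context.Context, c client.Client) error {
	_, err := c.StartWorkflow(ctx, client.StartWorkflowOptions{
		ID:                           "tokenstore-gc", // fixed ID, so a second start attempt is rejected
		TaskList:                     "pipeline",      // assumed task list name
		ExecutionStartToCloseTimeout: 15 * time.Minute,
		CronSchedule:                 "0 */12 * * *", // example 12-hour schedule
	}, tokenStoreGCWorkflow)

	return err
}
```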

2. Why was the token store garbage collection moved from the Pipeline process to the worker process?

There are 2 requirements to start a cron workflow:
A, Register the workflow.
B, Schedule the cron workflow.

(Note: with the cron workflow state checking mechanism introduced here, from high availability and concurrency perspectives it is functionally equivalent to start the cron workflows either in the Pipeline process or in the worker process, regardless of whether the environment contains a single instance or multiple instances of either the Pipeline or the worker process.)

The registration part requires a Cadence worker, so it should happen in the worker process itself, because the worker processes initiate the Cadence workers (semantically a Cadence worker would not fit into the Pipeline process IMO).

The reason I decided to put the cron workflow start in the worker process as well is the coupling between the registration and the workflow start.
If the registration happened in the worker and the workflow start happened in the Pipeline process, that would introduce a hidden dependency (Pipeline depending on the worker having registered the workflow before trying to start it), which I find more difficult to satisfy and maintain.

(Note: if preferred, there could be a single parent cron workflow which starts all child cron workflows doing the actual work, but I found no substantial benefit in this centralization, and it incurs a significant cost by coupling together all the cron workflows, which are fairly independent from each other in the current decentralized cron workflow starting implementation.)
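As a rough sketch of this layout (names are assumptions, reusing the helpers from the previous sketch): the worker process both registers the workflow and attempts the cron start, and thanks to the fixed workflow ID a second instance's start attempt is simply rejected as already started, so multiple worker instances stay safe.

```go
package gcexample

import (
	"context"
	"errors"

	"go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"
	"go.uber.org/cadence/.gen/go/shared"
	"go.uber.org/cadence/client"
	"go.uber.org/cadence/worker"
)

func runGCWorker(service workflowserviceclient.Interface, domain, taskList string, c client.Client) error {
	w := worker.New(service, domain, taskList, worker.Options{})

	// A, register the workflow in the worker process.
	w.RegisterWorkflow(tokenStoreGCWorkflow)

	if err := w.Start(); err != nil {
		return err
	}

	// B, schedule the cron workflow from the same process. If another worker
	// instance already scheduled it (same fixed workflow ID), the duplicate
	// start is rejected, which is the desired behaviour here.
	var alreadyStarted *shared.WorkflowExecutionAlreadyStartedError
	if err := startTokenStoreGCCron(context.Background(), c); err != nil && !errors.As(err, &alreadyStarted) {
		return err
	}

	return nil
}
```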

Tests

Tested using a 12-minute cron interval instead of 12 hours, with not-yet-scheduled, already-scheduled and schedule-changed starts; everything worked fine.
(Also, before implementing it in Pipeline, I tested the generic state checking and workflow manipulation logic in an isolated Cadence environment to ease the development effort.)
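(For reference, the Cadence cron schedule is a standard five-field cron expression; illustrative values for the intervals mentioned above, not necessarily the exact ones used:)

```go
const (
	testCronSchedule       = "*/12 * * * *" // every 12 minutes, used during manual testing
	productionCronSchedule = "0 */12 * * *" // every 12 hours
)
```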

Checklist

  • Implementation tested (with at least one cloud provider)
  • Code meets the Developer Guide
  • OpenAPI and Postman files updated (if needed)
  • User guide and development docs updated (if needed)
  • Related Helm chart(s) updated (if needed)

@pregnor pregnor self-assigned this Feb 3, 2021
@pregnor pregnor force-pushed the 3365/refactor/token-store-gc-as-cron-workflow branch 2 times, most recently from f2b0b39 to 035f7db Compare February 9, 2021 10:10
TaskList: cronConfig.TaskListName,
ExecutionStartToCloseTimeout: cronConfig.ExecutionStartToCloseTimeout,
CronSchedule: cronConfig.CronSchedule,
Memo: map[string]interface{}{ // Note: CronSchedule is not directly retrievable (version 0.13.4-0.15.0).
Member

How hard would it be to migrate away from this once we upgrade to a version that supports retrieving the cron schedule?

The question is: should we hold off until we upgrade?

Member Author

I think it should be between easy and trivial.

Once the cron schedule becomes retrievable, only the generic CronConfiguration.WorkflowState() part needs to be adjusted to use the "official retrieval" for obtaining the original cron schedule to decide whether the schedule changed.
These implementation details are behind the generic cron configuration abstraction layer and are currently used for determining whether (re)scheduling is required; nothing would change from the perspective of the cron workflow user/initiator code.

In theory the retrieval implementation should reside next to the generic Memo collection, so hopefully this only means throwing out the Memo handling code and referencing a different field of the same workflow execution description object.

Worst case we need to call a different function of the Cadence client, if the cron schedule only appears there and not in the currently used Client.DescribeWorkflowExecution() result object.

However, there is no indication of this feature being implemented in the near future.
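To make the migration path concrete, here is a sketch of the Memo-based retrieval (hedged: the memo key, helper name and JSON encoding are my assumptions, not the exact PR code). Once Cadence exposes the cron schedule directly, only a helper like this would need to change:

```go
package gcexample

import (
	"context"
	"encoding/json"
	"fmt"

	"go.uber.org/cadence/client"
)

const cronScheduleMemoKey = "CronSchedule" // assumed memo key

// scheduledCronSchedule recovers the cron expression recorded in the Memo at
// workflow start time, using the workflow execution description.
func scheduledCronSchedule(ctx context.Context, c client.Client, workflowID string) (string, error) {
	resp, err := c.DescribeWorkflowExecution(ctx, workflowID, "") // empty run ID describes the latest run
	if err != nil {
		return "", err
	}

	raw, ok := resp.GetWorkflowExecutionInfo().GetMemo().GetFields()[cronScheduleMemoKey]
	if !ok {
		return "", fmt.Errorf("memo key %q not found", cronScheduleMemoKey)
	}

	// Memo values are stored encoded; the default data converter uses JSON.
	var schedule string
	if err := json.Unmarshal(raw, &schedule); err != nil {
		return "", err
	}

	return schedule, nil
}
```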

@sagikazarmark
Member

I don't mind this approach since in this case there is actually no chance of a massive history buildup. I'm not sure this will work in a scenario where you need to iterate through the complete list of clusters every time though.

@@ -632,7 +632,6 @@ func main() {
base.GET("version", gin.WrapH(buildinfo.Handler(buildInfo)))
Member Author

Starting a thread from this message

I don't mind this approach since in this case there is actually no chance of a massive history buildup. I'm not sure this will work in a scenario where you need to iterate through the complete list of clusters every time though.

I can derive two immediate concerns from your message and the example cluster list iteration case.
I'm going to attempt to address those, but if you see other possible issues with this approach, let me know, as I might have misunderstood your original intent.

  1. Issue of building up too much state information during one workflow run because of a lengthy cluster list and/or many activities in a cron workflow.
  2. Issue of recording current state in a (partitioned) single cron iteration logic.

I see two methods for tackling these:

A, initiating child workflows from the cron workflow to break down the cluster list size and/or activity number.

As parent and child histories are separated, this can prevent the issue of an overly large history in the cron workflow, regardless of the cluster/activity count.
The main pros of this approach are

  • the intuitive, easy to follow logical hierarchy of the workflows/activities,
  • the simple implementation (child workflows can have a parametrized, generic implementation),
  • the support of "out-of-the-box" parallelism through child workflow futures, and
  • the (cron) workflow requires little to no cron-run-specific logic (the degree depending on the state recording solution).

In this solution the necessary input state can be carried from the cron workflow through the input arguments of the child workflows. The current (completion) state of a single cron iteration can be determined from the state of the child workflows, which could be challenging without aggregation, so for observability purposes we can introduce additional recording measures in the parent cron workflow.
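A sketch of method A under assumed names (child workflow name, ID type and batch size are illustrative): the cron workflow only partitions the cluster list and fans the batches out to child workflows, so each batch's history lives in its own child execution and the batches can run in parallel through child workflow futures.

```go
package gcexample

import (
	"time"

	"go.uber.org/cadence/workflow"
)

func perClusterCronWorkflow(ctx workflow.Context, clusterIDs []uint) error {
	const batchSize = 100 // assumed partition size

	ctx = workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
		ExecutionStartToCloseTimeout: 30 * time.Minute,
	})

	// Fan out: one child workflow per batch, each with its own separate history.
	var futures []workflow.ChildWorkflowFuture
	for start := 0; start < len(clusterIDs); start += batchSize {
		end := start + batchSize
		if end > len(clusterIDs) {
			end = len(clusterIDs)
		}

		// Hypothetical, parametrized generic child workflow doing the per-cluster work.
		futures = append(futures, workflow.ExecuteChildWorkflow(ctx, "per-cluster-batch-workflow", clusterIDs[start:end]))
	}

	// Wait for all batches to finish, returning the first error encountered.
	var firstErr error
	for _, future := range futures {
		if err := future.Get(ctx, nil); err != nil && firstErr == nil {
			firstErr = err
		}
	}

	return firstErr
}
```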

B, splitting up the single cron workflow run into multiple executions (runs) by returning ContinueAsNewError at the partitioning points of the iteration logic, in order to reduce the history size of a single iteration.

This approach is a better fit for tackling the "too many clusters in the list" case, but would also be possible - although not recommended - for the "too many activities" case. At certain points of the iteration logic - let's say every 100 clusters - we could manually "reset" the history (via the ContinueAsNewError return), provided we can carry over the state information.
The main pros of this approach are the reduced hierarchy and fewer workflows/activities, and thus an easier "overview" of all running workflows/activities in the system (one cron iteration is encapsulated in its single corresponding workflow).
IMO, aside from the maintenance cost of the custom logic, other drawbacks are the cumbersome handling of the "too many activities" case (the logical jumps become hard to follow for readers) and the loss of parallelism (as only one instance of the cron workflow can be executed at any moment).

This method requires additional state recording measures in the cron workflow.
One solution is to use the workflow input arguments, which can be modified through the ContinueAsNewError returned from the cron workflow at the partitioning points.
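For comparison, a sketch of method B under the same assumptions (activity names and batch size are illustrative): the cursor (offset) travels in the workflow input arguments, and the run swaps its history for a fresh one at each partitioning point by returning ContinueAsNewError with the advanced cursor.

```go
package gcexample

import (
	"time"

	"go.uber.org/cadence/workflow"
)

func partitionedCronWorkflow(ctx workflow.Context, offset int) error {
	const batchSize = 100 // assumed partition size

	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		ScheduleToStartTimeout: time.Minute,
		StartToCloseTimeout:    10 * time.Minute,
	})

	// Hypothetical activity returning the next batch of cluster IDs after the offset.
	var clusterIDs []uint
	if err := workflow.ExecuteActivity(ctx, "list-cluster-batch-activity", offset, batchSize).Get(ctx, &clusterIDs); err != nil {
		return err
	}

	// Hypothetical per-cluster activity; these calls accumulate in this run's history.
	for _, clusterID := range clusterIDs {
		if err := workflow.ExecuteActivity(ctx, "per-cluster-gc-activity", clusterID).Get(ctx, nil); err != nil {
			return err
		}
	}

	if len(clusterIDs) < batchSize {
		return nil // last partition, this cron iteration is complete
	}

	// Partitioning point: continue as a new run (fresh history) with the advanced cursor.
	return workflow.NewContinueAsNewError(ctx, partitionedCronWorkflow, offset+len(clusterIDs))
}
```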


I personally would prefer method A, because it relies more on the higher level Cadence mechanics and delegates the lower level implementation to Cadence, supports parallelism, and avoids additional custom workaround logic. I think solution A is intuitive and easier to modify or extend in the future if needed, while also having a much better separation of responsibilities than solution B, which eases maintenance in the long run.

Member

I agree, method A would be better, but it doesn't apply here, so there is nothing else to do in my opinion.

@pregnor pregnor force-pushed the 3365/refactor/token-store-gc-as-cron-workflow branch from 035f7db to 8bc0027 Compare February 15, 2021 06:27
@pregnor pregnor requested a review from pepov February 15, 2021 14:00
Contributor

@pepov pepov left a comment


I also wanted to use method A, for similar reasons, for a case where we implemented the integrated service operator upgrade for all clusters. After a long debate we gave up and implemented a slightly modified version of approach B (returning with ContinueAsNew after every iteration, not just at partitioning points).

This is just a note that we have now diverged in two different directions on similar problems.

Otherwise LGTM!

@pepov
Contributor

pepov commented Feb 18, 2021

For reference: #3406

@pregnor pregnor force-pushed the 3365/refactor/token-store-gc-as-cron-workflow branch 2 times, most recently from 61c71b2 to a4aa6e9 Compare February 25, 2021 23:43
@pregnor pregnor force-pushed the 3365/refactor/token-store-gc-as-cron-workflow branch from a4aa6e9 to 9996cc2 Compare February 26, 2021 09:59
@pregnor pregnor merged commit 299b85d into master Feb 28, 2021
@pregnor pregnor deleted the 3365/refactor/token-store-gc-as-cron-workflow branch February 28, 2021 08:46