RFC for Eviction of cached task outputs #2633

MorpheusXAUT · 2022-06-27T17:03:38Z

https://hackmd.io/qOztkaj4Rb6ypodvGEowAg?view

Comments already present on the HackMD doc are from our internal team and have been left in for clarification/further discussion.

Initial discussion on Slack

Signed-off-by: Nick Müller <[email protected]>

paulbes · 2022-06-30T07:38:55Z

to perform housekeeping and clean up their cache, cleanly removing the cached data of previous executions that are no longer relevant

We also see a use case for this cache eviction functionality from a privacy perspective; the right to be forgotten.

GDPR (30 days to complete a deletion request):

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay.

CCPA (45 days to complete a deletion request):

A consumer shall have the right to request that a business delete any personal information about the consumer which the business has collected from the consumer.

When processing personally identifiable information (PII) to adhere to, e.g., GDPR, we generally only retain data for 30 days in our workflows. We do this to ensure that any deletion requests from users whose data we are processing will be fulfilled "automagically" within the expected timeframe for the "right to be forgotten".

With the cache eviction API we would be able to build a system that could evict the cache for certain types of workflows or tasks within a given timeframe.

With that said, and in fear of adding scope creep to this RFC, it might be even better to have the ability to set a TTL on the cache as an attribute, e.g.,

# Might make sense to only allow for a quite coarse granularity
cache_ttl_hours = 30 * 24
cache_ttl_days = 30

MorpheusXAUT · 2022-06-30T08:50:34Z

@paulbes Interestingly enough, we talked about automatic expiration of cached values after a certain timespan internally just yesterday 😄 Our discussion was mainly focused on a housekeeping/cache size perspective rather than GDPR related, but I agree it could be useful for that as well.

With that said, and in fear of adding scope creep to this RFC, it might be even better to have the ability to set a TTL on the cache as an attribute, e.g.,
 # Might make sense to only allow for a quite coarse-granularity
cache_ttl_hours = 30 * 24
cache_ttl_days = 30

I agree that'd be great to have, but I'm not sure either if we want to include it in this RFC or keep it separate/add afterwards so we don't extend the scope too much...
If people feel we should include this addition as well, I can take another look at what we've discussed internally and adapt the RFC accordingly.

hamersaw · 2022-07-05T23:14:54Z

@paulbes Interestingly enough, we talked about automatic expiration of cached values after a certain timespan internally just yesterday 😄 Our discussion was mainly focused on a housekeeping/cache size perspective rather than GDPR related, but I agree it could be useful for that as well.
With that said, and in fear of adding scope creep to this RFC, it might be even better to have the ability to set a TTL on the cache as an attribute, e.g.,
 # Might make sense to only allow for a quite coarse-granularity
cache_ttl_hours = 30 * 24
cache_ttl_days = 30
I agree that'd be great to have, but I'm not sure either if we want to include it in this RFC or keep it separate/add afterwards so we don't extend the scope too much... If people feel we should include this addition as well, I can take another look at what we've discussed internally and adapt the RFC accordingly.

A few thoughts on this. I know we have discussed adding a cache_expiration parameter to cached tasks, similar to the cache_version and cache_serialized that currently exist. That would simply define a duration after which the cache is expired (e.g. cache_expiration=45d) and overwrite the cache if we find it beyond expiration. It doesn't sound like the goals of either use case discussed (i.e. (1) general cache cleanup to reduce cache size or (2) GDPR forget) are covered by that solution, because nothing is deleted until a new task is executed to overwrite the expired value.
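To make the limitation concrete, here is a minimal sketch of expire-on-read behaviour. It assumes a hypothetical in-memory cache; `LazyExpiringCache` and its methods are made up for illustration and are not Flyte or datacatalog APIs. The point it shows is that an expired entry is treated as a miss but stays in storage until something overwrites it.

```python
from datetime import datetime, timedelta


class LazyExpiringCache:
    """Hypothetical stand-in for a task output cache (not a Flyte API)."""

    def __init__(self, expiration: timedelta):
        self.expiration = expiration
        self.entries = {}  # key -> (value, stored_at)

    def put(self, key, value, now):
        # Writing always overwrites whatever was stored before.
        self.entries[key] = (value, now)

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at > self.expiration:
            # Reported as a miss, but the stale entry is NOT deleted here.
            return None
        return value


cache = LazyExpiringCache(expiration=timedelta(days=45))
t0 = datetime(2022, 1, 1)
cache.put("task-a", "output-v1", now=t0)

later = t0 + timedelta(days=60)
hit = cache.get("task-a", now=later)        # expired, so treated as a miss
still_stored = "task-a" in cache.entries    # the old data is still there

# Only a new execution writing its output replaces the expired value.
cache.put("task-a", "output-v2", now=later)
```

If no new execution ever runs for that key, the expired data stays where it is, which is why this alone would not cover cache cleanup or a deletion request.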

I think implementation of a cache eviction API does open possibilities for automated eviction. If I am understanding this correctly, supporting automated eviction would require either a separate service, as @paulbes suggested, or an additional component in one of the Flyte core services that periodically scrapes and GCs the cache. IMO it is a deep enough topic to require a separate discussion.

I think it could be a very useful feature, maybe open a new issue to track it?

MorpheusXAUT · 2022-07-06T06:49:57Z

A few thoughts on this. I know we have discussed adding a cache_expiration parameter to cached tasks, similar to the cache_version and cache_serialized that currently exist. That would simply define a duration after which the cache is expired (e.g. cache_expiration=45d) and overwrite the cache if we find it beyond expiration. It doesn't sound like the goals of either use case discussed (i.e. (1) general cache cleanup to reduce cache size or (2) GDPR forget) are covered by that solution, because nothing is deleted until a new task is executed to overwrite the expired value.

I think implementation of a cache eviction API does open possibilities for automated eviction. If I am understanding this correctly, supporting automated eviction would require either a separate service, as @paulbes suggested, or an additional component in one of the Flyte core services that periodically scrapes and GCs the cache. IMO it is a deep enough topic to require a separate discussion.

@hamersaw Yes, you're correct, the addition mentioned by @paulbes (and the one we talked about internally) would require an additional service/component within Flyte to periodically check all cached entries and evict them should they have crossed a certain expiry threshold.
Just adding a cache_expiration parameter to the task could also be useful, but doesn't solve the issue we're trying to solve per se.
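As a rough sketch of what one pass of such a periodic check could look like: the function name, the shape of `entries`, and the 30-day threshold below are all assumptions for illustration, not existing Flyte components. A real service would query the cache backend and run this on a schedule.

```python
from datetime import datetime, timedelta


def evict_expired(entries, now, max_age):
    """Delete every entry older than max_age and return the evicted keys.

    `entries` is a hypothetical mapping of cache key -> datetime stored.
    """
    # Collect keys first so the dict is not mutated while iterating over it.
    expired = [key for key, stored_at in entries.items() if now - stored_at > max_age]
    for key in expired:
        del entries[key]
    return expired


entries = {
    "old": datetime(2022, 1, 1),
    "fresh": datetime(2022, 2, 20),
}
evicted = evict_expired(entries, now=datetime(2022, 3, 1), max_age=timedelta(days=30))
```

Unlike a plain expiration parameter, this removes data without waiting for a new execution to overwrite it.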

I think it could be a very useful feature, maybe open a new issue to track it?

I agree, I'll open another issue to track this idea once the RFC gets accepted! Would definitely be a useful feature to have in Flyte itself, I'd say.

sbrunk · 2022-07-06T11:49:43Z

Do we see Intratask Checkpoints as a cache? If so it might make sense to include them as part of an eviction.

MorpheusXAUT · 2022-07-06T12:33:21Z

Do we see Intratask Checkpoints as a cache? If so it might make sense to include them as part of an eviction.

@sbrunk Good point 🤔 Yes, we should probably also clear out all these values when we're evicting the cache for a task as we might still be accessing cached values otherwise even though it looks like we should be re-computing everything.

Since at least some of the handling is done outside flyteadmin/propeller's context (inside the actual code executed), we might have to remove these values before the execution (instead of afterwards, as suggested for the cached output).
Alternatively, we could try to pass along the cache_override flag to flytekit/the Checkpointer somehow to have it skip any available entries, but I'm not sure how feasible that is...

katrogan

this is awesome, thank you for the write-up!

MorpheusXAUT · 2022-07-06T19:50:08Z

Do we see Intratask Checkpoints as a cache? If so it might make sense to include them as part of an eviction.

@sbrunk Good point 🤔 Yes, we should probably also clear out all these values when we're evicting the cache for a task as we might still be accessing cached values otherwise even though it looks like we should be re-computing everything.

Since at least some of the handling is done outside flyteadmin/propeller's context (inside the actual code executed), we might have to remove these values before the execution (instead of afterwards, as suggested for the cached output). Alternatively, we could try to pass along the cache_override flag to flytekit/the Checkpointer somehow to have it skip any available entries, but I'm not sure how feasible that is...

@katrogan any idea/input on this? I'm honestly not quite sure how flytekit/the Checkpointer would handle this internally atm or if we could have it ignore its stored values somehow. I'd like to stay consistent with the flyteadmin/propeller behaviour though, if at all possible, just so we don't get some partially deleted cache somewhere.

katrogan · 2022-07-06T20:20:12Z

cc @kumare3 for the intra-task checkpointing bits

MorpheusXAUT · 2022-07-18T14:51:53Z

@kumare3 Any comment on the Intratask Checkpoints topic? I believe we should remove/skip those values as well, but not sure what the best way to do so would be?

@katrogan @pmahindrakar-oss I've added some of the comments from this thread to the RFC doc. Please take another look at the changes to see if they satisfy your comments/questions or if there's something we should clarify in more detail.

hamersaw · 2022-07-19T04:52:18Z

@kumare3 Any comment on the Intratask Checkpoints topic? I believe we should remove/skip those values as well, but not sure what the best way to do so would be?

Intra-task checkpoints should only be applied during consecutive retries of a task within the same workflow execution. So if there was a new workflow execution with the cache_override parameter set, existing intra-task checkpoints from previous workflow executions would not affect correctness. If the concern is space and the goal is to delete the checkpoints this is another issue, and I think supporting this functionality is far beyond the scope of this proposal.
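A small sketch of why that scoping keeps things correct, assuming checkpoints are keyed by execution. `CheckpointStore` and its key layout are hypothetical and do not reflect how flytekit's Checkpointer actually stores data.

```python
class CheckpointStore:
    """Hypothetical store keyed by (execution_id, task_id); not flytekit's Checkpointer."""

    def __init__(self):
        self._data = {}

    def save(self, execution_id, task_id, state):
        self._data[(execution_id, task_id)] = state

    def restore(self, execution_id, task_id):
        # Only checkpoints written by the same execution are visible.
        return self._data.get((execution_id, task_id))


store = CheckpointStore()
store.save("exec-1", "train", {"epoch": 3})

retry_state = store.restore("exec-1", "train")     # retry within the same execution
new_exec_state = store.restore("exec-2", "train")  # new execution sees nothing
```

Under that assumption a new execution with cache_override set starts from scratch regardless of older checkpoints, so only storage cleanup remains as a concern.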

MorpheusXAUT · 2022-07-19T04:57:23Z

Intra-task checkpoints should only be applied during consecutive retries of a task within the same workflow execution. So if there was a new workflow execution with the cache_override parameter set, existing intra-task checkpoints from previous workflow executions would not affect correctness. If the concern is space and the goal is to delete the checkpoints this is another issue, and I think supporting this functionality is far beyond the scope of this proposal.

Ah, great, didn't know about that, thanks for the info @hamersaw 👍

I agree we should keep it out of the RFC, if possible, then. Might be worth adding a follow-up issue for the sake of completeness when cleaning up cached data, but otherwise we might be increasing an already quite extended scope even more.

kumare3 · 2022-07-19T15:18:42Z

Checkpoints are localized to a single execution today, not shared through cache - maybe we should do that in the future

pmahindrakar-oss · 2022-07-26T15:45:57Z

@katrogan @pmahindrakar-oss I've added some of the comments from this thread to the RFC doc. Please take another look at the changes to see if they satisfy your comments/questions or if there's something we should clarify in more detail.

Looks good to me @MorpheusXAUT for the suggested changes

katrogan

LGTM. Can we make sure to update to include details on recursively evicting dynamic and subworkflow nodes?

MorpheusXAUT · 2022-07-26T18:54:32Z

LGTM. Can we make sure to update to include details on recursively evicting dynamic and subworkflow nodes?

Sure thing, will update tomorrow morning 👍

MorpheusXAUT · 2022-07-27T09:05:09Z

@katrogan Added some details about dynamic/workflow nodes and partial failures and slightly re-arranged the doc to emphasize we prefer extending the existing endpoints/adding a new UpdateTaskExecution instead of implementing completely new/independent endpoints.

katrogan · 2022-07-27T21:50:01Z

thank you @MorpheusXAUT looks great!

Nick Müller added 2 commits June 27, 2022 18:59

Initial cache eviction RFC draft

ea7b0dd

Signed-off-by: Nick Müller <[email protected]>

Updated PR number for RFC doc

62650c7

Signed-off-by: Nick Müller <[email protected]>

pmahindrakar-oss reviewed Jul 6, 2022

katrogan reviewed Jul 6, 2022

Adapted cache eviction RFC for comments/feedback

dd8abf6

Signed-off-by: Nick Müller <[email protected]>

Adapted cache eviction RFC for comments/feedback

a3d40d4

Signed-off-by: Nick Müller <[email protected]>

katrogan previously approved these changes Jul 26, 2022

Adapted cache eviction RFC for comments/feedback

37faa75

Signed-off-by: Nick Müller <[email protected]>

MorpheusXAUT dismissed katrogan’s stale review via 37faa75 July 27, 2022 08:43

hamersaw self-requested a review July 27, 2022 14:32

hamersaw approved these changes Jul 27, 2022

katrogan approved these changes Jul 27, 2022

hamersaw merged commit 3d265f1 into flyteorg:master Jul 29, 2022

MorpheusXAUT mentioned this pull request Sep 12, 2022

[Core feature] Cache eviction override for a single execution #2867

Open

2 tasks

MorpheusXAUT mentioned this pull request Jan 7, 2023

Cache eviction of past executions flyteorg/flyteadmin#504

Draft

8 tasks
