Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for Eviction of cached task outputs #2633

Merged

Conversation

MorpheusXAUT
Copy link
Contributor

https://hackmd.io/qOztkaj4Rb6ypodvGEowAg?view

Comments already present on the HackMD doc are from our internal team and have been left in for clarification/further discussion.

Initial discussion on Slack

Nick Müller added 2 commits June 27, 2022 18:59
Signed-off-by: Nick Müller <[email protected]>
Signed-off-by: Nick Müller <[email protected]>
@paulbes
Copy link

paulbes commented Jun 30, 2022

to perform housekeeping and clean up their cache, cleanly removing the cached data of previously executions that are no longer relevant executions

We also see a use case for this cache eviction functionality from a privacy perspective; the right to be forgotten.

GDPR (30 days to complete a deletion request):

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay.

CCPA (45 days to complete a deletion request):

A consumer shall have the right to request that a business delete any personal information about the consumer which the business has collected from the consumer.

When processing personally identifiable information (PII) to adhere to, e.g., GDPR, we generally only retain data for 30 days in our workflows. We do this to ensure that any deletion requests from users whose data we are processing will be fulfilled "automagically" within the expected timeframe for the "right to be forgotten".

With the cache eviction API we would be able to build a system that could evict the cache for certain types of workflows or tasks within a given timeframe.

With that said, and in fear of adding scope creep to this RFC, it might be even better to have the ability to set a TTL on the cache as an attribute, e.g.,

 # Might make sense to only allow for a quite coarse-granularity
cache_ttl_hours = 30 * 24
cache_ttl_days = 30

@MorpheusXAUT
Copy link
Contributor Author

MorpheusXAUT commented Jun 30, 2022

@paulbes Interestingly enough, we talked about automatic expiration of cached values after a certain timespan internally just yesterday 😄 Our discussion was mainly focused on a housekeeping/cache size perspective rather than GDPR related, but I agree it could be useful for that as well.

With that said, and in fear of adding scope creep to this RFC, it might be even better to have the ability to set a TTL on the cache as an attribute, e.g.,

 # Might make sense to only allow for a quite coarse-granularity
cache_ttl_hours = 30 * 24
cache_ttl_days = 30

I agree that'd be great to have, but I'm not sure either if we want to include it in this RFC or keep it separate/add afterwards so we don't extend the scope too much...
If people feel we should include this addition as well, I can take another look at what we've discussed internally and adapt the RFC accordingly.

@hamersaw
Copy link
Contributor

hamersaw commented Jul 5, 2022

@paulbes Interestingly enough, we talked about automatic expiration of cached values after a certain timespan internally just yesterday smile Our discussion was mainly focused on a housekeeping/cache size perspective rather than GDPR related, but I agree it could be useful for that as well.

With that said, and in fear of adding scope creep to this RFC, it might be even better to have the ability to set a TTL on the cache as an attribute, e.g.,

 # Might make sense to only allow for a quite coarse-granularity
cache_ttl_hours = 30 * 24
cache_ttl_days = 30

I agree that'd be great to have, but I'm not sure either if we want to include it in this RFC or keep it separate/add afterwards so we don't extend the scope too much... If people feel we should include this addition as well, I can take another look at what we've discussed internally and adapt the RFC accordingly.

A few thoughts on this. I know we have discussed including a cache_expiration parameter onto cached tasks similar to the cache_version and cache_serialized that current exist. That would simply define a duration after which the cache is expired (e.x. cache_expiration=45d) and overwrite the cache if we find it beyond expiration. It doesn't sound like the goals of either use case discussed (i.e. (1) general cache cleanup to reduce cache size or (2) GDPR forget) are covered with that solution because nothing is deleted until a new task is executed to overwrite the expired value.

I think implementation of a cache eviction API does open possibilities for automated eviction. To support automated eviction it sounds like a separate service as @paulbes suggested or an additional component into one of the Flyte core services that periodically scrapes and GCs the cache is required if I am understanding this correctly? IMO it is a deep enough topic to require a separate discussion.

I think it could be a very useful feature, maybe open a new issue to track it?

@MorpheusXAUT
Copy link
Contributor Author

MorpheusXAUT commented Jul 6, 2022

A few thoughts on this. I know we have discussed including a cache_expiration parameter onto cached tasks similar to the cache_version and cache_serialized that current exist. That would simply define a duration after which the cache is expired (e.x. cache_expiration=45d) and overwrite the cache if we find it beyond expiration. It doesn't sound like the goals of either use case discussed (i.e. (1) general cache cleanup to reduce cache size or (2) GDPR forget) are covered with that solution because nothing is deleted until a new task is executed to overwrite the expired value.

I think implementation of a cache eviction API does open possibilities for automated eviction. To support automated eviction it sounds like a separate service as @paulbes suggested or an additional component into one of the Flyte core services that periodically scrapes and GCs the cache is required if I am understanding this correctly? IMO it is a deep enough topic to require a separate discussion.

@hamersaw Yes, you're correct, the addition mentioned by @paulbes (and the one we talked about internally) would require an additional service/component within Flyte to periodically check all cached entries and evict them should they have cross a certain expiry threshold.
Just adding a cache_expiration parameter to the task could also be useful, but doesn't solve the issue we're trying to solve per se.

I think it could be a very useful feature, maybe open a new issue to track it?

I agree, I'll open another issue to track this idea once the RFC gets accepted! Would definitely be a useful feature to have in Flyte itself, I'd say.

@sbrunk
Copy link
Member

sbrunk commented Jul 6, 2022

Do we see Intratask Checkpoints as a cache? If so it might make sense to include them as part of an eviction.

@MorpheusXAUT
Copy link
Contributor Author

MorpheusXAUT commented Jul 6, 2022

Do we see Intratask Checkpoints as a cache? If so it might make sense to include them as part of an eviction.

@sbrunk Good point 🤔 Yes, we should probably also clear out all these values when we're evicting the cache for a task as we might still be accessing cached values otherwise even though it looks like we should be re-computing everything.

Since at least some of the handling is done outside flyteadmin/propeller's context (inside the actual code executed), we might have to remove these values before the execution (instead of afterwards, as suggested for the cached output).
Alternatively, we could try to pass along the cache_override flag to flytekit/the Checkpointer somehow to have it skip any available entries, but I'm not sure how feasible that is...

Copy link
Contributor

@katrogan katrogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome, thank you for the write-up!

rfc/system/2633-eviction-of-cached-task-outputs.md Outdated Show resolved Hide resolved
rfc/system/2633-eviction-of-cached-task-outputs.md Outdated Show resolved Hide resolved
@MorpheusXAUT
Copy link
Contributor Author

Do we see Intratask Checkpoints as a cache? If so it might make sense to include them as part of an eviction.

@sbrunk Good point 🤔 Yes, we should probably also clear out all these values when we're evicting the cache for a task as we might still be accessing cached values otherwise even though it looks like we should be re-computing everything.

Since at least some of the handling is done outside flyteadmin/propeller's context (inside the actual code executed), we might have to remove these values before the execution (instead of afterwards, as suggested for the cached output). Alternatively, we could try to pass along the cache_override flag to flytekit/the Checkpointer somehow to have it skip any available entries, but I'm not sure how feasible that is...

@katrogan any idea/input on this? I'm honestly not quite sure how flytekit/the Checkpointer would handle this internally atm or if we could have it ignore its stored values somehow. I'd like to stay consistent with the flyteadmin/propeller behaviour though, if it all possible, just so we don't get some partially deleted cache somewhere.

@katrogan
Copy link
Contributor

katrogan commented Jul 6, 2022

cc @kumare3 for the intra-task checkpointing bits

@MorpheusXAUT
Copy link
Contributor Author

@kumare3 Any comment on the Intratask Checkpoints topic? I believe we should remove/skip those values as well, but not sure what the best way to do so would be?

@katrogan @pmahindrakar-oss I've added some of the comments from this thread to the RFC doc. Please take another look at the changes to see if they satisfy your comments/questions or if there's something we should clarify in more detail.

@hamersaw
Copy link
Contributor

@kumare3 Any comment on the Intratask Checkpoints topic? I believe we should remove/skip those values as well, but not sure what the best way to do so would be?

Intra-task checkpoints should only be applied during consecutive retries of a task within the same workflow execution. So if there was a new workflow execution with the cache_override parameter set, existing intra-task checkpoints from previous workflow executions would not affect correctness. If the concern is space and the goal is to delete the checkpoints this is another issue, and I think supporting this functionality is far beyond the scope of this proposal.

@MorpheusXAUT
Copy link
Contributor Author

Intra-task checkpoints should only be applied during consecutive retries of a task within the same workflow execution. So if there was a new workflow execution with the cache_override parameter set, existing intra-task checkpoints from previous workflow executions would not affect correctness. If the concern is space and the goal is to delete the checkpoints this is another issue, and I think supporting this functionality is far beyond the scope of this proposal.

Ah, great, didn't know about that, thanks for the info @hamersaw 👍

I agree we should keep it out of the RFC, if possible, then. Might be worth to add a follow up issue for the sake of completeness when cleaning up cached data, but otherwise we might be increasing an already quite extended scope even more.

@kumare3
Copy link
Contributor

kumare3 commented Jul 19, 2022

Checkpoints are localized to a single execution today not shared through cache - maybe we should do that in the future

@pmahindrakar-oss
Copy link
Contributor

@katrogan @pmahindrakar-oss I've added some of the comments from this thread to the RFC doc. Please take another look at the changes to see if they satisfy your comments/questions or if there's something we should clarify in more detail.

Look good to me @MorpheusXAUT for the suggested changes

katrogan
katrogan previously approved these changes Jul 26, 2022
Copy link
Contributor

@katrogan katrogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can we make sure to update to include details on recursively evicting dynamic and subworkflow nodes?

@MorpheusXAUT
Copy link
Contributor Author

LGTM. Can we make sure to update to include details on recursively evicting dynamic and subworkflow nodes?

Sure thing, will update tomorrow morning 👍

@MorpheusXAUT
Copy link
Contributor Author

@katrogan Added some details about dynamic/workflow nodes and partial failures and slightly re-arranged the doc to emphasize we prefer extending the existing endpoints/adding a new UpdateTaskExecution instead of implementing completely new/independent endpoints.

@hamersaw hamersaw self-requested a review July 27, 2022 14:32
@katrogan
Copy link
Contributor

thank you @MorpheusXAUT looks great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants