Add caching to local execution #592

Merged 27 commits on Aug 17, 2021

Conversation

eapolinario (Collaborator) commented Aug 12, 2021

TL;DR

Enable the ability to cache the output of tasks in local executions

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

joblib.Memory provides a persistent store for caching the results of function invocations. As in the hosted case, we use the pair (task inputs, cache version) to memoize task outputs across executions.
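
To illustrate the keying scheme described above, here is a minimal stdlib-only sketch. It is a stand-in for joblib.Memory, not its implementation and not the code in this PR; `DiskCache`, `get_or_compute`, and `square` are hypothetical names made up for the example. It only shows the idea of persisting results keyed on the pair (task inputs, cache version).

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class DiskCache:
    """Toy persistent store (hypothetical; a stand-in for joblib.Memory)."""

    def __init__(self, location: str) -> None:
        self._root = Path(location)
        self._root.mkdir(parents=True, exist_ok=True)

    def _path(self, inputs: dict, cache_version: str) -> Path:
        # Key on the pair (task inputs, cache version). Items are sorted so
        # that keyword order does not change the key.
        payload = pickle.dumps((sorted(inputs.items()), cache_version))
        return self._root / (hashlib.sha256(payload).hexdigest() + ".pkl")

    def get_or_compute(self, fn, inputs: dict, cache_version: str):
        path = self._path(inputs, cache_version)
        if path.exists():
            # Cache hit: load the stored result instead of calling fn.
            return pickle.loads(path.read_bytes())
        result = fn(**inputs)
        path.write_bytes(pickle.dumps(result))
        return result


calls = []  # records every real invocation of the "task"


def square(n: int) -> int:
    calls.append(n)
    return n * n


cache = DiskCache(tempfile.mkdtemp())
first = cache.get_or_compute(square, {"n": 4}, cache_version="v1")
second = cache.get_or_compute(square, {"n": 4}, cache_version="v1")  # served from disk
third = cache.get_or_compute(square, {"n": 4}, cache_version="v2")  # new version, recomputed
```

In this sketch the function body runs twice: once for "v1" and once more after the version is bumped to "v2".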

We're also adding another command to pyflyte to help users clear the cache. In the future we might extend this to allow for the inspection of the values in the local cache.

Tracking Issue

flyteorg/flyte#761

Follow-up issue

NA

eduardo apolinario and others added 5 commits August 12, 2021 00:03
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>

welcome bot commented Aug 12, 2021

Thank you for opening this pull request! 🙌
These tips will help get your PR across the finish line:
  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
flytekit/core/base_task.py (outdated, resolved)

codecov bot commented Aug 12, 2021

Codecov Report

Merging #592 (9cbc70a) into master (fc65e97) will increase coverage by 0.06%.
The diff coverage is 97.05%.

@@            Coverage Diff             @@
##           master     #592      +/-   ##
==========================================
+ Coverage   85.51%   85.57%   +0.06%     
==========================================
  Files         376      379       +3     
  Lines       29311    29682     +371     
  Branches     2357     2376      +19     
==========================================
+ Hits        25064    25400     +336     
- Misses       3611     3640      +29     
- Partials      636      642       +6     
Impacted Files Coverage Δ
flytekit/clis/sdk_in_container/local_cache.py 77.77% <77.77%> (ø)
flytekit/core/local_cache.py 80.95% <80.95%> (ø)
flytekit/clis/sdk_in_container/pyflyte.py 82.45% <100.00%> (+0.63%) ⬆️
flytekit/core/base_task.py 88.26% <100.00%> (+0.26%) ⬆️
tests/flytekit/unit/core/test_local_cache.py 100.00% <100.00%> (ø)
tests/flytekit/integration/remote/test_remote.py 82.26% <0.00%> (-5.43%) ⬇️
tests/flytekit/unit/core/test_node_creation.py 95.58% <0.00%> (-2.09%) ⬇️
flytekit/remote/remote.py 72.06% <0.00%> (-2.06%) ⬇️
tests/flytekit/unit/core/test_launch_plan.py 93.96% <0.00%> (-0.72%) ⬇️
... and 16 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fc65e97...9cbc70a.

Makefile (outdated, resolved)
eapolinario changed the title from "[wip] Add caching to local execution" to "Add caching to local execution" on Aug 13, 2021
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
kumare3 previously approved these changes Aug 13, 2021

katrogan (Contributor) left a comment

looks great, can we add some more tests!

outputs_literal_map = self.dispatch_execute(ctx, input_literal_map)
# if metadata.cache is set, check memoized version
if self._metadata.cache:
    # The cache key is composed only of '(input_literal_map, cache_version)', i.e. all other parameters

Contributor

should we also include the task name as part of the key?

similar to what we do with the identifier on hosted flyte https://github.com/flyteorg/flyteplugins/blob/master/go/tasks/pluginmachinery/catalog/client.go#L26

Collaborator Author

The task name is already implicitly part of the cache key (in other words, joblib.Memory caches results separately for each function). I'm going to update the comment to reflect that.

Collaborator Author

Actually, this is a really good call. I'm going to add a test case to demonstrate the problem and subsequent fix.

Collaborator Author

Fixed by 130b432.
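
For readers following the thread, here is a small sketch of the problem being discussed. This is not the code from 130b432, which is not shown in this conversation; `cache_key`, `run_cached`, `double`, and `triple` are hypothetical names, and an in-memory dict stands in for the real store. It only shows why a key built from (inputs, cache version) alone can collide across tasks, and how adding the task name separates them.

```python
import hashlib
import pickle


def cache_key(inputs: dict, cache_version: str, task_name: str = "") -> str:
    # Hypothetical key function: hash of (task name, cache version, inputs).
    payload = pickle.dumps((task_name, cache_version, sorted(inputs.items())))
    return hashlib.sha256(payload).hexdigest()


store = {}  # in-memory stand-in for the on-disk cache


def run_cached(fn, inputs: dict, cache_version: str, include_task_name: bool):
    name = f"{fn.__module__}.{fn.__qualname__}" if include_task_name else ""
    key = cache_key(inputs, cache_version, name)
    if key not in store:
        store[key] = fn(**inputs)
    return store[key]


def double(n: int) -> int:
    return 2 * n


def triple(n: int) -> int:
    return 3 * n


# Without the task name, both tasks map to the same key, so `triple`
# is handed the result that `double` stored.
bad = (
    run_cached(double, {"n": 5}, "v1", include_task_name=False),
    run_cached(triple, {"n": 5}, "v1", include_task_name=False),
)

store.clear()

# With the task name in the key, each task gets its own entry.
good = (
    run_cached(double, {"n": 5}, "v1", include_task_name=True),
    run_cached(triple, {"n": 5}, "v1", include_task_name=True),
)
```

Here `bad` ends up as `(10, 10)` while `good` is `(10, 15)`.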

@@ -0,0 +1,78 @@
from pytest import fixture

Contributor

can we test some more complex input and output types, like dataclasses, schemas, files?


@staticmethod
def initialize():
    LocalCache._memory = Memory(CACHE_LOCATION, verbose=5)

Contributor

should we make the verbosity configurable too?

Collaborator Author

I don't think we gain a lot from making this configurable, especially given that configuration files are not used at all in local executions. I'm going to leave a TODO as well.

Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>

def test_wf_custom_types():
    @dataclass_json
    @dataclass

Contributor

awesome, this works? ran into issues with dataclass serialization last time i tried this but maybe joblib has since worked around this. woohoo!

Contributor

A question about this: are we storing the LiteralMap or the native Python values? Also, what happens to file references: do we store a shallow copy or a deep copy?
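
To make the shallow-versus-deep distinction in this question concrete, here is a tiny stdlib sketch. It says nothing about what flytekit or joblib actually do, since the thread does not answer the question; the variable names are made up for the example. It only shows how a cached path and cached content diverge once the underlying file changes.

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
output_file = workdir / "result.txt"
output_file.write_text("first run")

# "Shallow" entry: the cache keeps only the reference (the path).
shallow_entry = str(output_file)
# "Deep" entry: the cache keeps a copy of the file's content.
deep_entry = output_file.read_bytes()

# Something later overwrites the file the task produced.
output_file.write_text("overwritten later")

shallow_value = Path(shallow_entry).read_text()  # follows the path, sees new content
deep_value = deep_entry.decode()  # still the content from the cached run
```

With a shallow entry, a cache hit returns whatever is at the path now; with a deep entry, it returns what the task produced at the time.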

eapolinario merged commit 86d3368 into master on Aug 17, 2021