Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core feature] Caching for non-flyte specific offloaded objects #1581

Closed
2 tasks done
kumare3 opened this issue Oct 7, 2021 · 4 comments · Fixed by flyteorg/flytepropeller#406
Closed
2 tasks done
Assignees
Labels
enhancement New feature or request flytekit FlyteKit Python related issue flytepropeller

Comments

@kumare3
Copy link
Contributor

kumare3 commented Oct 7, 2021

Motivation: Why do you think this is important?

Currently, if users use pandas.DataFrame or a pyspark.DataFrame or pander.DataFrameSchema, Flytekit simply extracts the data from the transport Literaltype.Schema. So consider the following function

@task(cache=True, cache_version="1.0")
def sub_foo(df: pandas.DataFrame) -> pandas.DataFrame:
    return df

The above function will be cached, because the input type has the file path that is cached and hence the function is not run.

But, now consider the following

@dynamic
def foo(df: pandas.DataFrame) -> pandas.DataFrame:
     return sub_foo(df=df)

This task will not be cached. This is because the dataframe is downloaded and then re-uploaded, as the underlying task transforms are not aware that the passed dataframe was not mutated. But, if FlyteSchema is used, this would work fine.

Goal: What should the final outcome look like, ideally?

Either this should work as the user expects, i.e., cache hit, else this abnormality should be documented.

Describe alternatives you've considered

NA

Propose: Link/Inline OR Additional context

No response

  • Are you sure this issue hasn't been raised already?

@kumare3 kumare3 added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers and removed untriaged This issues has not yet been looked at by the Maintainers labels Oct 7, 2021
@kumare3
Copy link
Contributor Author

kumare3 commented Oct 7, 2021

cc @eapolinario While writing the issue I realized, it is not that big a problem?

@cosmicBboy
Copy link
Contributor

I'm facing a similar dataframe caching issue in the weather forecasting project. I'm using a dynamic workflow to manage the training of a model and the tasks within it rely on training data in the form of a dataframe. Should I use FlyteSchema for this instead?

@cosmicBboy
Copy link
Contributor

solution proposal: caching for complex data types, e.g. dataframes

For blob and schema types

  1. expose a hash method in TypeTransformer, which would implement a cache-by-value system that we would implement and maintain. Users who define custom type transformers can implement a hash method too. This would enable default caching solutions for common types like pandas dataframes.
  2. in cases where a hashing function is not provided for a particular type (e.g. new_pandas_like_library.DataFrame), introduce a cache_output_fn (naming TBD) to the @task decorator like so:
@task(
    cache=True,
    cache_output_fn={
        pd.DataFrame: lambda x: hash(x),
    },
    cache_version=CACHE_VERSION,
)
def func(x: int) -> pd.DataFrame:
    df = ...  # get a dataframe
    return df

@kumare3 kumare3 added this to the 0.18.2 milestone Nov 3, 2021
@kumare3 kumare3 added flytekit FlyteKit Python related issue flytepropeller labels Nov 3, 2021
@cosmicBboy
Copy link
Contributor

Update:
Consider a mechanism that uses s3 etag (or whatever the blob storage equivalent is) as the hash metadata for a particular reference-based data artifact (dataframe, files, etc). This offloads the burden of computing that hash to blob storage instead of manually computing the hash in the container.

@eapolinario eapolinario self-assigned this Nov 10, 2021
@eapolinario eapolinario modified the milestones: 0.18.2, 0.19.0 - Eagle Dec 1, 2021
@pingsutw pingsutw changed the title [Core feature] Caching for non-flyte specific offloaded objects c Jan 26, 2022
@EngHabu EngHabu changed the title c [Core feature] Caching for non-flyte specific offloaded objects Jan 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request flytekit FlyteKit Python related issue flytepropeller
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants