
Load data from intermediate after processing? #517

Closed
jmrichardson opened this issue Sep 18, 2020 · 4 comments

Comments

@jmrichardson commented Sep 18, 2020

Hi, I am new to Kedro. I have been looking through the documentation, but I can't find a reference for automatically loading the intermediate (already processed) dataset instead of reprocessing it each time I run a pipeline. In other words, I would like to pre-process a file and save it to an intermediate location:

kibot_minute_ibm:
  type: pandas.CSVDataSet
  filepath: data/01_raw/kibot/minute/ibm.csv

X_trn:
  type: pickle.PickleDataSet
  filepath: data/02_intermediate/X_trn.pkl

X_tst:
  type: pickle.PickleDataSet
  filepath: data/02_intermediate/X_tst.pkl

The above does that, but the next time I run "kedro run" it executes the whole pipeline again, even though the original source data file hasn't changed. Is there a way to enable caching so a node is skipped when neither its code nor its input data has changed?

@mzjp2 (Contributor) commented Sep 18, 2020

I believe this isn't currently supported (at least out of the box) by Kedro. There is CachedDataSet, but that caches within a single Kedro run, not between runs. A common pattern here, which you might like to adopt, is to have two (or more!) pipelines:

  • raw_to_intermediary, a pipeline whose first node(s) take raw datasets as input and whose last node(s) output intermediary ones, saving them to disk.
  • process_intermediary, a pipeline whose first node(s) take those intermediary datasets and process them.

Then kedro run --pipeline=raw_to_intermediary processes the raw data and saves the intermediary data to disk. You can then experiment within the process_intermediary pipeline and run kedro run --pipeline=process_intermediary to execute only the second pipeline, using the saved data on disk as input, without re-running raw_to_intermediary.

You can then re-run kedro run --pipeline=raw_to_intermediary whenever the source file changes, or whenever you change the code within that pipeline. It's a somewhat manual solution, but it works nonetheless.
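The manual pattern above can be illustrated in plain Python (this is a generic sketch, not Kedro API; the file paths and function bodies are hypothetical stand-ins for the two pipelines): the driver re-runs the expensive first stage only when its outputs are missing or a re-run is forced.

```python
from pathlib import Path

# Hypothetical intermediate outputs, mirroring the catalog entries above.
INTERMEDIATE = [
    Path("data/02_intermediate/X_trn.pkl"),
    Path("data/02_intermediate/X_tst.pkl"),
]

def raw_to_intermediary():
    # Stand-in for the expensive preprocessing pipeline.
    for path in INTERMEDIATE:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"processed")

def process_intermediary():
    # Stand-in for the downstream pipeline; reads the saved files.
    return [path.read_bytes() for path in INTERMEDIATE]

def run(force=False):
    # Run the first stage only if forced or any of its outputs is missing,
    # then always run the second stage.
    if force or not all(path.exists() for path in INTERMEDIATE):
        raw_to_intermediary()
    return process_intermediary()
```

This mirrors the two-pipeline workflow: `force=True` plays the role of explicitly invoking the raw_to_intermediary pipeline after the source changes.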

@Minyus (Contributor) commented Sep 19, 2020

Hi @jmrichardson,

Unfortunately, this feature isn't supported by Kedro's high-level API (Kedro context or CLI), although several Kedro users have requested it:

#30 @gotin
#55 @Minyus
#60 @Minyus
#82 @anuarora1990

I have seen three approaches from Kedro users:

  1. Since this feature is exposed by Kedro's low-level runner API as run_only_missing, I (@Minyus) implemented it in a custom Kedro context in my PipelineX package at:
    https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/framework/context/flexible_run_context.py#L131

  2. @miyamonz posted a great suggestion so users can add the feature easily.
    override runner.run line in KedroContext run method #509 @miyamonz

  3. @deepyaman implemented TeePlugin:
    Run pipeline without reading from intermediate datasets #420 @deepyaman
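The core idea behind a run-only-missing strategy can be sketched in plain Python (the Node class and the node-selection function below are simplified stand-ins, not Kedro's actual classes): a node must run if any of its declared outputs is missing, and every node downstream of a re-run node must run too, because its inputs will be regenerated.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

def nodes_to_run(nodes, existing):
    """Return the names of nodes that must run: nodes with a missing
    output, plus every node downstream of a node that re-runs.
    Assumes `nodes` is already in topological order."""
    stale = set()  # datasets that will be (re)produced this run
    to_run = []
    for node in nodes:
        missing_output = any(o not in existing for o in node.outputs)
        upstream_changed = any(i in stale for i in node.inputs)
        if missing_output or upstream_changed:
            to_run.append(node.name)
            stale.update(node.outputs)
    return to_run
```

For example, with a preprocess node producing X_trn.pkl and X_tst.pkl and a train node producing model.pkl, if the intermediates already exist on disk but the model does not, only the train node is selected.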

I hope Kedro will support this feature, as other tools such as Spotify's Luigi already do.

@921kiyo (Contributor) commented Oct 2, 2020

There's an ongoing discussion at https://discourse.kedro.community/t/speeding-up-pipeline-processing-with-change-detection/90 about how this could be supported.
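One common way to implement such change detection (a generic sketch, not Kedro code; the state-file name is hypothetical) is to fingerprint each input file and re-run a stage only when the fingerprint differs from the one recorded on the previous run:

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_rerun(path, state_file=".pipeline_digests.json"):
    """Return True if `path` changed since its digest was last recorded.
    Records the new digest as a side effect."""
    state = Path(state_file)
    digests = json.loads(state.read_text()) if state.exists() else {}
    current = file_digest(path)
    changed = digests.get(str(path)) != current
    digests[str(path)] = current
    state.write_text(json.dumps(digests))
    return changed
```

Content hashing catches edits that leave the modification time unchanged, at the cost of reading each input file once per check.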

stale bot commented Apr 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
