
Load data from intermediate after processing? #517

Closed
jmrichardson opened this issue Sep 18, 2020 · 4 comments

Comments

@jmrichardson commented Sep 18, 2020

Hi, I am new to Kedro. I have been looking through the documentation, but I can't find a reference for automatically loading the intermediate (already processed) dataset instead of reprocessing it each time I run a pipeline. In other words, I would like to pre-process a file and save it to an intermediate location:

kibot_minute_ibm:
  type: pandas.CSVDataSet
  filepath: data/01_raw/kibot/minute/ibm.csv

X_trn:
  type: pickle.PickleDataSet
  filepath: data/02_intermediate/X_trn.pkl

X_tst:
  type: pickle.PickleDataSet
  filepath: data/02_intermediate/X_tst.pkl

The above does that, but the next time I run "kedro run" it executes the whole pipeline again, even though the original source data file hasn't changed. Is there a way to enable caching so a node is skipped when neither its code nor its input data has changed?

@mzjp2 (Contributor) commented Sep 18, 2020

I believe this isn't currently supported (at least out of the box) by Kedro. There is CachedDataSet, but that caches within a single Kedro run, not between runs. A common pattern here, which you might like to adopt, is to have two (or more!) pipelines:

  • raw_to_intermediary, a pipeline whose first node(s) take raw datasets as input and whose last node(s) output intermediary ones, saving them to disk.
  • process_intermediary, a pipeline whose first node(s) take those intermediary datasets and process them.

Then kedro run --pipeline=raw_to_intermediary processes the raw data and saves the intermediary data to disk. You can then experiment within the process_intermediary pipeline and run kedro run --pipeline=process_intermediary to execute only the second pipeline, using the saved data on disk as input, without re-running raw_to_intermediary.

You can then re-run kedro run --pipeline=raw_to_intermediary whenever the source file changes, or whenever you change the code within that pipeline. It's a somewhat manual solution, but it works nonetheless.
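The manual pattern above can be illustrated in plain Python (this is a generic sketch, not Kedro API; the file paths and function bodies are hypothetical stand-ins for the two pipelines): the driver re-runs the expensive first stage only when its outputs are missing or a re-run is forced.

```python
from pathlib import Path

# Hypothetical intermediate outputs, mirroring the catalog entries above.
INTERMEDIATE = [
    Path("data/02_intermediate/X_trn.pkl"),
    Path("data/02_intermediate/X_tst.pkl"),
]

def raw_to_intermediary():
    # Stand-in for the expensive preprocessing pipeline.
    for path in INTERMEDIATE:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"processed")

def process_intermediary():
    # Stand-in for the downstream pipeline; reads the saved files.
    return [path.read_bytes() for path in INTERMEDIATE]

def run(force=False):
    # Run the first stage only if forced or any of its outputs is missing,
    # then always run the second stage.
    if force or not all(path.exists() for path in INTERMEDIATE):
        raw_to_intermediary()
    return process_intermediary()
```

This mirrors the two-pipeline workflow: `force=True` plays the role of explicitly invoking the raw_to_intermediary pipeline after the source changes.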

@Minyus (Contributor) commented Sep 19, 2020

Hi @jmrichardson,

Unfortunately, this feature isn't supported by Kedro's high-level API (Kedro context or CLI), although several Kedro users have requested it:

#30 @gotin
#55 @Minyus
#60 @Minyus
#82 @anuarora1990

I have seen three approaches from Kedro users:

  1. Since this feature is exposed by Kedro's low-level runner API as run_only_missing, I (@Minyus) implemented it in a custom Kedro context in my PipelineX package at:
    https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/framework/context/flexible_run_context.py#L131

  2. @miyamonz posted a great suggestion so users can add the feature easily.
    override runner.run line in KedroContext run method #509 @miyamonz

  3. @deepyaman implemented TeePlugin:
    Run pipeline without reading from intermediate datasets #420 @deepyaman
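The core idea behind a run-only-missing strategy can be sketched in plain Python (the Node class and the node-selection function below are simplified stand-ins, not Kedro's actual classes): a node must run if any of its declared outputs is missing, and every node downstream of a re-run node must run too, because its inputs will be regenerated.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

def nodes_to_run(nodes, existing):
    """Return the names of nodes that must run: nodes with a missing
    output, plus every node downstream of a node that re-runs.
    Assumes `nodes` is already in topological order."""
    stale = set()  # datasets that will be (re)produced this run
    to_run = []
    for node in nodes:
        missing_output = any(o not in existing for o in node.outputs)
        upstream_changed = any(i in stale for i in node.inputs)
        if missing_output or upstream_changed:
            to_run.append(node.name)
            stale.update(node.outputs)
    return to_run
```

For example, with a preprocess node producing X_trn.pkl and X_tst.pkl and a train node producing model.pkl, if the intermediates already exist on disk but the model does not, only the train node is selected.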

I hope Kedro will support this feature, as other tools such as Spotify's Luigi already do.

@921kiyo (Contributor) commented Oct 2, 2020

There's an ongoing discussion at https://discourse.kedro.community/t/speeding-up-pipeline-processing-with-change-detection/90 about how this could be supported.
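One common way to implement such change detection (a generic sketch, not Kedro code; the state-file name is hypothetical) is to fingerprint each input file and re-run a stage only when the fingerprint differs from the one recorded on the previous run:

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_rerun(path, state_file=".pipeline_digests.json"):
    """Return True if `path` changed since its digest was last recorded.
    Records the new digest as a side effect."""
    state = Path(state_file)
    digests = json.loads(state.read_text()) if state.exists() else {}
    current = file_digest(path)
    changed = digests.get(str(path)) != current
    digests[str(path)] = current
    state.write_text(json.dumps(digests))
    return changed
```

Content hashing catches edits that leave the modification time unchanged, at the cost of reading each input file once per check.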

stale bot commented Apr 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
