Incremental runs/"Run only missing" #221

Closed
yetudada opened this issue Feb 14, 2020 · 6 comments
Labels: Issue: Feature Request (New feature or improvement to existing feature)

Comments

@yetudada
Contributor

Description

We're taking the principle of Change Data Capture a step further and looking at a way for Kedro to recognise code, parameter and data changes and re-run only the sections that need to be rebuilt, along with the downstream pipelines they affect.
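As a rough illustration of "re-run only what is affected", the sketch below propagates a set of changed nodes through a dependency graph to find everything downstream. It is plain Python and does not use any Kedro API; the function name `nodes_to_rerun`, the `edges` mapping and the node names are all hypothetical.

```python
from collections import deque


def nodes_to_rerun(edges, changed):
    """Return every node that changed or sits downstream of a changed node.

    edges:   dict mapping a node name to the nodes that consume its outputs.
    changed: iterable of node names whose code, parameters or inputs changed.
    """
    to_run = set(changed)
    queue = deque(to_run)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child not in to_run:
                to_run.add(child)
                queue.append(child)
    return to_run


# Hypothetical linear pipeline: clean -> features -> train -> report
edges = {"clean": ["features"], "features": ["train"], "train": ["report"]}

# Only "features" changed, so "clean" can be skipped.
print(sorted(nodes_to_rerun(edges, ["features"])))
```

In this toy graph a change to `features` forces `train` and `report` to re-run, while `clean` is left alone. Detecting which nodes changed in the first place is a separate problem.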

You have called this "run only missing" in #82 and #30, and we're finally getting smart about it.

Context

We're going to help shorten your development time when running your pipeline because you don't have to worry about re-running the entire pipeline anymore.

@yetudada
Contributor Author

yetudada commented Mar 9, 2020

I'm going to link some of #225 to this. It had some great ideas that we could use here.

@merelcht merelcht added the pinned Issue shouldn't be closed by stale bot label Mar 31, 2021
@idanov idanov self-assigned this Jan 24, 2022
@astrojuanlu
Member

This was brought up by a user recently (cc @pascalwhoop), but the title of the issue might make it difficult to locate. "Run only missing", "incremental runs", "change detection" could be some possible themes.

It is worth noting that to make this feasible, kedro run would need to be stateful rather than stateless. A plugin could potentially take care of that, much as Kedro Viz does through the session store (https://docs.kedro.org/en/stable/experiment_tracking/#set-up-the-session-store), using some smart hashing and/or comparing the "last modified" date with the session run date.
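To make the hashing idea concrete, here is a minimal sketch of how state could be kept between runs: it stores a content hash per file in a JSON file and reports which files differ from the previous run. This is an assumption about one possible approach, not how Kedro or Kedro Viz works; `fingerprint`, `changed_files` and the state file are hypothetical names, and only the Python standard library is used.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, state_file):
    """Return the paths whose contents differ from the last recorded run.

    The previous fingerprints are read from `state_file` (hypothetical
    location), compared with the current ones, and then overwritten so
    the next call compares against this run.
    """
    state_path = Path(state_file)
    previous = json.loads(state_path.read_text()) if state_path.exists() else {}
    current = {str(p): fingerprint(p) for p in paths}
    changed = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    state_path.write_text(json.dumps(current, indent=2))
    return changed
```

On the first call every file counts as changed because no state exists yet; later calls return only files whose contents were modified. Hashing contents is more robust than comparing "last modified" timestamps but costs a full read of each file, which may matter for large datasets.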

However, this would also move us closer to the "actually-an-orchestrator" territory, which we've been trying to avoid.

I think making kedro run smarter would be a big improvement for lots of users, but before attempting this we should better understand what the alternatives are.

@noklam
Contributor

noklam commented Sep 25, 2023

This would be useful for interactive runs too. Stateful runs would also open up a "lineage" problem: if pipeline_1 creates dataset_1 and pipeline_2 depends on dataset_1, is it possible to re-create the whole run history?

These are all interesting and useful features, but they are also very challenging.

@astrojuanlu
Member

Related: https://openlineage.io/

@astrojuanlu
Member

After reading more on data pipelines and Change Data Capture (this is the blog post that prompted me to come here: https://debezium.io/blog/2018/07/19/advantages-of-log-based-change-data-capture/), I think calling this "Change Capture" is quite confusing. I will rename the issue for clarity.

@astrojuanlu astrojuanlu changed the title from "Change Capture" to Incremental runs/"Run only missing" Dec 10, 2023
@astrojuanlu astrojuanlu added Issue: Feature Request New feature or improvement to existing feature and removed pinned Issue shouldn't be closed by stale bot labels Feb 4, 2024
@astrojuanlu
Member

This is essentially a duplicate of #2307.

6 participants