
Automatically skip running nodes with persisted outputs #2307

Open
jmholzer opened this issue Feb 10, 2023 · 12 comments
Labels
- Issue: Feature Request (New feature or improvement to existing feature)
- Stage: Technical Design 🎨 (Ticket needs to undergo technical design before implementation)
- TD: implementation (Tech Design topic on implementation of the issue)

Comments

@jmholzer (Contributor)

Description

Re-running nodes that have:

  1. Persisted outputs
  2. No upstream dependencies that would cause their output to change

is an unnecessary expense. It might be a good idea to have a flag that automatically skips running these nodes.

It is currently possible to achieve this by specifying which nodes to run from (e.g. with `kedro run --from-nodes`), though this process is manual and potentially error-prone.

Context

User @pedro-sarpen opened #2005 to address this issue, though there may be a better solution to the problem that we should investigate.

@jmholzer added the "Issue: Feature Request" label Feb 10, 2023
@antonymilne (Contributor)

This is definitely something we should have, although I don't have any concrete ideas on the best way to do it off the top of my head. The broader question of "change capture" has been discussed before but I don't think anything was properly decided on. Maybe now would be the right time to re-open those discussions.

@merelcht added this to the "Something about Runners" milestone Feb 16, 2023
@marcosfelt

I'd like to add that I'd love this feature. Currently, I have to comment out nodes in my pipelines and add their outputs to the inputs of the pipeline. That's really tedious and seems like an anti-pattern.

@sbrugman (Contributor)

(Our team is working on this and plans to open-source it.)

@merelcht added the "Stage: Technical Design 🎨" label Mar 13, 2023
@merelcht (Member)

Linking: #2410

@merelcht added the "TD: implementation" label Jul 28, 2023
@astrojuanlu (Member)

xref change capture #221

To me, the main difficulty is that doing this requires making assumptions about the node functions, in particular that they're pure, i.e. that they don't have any spurious inputs, like randomness, the current date, and so on. If we assume so, then doing some sort of hashing on the inputs is technically sufficient.

As I said in #221, this would make kedro run no longer stateless.
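Under that purity assumption, the input-hashing idea can be sketched in a few lines. This is a hypothetical illustration, not Kedro API; `fingerprint` and `should_skip` are invented names:

```python
import hashlib
import json


def fingerprint(inputs: dict) -> str:
    # Serialise the node's inputs deterministically and hash them; if the
    # node function is pure, an unchanged fingerprint means unchanged output.
    payload = json.dumps(inputs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


def should_skip(inputs: dict, last_fingerprint: "str | None") -> bool:
    # Skip only when a previous run recorded an identical fingerprint.
    # Persisting that record between runs is exactly the statefulness
    # mentioned above.
    return last_fingerprint is not None and fingerprint(inputs) == last_fingerprint
```

The runner would store each node's fingerprint alongside its persisted output and consult it on the next run.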

@sbrugman (Contributor)

Update: our team just open-sourced pycodehash, and we are working on a Kedro runner that can skip cached datasets and nodes.

@astrojuanlu (Member)

@sbrugman I was having a look at PyCodeHash, looks superb!

One question: what can we do for cases like these?

import datetime as dt
import polars as pl

def preprocess_data(df: pl.DataFrame) -> pl.DataFrame:
    now = dt.datetime.now()  # hidden, non-deterministic input
    if now.minute % 2 == 0:
        raise Exception("boom")
    return df.head()

These would effectively be cached, am I right?

@sbrugman (Contributor)

sbrugman commented Nov 29, 2023

This one is not deterministic. The time-dependent component should be a parameter/dataset in order for this to work.

(Idempotent pipelines are required)
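The fix described here amounts to lifting the clock reading into an explicit argument. A minimal sketch, with a plain list standing in for the DataFrame so the snippet has no third-party dependencies:

```python
import datetime as dt


def preprocess_data(rows: list, now: dt.datetime) -> list:
    # The timestamp is an explicit input rather than hidden state, so the
    # same (rows, now) arguments always produce the same result and the
    # node becomes safe to cache.
    if now.minute % 2 == 0:
        raise Exception("boom")
    return rows[:5]  # stand-in for df.head()
```

In Kedro terms, `now` would be supplied as a parameter or dataset instead of being read inside the node.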

@astrojuanlu (Member)

Closed #221 as a duplicate of this one. The former is older and has some extra context.

@astrojuanlu (Member)

Previously: #30, #25, #82.

@astrojuanlu (Member)

When I showed Kedro to a data scientist, this was the first thing they asked about. They were familiar with DVC.

@astrojuanlu (Member)

@froxec asked for this in #4350

Description

First of all, thank you for your efforts in developing Kedro. I believe it would be highly beneficial if Kedro had a built-in node caching feature. By node caching, I mean a mechanism to avoid re-executing a node when its inputs, outputs, and logic remain unchanged.

Context

This feature is important to me because, in some scenarios, it is necessary to run the entire pipeline multiple times with different configurations. Re-executing nodes that remain unchanged between runs can significantly increase the time required for experiments.

For instance, when tracking pipeline parameters using MLflow, we need to run the entire pipeline to record parameters for every node, because kedro-mlflow records parameters node by node.

Possible Implementation

There is already an existing plugin, kedro-cache, that implements similar functionality. The plugin is well-written and could work effectively with some adjustments. However, it is outdated and incompatible with the most recent Kedro releases. Moreover, there are compatibility issues with specific datasets, such as tracking.JSONDataset and tracking.MetricsDataset, which are write-only and cannot be loaded.

I believe that integrating node caching directly into Kedro's core design would help mitigate such compatibility issues and provide a more robust solution for users.
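The "logic remains unchanged" part of such a check could, in a core implementation, hash the node function itself. The `logic_hash` below is a naive, hypothetical illustration of the idea, not how kedro-cache or PyCodeHash actually work:

```python
import hashlib


def logic_hash(func) -> str:
    # Hash the compiled bytecode and constants rather than the source text,
    # so pure formatting or comment edits do not invalidate the cache.
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()


def identity(x):
    return x


def double(x):
    return 2 * x
```

A real implementation would also have to follow calls into helper functions and track their changes, which is the harder problem tools like PyCodeHash aim to solve.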
