Fix modular pipelines breaking when collapsed. #1651

Merged
merged 3 commits into from
Nov 22, 2023

Conversation

rashidakanchwala
Contributor

@rashidakanchwala rashidakanchwala commented Nov 20, 2023

Description

Resolves #1105

Development notes

Modular pipelines rendered incorrectly when collapsed. The cause was flawed logic for deciding which datasets count as a modular pipeline's inputs and outputs, which led to datasets being handled incorrectly. The corrected rules are:

Inputs of a Modular Pipeline:

  • An input can be external or internal to the modular pipeline, provided it is not also an internal output.
    If an internal input is also an internal output, it is produced and consumed inside the modular pipeline, that is, it belongs to a sub-pipeline. It therefore stays hidden in the collapsed view and is not displayed as an input.

Outputs of a Modular Pipeline:

  • An output is typically an external dataset that other pipelines use.
  • In Kedro, an external output must be declared explicitly as an output of the modular pipeline, so its name carries no namespace prefix.
  • Internal outputs that are also internal inputs are treated as components of the modular pipeline. They are hidden as outputs in the collapsed view and become visible when the pipeline is expanded.

These rules give a more accurate representation of modular pipelines on the flowchart, particularly when they are collapsed.
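
The rules above can be sketched in a few lines of Python. This is only an illustration, not the Kedro-Viz implementation: `collapsed_io`, the node dictionaries, and the prefix check used to decide whether a dataset is "internal" are assumptions made for the example. It relies on Kedro prefixing dataset names with the namespace unless they are mapped in `inputs` or `outputs`, and the sample data mirrors the first edge case in the QA notes.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical node shape for this sketch: {"inputs": {...}, "outputs": {...}}
NodeIO = Dict[str, Set[str]]


def collapsed_io(nodes: Iterable[NodeIO], namespace: str) -> Tuple[Set[str], Set[str]]:
    """Return the (inputs, outputs) a collapsed modular pipeline would display."""
    nodes = list(nodes)
    consumed = set().union(*(n["inputs"] for n in nodes))
    produced = set().union(*(n["outputs"] for n in nodes))

    # Assumption: a dataset is internal when its name carries the namespace prefix.
    prefix = namespace + "."
    internal_outputs = {d for d in produced if d.startswith(prefix)}
    external_outputs = produced - internal_outputs

    # An input is anything consumed that is not also produced inside the pipeline.
    inputs = consumed - produced
    # An output is an explicit (un-namespaced) output, or an internal output
    # that no node inside the pipeline consumes.
    outputs = external_outputs | (internal_outputs - consumed)
    return inputs, outputs


# Nodes of "main_pipeline" with outputs={"dataset_out", "dataset_3"}.
STEPS = [
    {"inputs": {"main_pipeline.dataset_in"}, "outputs": {"main_pipeline.dataset_1"}},
    {"inputs": {"main_pipeline.dataset_1"}, "outputs": {"main_pipeline.dataset_2"}},
    {"inputs": {"main_pipeline.dataset_2"}, "outputs": {"dataset_3"}},
    {"inputs": {"dataset_3"}, "outputs": {"dataset_out"}},
]

inputs, outputs = collapsed_io(STEPS, "main_pipeline")
print(sorted(inputs))
print(sorted(outputs))
```

In this sketch `dataset_3` ends up only among the outputs, even though `step4` consumes it, because it is produced inside the pipeline; `dataset_1` and `dataset_2` appear on neither side.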

QA notes

Tested this solution on the following four edge cases:

  1. When a modular pipeline (external) output is used as an input to another pipeline and as an (internal) input to another function of the same modular pipeline.

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"),
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_3"},
    )
    return new_pipeline

Before

Screenshot 2023-11-21 at 11 38 09

After

Screenshot 2023-11-21 at 11 32 48

  2. When a nested modular pipeline output is used as an input to the outer modular pipeline and also as an input to another external modular pipeline.

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:

    sub_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
        ],
        inputs={"dataset_1"},
        outputs={"dataset_3"},
        namespace="sub_pipeline"
    )
    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            sub_pipeline,
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_1_2",
                 name="step1_2"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_4",
                 name="step4"),
        ],
        namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_3", "dataset_4"},
    )
    return new_pipeline

Before

Screenshot 2023-11-21 at 11 37 23

After

Screenshot 2023-11-21 at 11 33 46

  3. When an output of a namespaced function (using node namespaces) is an input to another function in the same namespace.

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=lambda dataset_1, dataset_2: (dataset_1, dataset_2),
                inputs=["dataset_1", "dataset_2"],
                outputs="dataset_3",
                name="first_node",
            ),
            node(
                func=lambda dataset_1, dataset_2: (dataset_1, dataset_2),
                inputs=["dataset_3", "dataset_4"],
                outputs="dataset_5",
                name="second_node",
            ),
            node(
                func=lambda dataset_1, dataset_2: (dataset_1, dataset_2),
                inputs=["dataset_5", "dataset_6"],
                outputs="dataset_7", 
                name="third_node",
                namespace="namespace_prefix_1",
            ),
            node(
                func=lambda dataset_1, dataset_2: (dataset_1, dataset_2),
                inputs=["dataset_7", "dataset_8"],
                outputs="dataset_9",
                name="fourth_node",
                namespace="namespace_prefix_1",
            ),
            node(
                func=lambda dataset_1, dataset_2: (dataset_1, dataset_2),
                inputs=["dataset_9", "dataset_10"],
                outputs="dataset_11",
                name="fifth_node",
                namespace="namespace_prefix_1",
            ),
        ]
    )

Before

Screenshot 2023-11-21 at 11 28 40

After

Screenshot 2023-11-21 at 11 29 59

  4. When an output of a nested modular pipeline is an input to another nested modular pipeline.

from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    data_processing_pipeline = pipeline(
        [
            node(
                lambda x: x,
                inputs=["raw_data"],
                outputs="model_inputs",
                name="process_data",
                tags=["split"],
            )
        ],
        namespace="uk.data_processing",
        outputs="model_inputs",
    )
    data_science_pipeline = pipeline(
        [
            node(
                lambda x: x,
                inputs=["model_inputs"],
                outputs="model",
                name="train_model",
                tags=["train"],
            )
        ],
        namespace="uk.data_science",
        inputs="model_inputs",
    )
    return data_processing_pipeline + data_science_pipeline

Before

Screenshot 2023-11-21 at 11 36 14

After

Screenshot 2023-11-21 at 11 35 23
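
In the fourth case the namespaces `uk.data_processing` and `uk.data_science` share the parent `uk`, so `model_inputs` passes between two sibling pipelines nested in the same outer modular pipeline. As a rough sketch of that nesting, assuming each dot in a namespace marks one level, a dotted namespace can be expanded into its chain of enclosing pipelines. The helper name `expand_namespace` is hypothetical and is not taken from the Kedro-Viz code.

```python
from typing import List


def expand_namespace(namespace: str) -> List[str]:
    """Expand a dotted namespace into every enclosing level, outermost first."""
    if not namespace:
        return []
    parts = namespace.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


processing_levels = expand_namespace("uk.data_processing")
science_levels = expand_namespace("uk.data_science")

# Levels that both nested pipelines have in common.
shared_levels = set(processing_levels) & set(science_levels)

print(processing_levels)
print(science_levels)
print(shared_levels)
```

Here the only shared level is `uk`, which is the modular pipeline that contains both nested pipelines when the flowchart is collapsed to the top level.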

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Rashida Kanchwala <[email protected]>
@rashidakanchwala rashidakanchwala marked this pull request as ready for review November 21, 2023 10:17
@rashidakanchwala rashidakanchwala changed the title Fix mod pipeline Fix modular pipelines breaking when collapsed. Nov 21, 2023

@DimedS DimedS left a comment

@rashidakanchwala thank you! LGTM, I'm glad you managed to solve this problem at the Kedro-Viz level.

Member

@idanov idanov left a comment

Super neat and Kedrific!

Contributor

@noklam noklam left a comment

Nice solution :) I am glad the solution turns out to be simpler than we thought initially.

@rashidakanchwala rashidakanchwala merged commit 86243d0 into main Nov 22, 2023
19 checks passed
@rashidakanchwala rashidakanchwala deleted the fix-mod-pipeline branch November 22, 2023 13:54
This was referenced Dec 18, 2023
rashidakanchwala added a commit that referenced this pull request Dec 19, 2023
Release 7.0.0

Major features and improvements

  • Upgrade to React 18. (Migrate to React 18 #1652)
  • Change CLI command to run Kedro-viz to kedro viz run. (Change 'Kedro Viz' to 'Kedro Viz Run' #1671)
  • Add deploy command to the CLI using kedro viz deploy for sharing Kedro-viz on AWS. (AWS focussed CLI implementation for Shareable Viz #1661)
  • Add support for kedro==0.19 and kedro-datasets==2.0. (Fix bug on kedro viz --load-file #1677)
  • Drop support for python=3.7. (Remove support for Python 3.7 #1660)
  • Drop support for kedro==0.17.x. (Drop Kedro 17 #1669)

Bug fixes and other changes

  • Fix modular pipelines breaking when collapsed on the flowchart. (Fix modular pipelines breaking when collapsed. #1651)
  • Display hosted URL in CLI while launching Kedro viz. (Display hosted URL in CLI while launching kedro viz #1644)
  • Fix Kedro-viz display on Jupyter notebooks. (Fix Kedro-viz embedded as an IFrame #1658)
  • Fix zoom issues on the flowchart. (Flowchart doesn't automatically reset the zoom when actions are performed. #1672)
  • Fix bug on kedro-viz run --load-file. (Fix bug on kedro viz --load-file #1677)
  • Fix bug on adding timestamps to shareable-viz. (_#1679)