Implementation of new pipeline interface #665

Closed
wants to merge 38 commits

Conversation

mrchtr
Contributor

@mrchtr mrchtr commented Nov 22, 2023

Draft implementation of the new pipeline interface:

  • modified pipeline.py (introduces the new interface itself)
  • renamed the dataset according to the defined consumes and produces
  • introduced a schema that can be used inside generic components (see the sketch below)

Note: I haven't fixed the tests yet. I added a dummy test just to visualise what the pipeline could look like; I first wanted to check whether this is going in the right direction.
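For illustration, a minimal sketch of what this could look like, based on the snippets discussed further down in this PR (the component name, field name, and type are placeholders):

```Python
import pyarrow as pa

from fondant.pipeline import Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

# The schema maps field names to their types; a pyarrow type is assumed
# here, which is one of the options discussed in the review below.
dataset = pipeline.read(
    name="load_images",
    schema={
        "image": pa.binary(),
    },
)
```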

mrchtr and others added 29 commits November 16, 2023 14:15
Co-authored-by: Philippe Moussalli <[email protected]>
@mrchtr mrchtr requested a review from RobbeSneyders November 22, 2023 16:07
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment


Thanks Matthias! I think this is headed in the right direction :)
Left some comments

schema=schema)

self.add_op(component_op)
return self
Contributor


General question

Do we need to return the class instance? So instead of this:

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    name="load_images",
    schema={
        "image": type.binary  # or pa.binary()
    }
)

We would have:

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

pipeline.read(
    name="load_images",
    schema={
        "image": type.binary  # or pa.binary()
    }
)

This is a bit similar to the way we used to add operations to the pipeline. I think both are valid, but I'm not sure which one is more intuitive.

Contributor Author


The idea behind returning the class instance was to be able to call apply on the dataset afterwards.
Returning the class instance gives the user the most flexibility: they can decide which variant they want to use.

Either:

dataset = pipeline.read(...)
dataset = dataset.apply(...)

or

pipeline.read(...)
dataset.apply(...)

or even:

pipeline.read(...).apply(...).apply(...)

Contributor


ok that makes sense :)

Member


Also, the Dataset class can provide an interface for more interactive data exploration.

cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
consumes: t.Optional[t.Dict[str, str]] = None,
Contributor


I would return an error if both consumes and produces are None.

Contributor Author


I would use the default values of the component specs if they are None.
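A rough sketch of how that fallback could look (hypothetical helper, not code from this PR); the reviewer's alternative of raising an error is noted in the docstring:

```Python
import typing as t


def resolve_consumes(
    consumes: t.Optional[t.Dict[str, str]],
    spec_consumes: t.Dict[str, str],
) -> t.Dict[str, str]:
    """Hypothetical helper: fall back to the component spec defaults when no
    explicit `consumes` mapping is passed.

    Alternatively, as suggested above, an error could be raised when neither
    `consumes` nor `produces` is provided.
    """
    return consumes if consumes is not None else spec_consumes
```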

cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
schema: t.Optional[t.Dict[str, str]] = None) -> "Pipeline":
Contributor


The schema here is mandatory, I suppose.

client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
consumes: t.Optional[t.Dict[str, str]] = None,
schema: t.Optional[t.Dict[str, str]] = None) -> "Pipeline":
Contributor


Same comment as above: is the schema here just needed for renaming, or does it also specify which fields to write?

Contributor Author


The schema should specify which fields, and their corresponding types, will be written.
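For example (field names and types are illustrative only), the write schema could be a mapping like:

```Python
import pyarrow as pa

# Illustrative only: the fields to be written, together with their types.
# Whether the values are pyarrow types or type strings is still part of
# the discussion in this PR.
schema = {
    "image": pa.binary(),
    "caption": pa.string(),
}
```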

Contributor


I would add docstrings to all the read/write/apply functions since they're user-facing.
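A sketch of what such a docstring could look like for read; the signature is pieced together loosely from the snippets in this PR and is not authoritative:

```Python
def read(self, name, *, schema, arguments=None, resources=None):
    """Read data using a load component.

    Args:
        name: Name of the (reusable or custom) component, e.g. "load_images".
        schema: Mapping of field names to the types the component produces.
        arguments: Arguments to pass to the component.
        resources: Compute resources to assign to the component.

    Returns:
        A Dataset that further operations can be applied to.
    """
```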

@mrchtr mrchtr force-pushed the feautre/refactore-component-package branch from 734526b to d2182a0 Compare November 23, 2023 12:13
Base automatically changed from feautre/refactore-component-package to feature/redesign-dataset-format-and-interface November 23, 2023 13:47
mrchtr and others added 3 commits November 24, 2023 08:49
First PR related to the data structure redesign. 

Implements the following: 
- New manifest structure (including validation, and evolution)
- New ComponentSpec structure (including validation)
- Removes `Subsets` and `Index`

Not all tests are running successfully yet, but these are already quite a few
changes. Therefore, I've created a PR on the feature branch
`feature/redesign-dataset-format-and-interface` to have quicker
feedback loops.

---------

Co-authored-by: Robbe Sneyders <[email protected]>
Co-authored-by: Philippe Moussalli <[email protected]>
Refactor component package as part of #643

---------

Co-authored-by: Robbe Sneyders <[email protected]>
Co-authored-by: Philippe Moussalli <[email protected]>
This PR applies the new data format:

- fixes all tests
- updates component specifications and component code
- removes subset field usage in `pipeline.py`

---------

Co-authored-by: Robbe Sneyders <[email protected]>
@RobbeSneyders RobbeSneyders force-pushed the feature/redesign-dataset-format-and-interface branch from 9f057ad to e4eadf3 Compare November 24, 2023 07:50
@mrchtr mrchtr linked an issue Nov 24, 2023 that may be closed by this pull request
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
schema: t.Dict[str, str],
Contributor

@PhilippeMoussalli PhilippeMoussalli Nov 24, 2023


Shouldn't this schema be Dict[str, Type]? I would add an example of what this would look like in the docstring so users know what to input, mainly because it's different from the write schema (which would also benefit from examples in its description). Could you also move it to the top, since it's mandatory?
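Something along the lines of the example being asked for here, assuming pyarrow types (i.e. Dict[str, pa.DataType]); field names are illustrative:

```Python
import pyarrow as pa

# Illustrative read schema: field name -> pyarrow type of the field the
# read component produces.
schema = {
    "image": pa.binary(),
    "width": pa.int32(),
}
```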

cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
consumes: t.Optional[t.Dict[str, str]] = None,
Contributor


Does the write require a consumes?

Base automatically changed from feature/redesign-dataset-format-and-interface to main November 27, 2023 09:34
RobbeSneyders added a commit that referenced this pull request Nov 28, 2023
This PR is the first one of multiple PRs to replace #665. This PR only
focuses on implementing the new pipeline interface, without adding any
new functionality.

The new interface applies operations to intermediate datasets instead of
adding operations to a pipeline, as shown below. It's a superficial
change, since only the interface is changed. All underlying behavior is
still the same.

The new interface fits nicely with our data format design and we'll be
able to leverage it for interactive development in the future. We can
calculate the schema for each intermediate dataset so the user can
inspect it. Or with eager execution, we could execute a single operation
and allow the user to explore the data using the dataset.

I still need to update the README generation, but I'll do that as a
separate PR. It becomes a bit more complex since we now need to
discriminate between read, transform, and write components to generate
the example code.

**Old interface**
```Python
from fondant.pipeline import ComponentOp, Pipeline


pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

load_op = ComponentOp(
    component_dir="load_data",
    arguments={...},
)

caption_op = ComponentOp.from_registry(
    name="caption_images",
    arguments={...},
)

embed_op = ComponentOp(
    component_dir="embed_text",
    arguments={...},
)

write_op = ComponentOp.from_registry(
    name="write_to_hf_hub",
    arguments={...},
)

pipeline.add_op(load_op)
pipeline.add_op(caption_op, dependencies=[load_op])
pipeline.add_op(embed_op, dependencies=[caption_op])
pipeline.add_op(write_op, dependencies=[embed_op])
```

**New interface**
```Python
pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    "load_data",
    arguments={...},
)
dataset = dataset.apply(
    "caption_images",
    arguments={...},
)
dataset = dataset.apply(
    "embed_text",
    arguments={...},
)
dataset.write(
    "write_to_hf_hub",
    arguments={...},
)
```
@RobbeSneyders
Member

Closing in favor of #685

RobbeSneyders added a commit that referenced this pull request Dec 7, 2023

RobbeSneyders added a commit that referenced this pull request Dec 7, 2023
@RobbeSneyders RobbeSneyders deleted the feature/implement-new-pipeline-interface branch January 11, 2024 09:09
Successfully merging this pull request may close these issues.

Implement new pipeline interface