
Move to datasets & apply interface #683

Merged

Conversation

@RobbeSneyders (Member) commented Nov 28, 2023

This PR is the first of several to replace #665. It focuses only on implementing the new pipeline interface, without adding any new functionality.

The new interface applies operations to intermediate datasets instead of adding operations to a pipeline, as shown below. It's a superficial change: only the interface changes, while all underlying behavior stays the same.

The new interface fits nicely with our data format design and we'll be able to leverage it for interactive development in the future. We can calculate the schema for each intermediate dataset so the user can inspect it. Or with eager execution, we could execute a single operation and allow the user to explore the data using the dataset.
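As a purely hypothetical illustration of that direction, interactive inspection could eventually look something like the sketch below; the `fields` attribute and `head()` method are assumptions, and none of this exists in this PR:

```Python
# Hypothetical future usage, not part of this PR:
dataset = pipeline.read("load_data", arguments={...})
print(dataset.fields)  # assumed attribute: inspect the calculated schema
df = dataset.head()    # assumed method: eagerly execute and explore the data
```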

I still need to update the README generation, but I'll do that as a separate PR. It becomes a bit more complex since we now need to discriminate between read, transform, and write components to generate the example code.

**Old interface**

```Python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

load_op = ComponentOp(
    component_dir="load_data",
    arguments={...},
)

caption_op = ComponentOp.from_registry(
    name="caption_images",
    arguments={...},
)

embed_op = ComponentOp(
    component_dir="embed_text",
    arguments={...},
)

write_op = ComponentOp.from_registry(
    name="write_to_hf_hub",
    arguments={...},
)

pipeline.add_op(load_op)
pipeline.add_op(caption_op, dependencies=[load_op])
pipeline.add_op(embed_op, dependencies=[caption_op])
pipeline.add_op(write_op, dependencies=[embed_op])
```

**New interface**

```Python
pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    "load_data",
    arguments={...},
)
dataset = dataset.apply(
    "caption_images",
    arguments={...},
)
dataset = dataset.apply(
    "embed_text",
    arguments={...},
)
dataset.write(
    "write_to_hf_hub",
    arguments={...},
)
```

Comment on lines 324 to 328:

```Python
# TODO: make dataset names unique so the same operation can be applied multiple times in
# a single pipeline
# input_name = "-".join([dataset.name for dataset in datasets])  # noqa: ERA001
# input_hash = abs(hash(input_name))  # noqa: ERA001
# output_name = f"{input_hash}_{operation.name}"  # noqa: ERA001
```
@RobbeSneyders (Member Author):
I still want to include this, but it requires some larger changes to the compiler tests, since we can no longer use `pipeline_configs.component_configs[component_name]` to check the configuration for a specific operation.

@RobbeSneyders (Member Author):
I looked into this, but I think we need to tackle this more thoroughly. For now I reverted the Fondant graph to not include the datasets, and just track dependencies between operations as we did before.
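For reference, a minimal sketch of the naming scheme the TODO describes, assuming `datasets` and `operation` are in scope:

```Python
# Derive a unique output dataset name from the input dataset names and the
# operation name, so the same operation can appear more than once in a pipeline.
# Note: Python's built-in hash() is salted per process for strings, so a stable
# digest (e.g. hashlib) would be needed for names reproducible across runs.
input_name = "-".join(dataset.name for dataset in datasets)
input_hash = abs(hash(input_name))
output_name = f"{input_hash}_{operation.name}"
```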

@PhilippeMoussalli (Contributor) left a comment:
Thanks Robbe! Left a few small comments.

```Python
resources: t.Optional[Resources] = None,
cache: t.Optional[bool] = True,
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
```
@PhilippeMoussalli (Contributor):
Can we add the docstrings back here and to all other user-facing methods?

@RobbeSneyders (Member Author):
Already started doing this locally 👍 Will push soon.
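For illustration, a docstring for these parameters might look like the sketch below; the surrounding signature (the `ref` and `arguments` parameters) and the exact wording are assumptions, not what was actually pushed:

```Python
import typing as t

class Dataset:
    def apply(
        self,
        ref: str,  # assumed parameter name for the component to apply
        *,
        arguments: t.Optional[dict] = None,
        resources: t.Optional["Resources"] = None,
        cache: t.Optional[bool] = True,
        cluster_type: t.Optional[str] = "default",
        client_kwargs: t.Optional[dict] = None,
    ) -> "Dataset":
        """Apply a component to this dataset and return the resulting dataset.

        Args:
            ref: Name or directory of the component to apply.
            arguments: Arguments to pass to the component.
            resources: Compute resources to assign to the underlying operation.
            cache: Set to False to disable caching for this operation.
            cluster_type: Type of Dask cluster to run the component on.
            client_kwargs: Keyword arguments passed to the Dask client.
        """
```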

```Python
}
return Dataset(pipeline=self, operation=operation)

def read(
```
@PhilippeMoussalli (Contributor):
Aren't we missing the schema here? Same for the write function.

@RobbeSneyders (Member Author):
I left this out for now; I'll add it in a separate PR. This PR just changes the interface, but everything still works like before. At this stage, you still need to overwrite the component spec for generic components.

@PhilippeMoussalli (Contributor):
Makes sense. I can imagine it will require more fundamental changes, so it's indeed best to leave it for a separate PR.
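Purely as an illustration of what a schema argument for generic components could look like once it lands (the `produces` parameter name and the pyarrow types here are assumptions, not part of this PR):

```Python
import pyarrow as pa

# Hypothetical future call: declare the schema a generic read component
# produces, instead of overwriting its component spec by hand.
dataset = pipeline.read(
    "load_data",
    arguments={...},
    produces={            # assumed parameter name
        "image": pa.binary(),
        "text": pa.string(),
    },
)
```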

@RobbeSneyders force-pushed the feature/pipeline-interface branch from 5bb1725 to d43a474 on November 28, 2023, 17:36
@RobbeSneyders changed the base branch from main to feature/dataset-apply-interface on November 28, 2023, 17:47
@RobbeSneyders merged commit 1048271 into feature/dataset-apply-interface on November 28, 2023 (4 checks passed)
@RobbeSneyders deleted the feature/pipeline-interface branch on November 28, 2023, 17:47
RobbeSneyders added a commit that referenced this pull request on Dec 7, 2023
RobbeSneyders added a commit that referenced this pull request on Dec 7, 2023