Move to datasets & apply interface #683
Conversation
Force-pushed from 58b0363 to 6fdbcd5.
src/fondant/pipeline/pipeline.py (Outdated)
# TODO: make dataset names unique so the same operation can be applied multiple times in
# a single pipeline
# input_name = "-".join([dataset.name for dataset in datasets])  # noqa: ERA001
# input_hash = abs(hash(input_name))  # noqa: ERA001
# output_name = f"{input_hash}_{operation.name}"  # noqa: ERA001
I still want to include this, but it requires some larger changes to the compiler tests, since we can no longer use `pipeline_configs.component_configs[component_name]` to check the configuration for a specific operation.
I looked into this, but I think we need to tackle it more thoroughly. For now, I reverted the Fondant graph so it no longer includes the datasets and just tracks dependencies between operations, as we did before.
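For reference, here is a minimal sketch of the unique-naming idea described in the commented-out TODO above (hash the joined input dataset names and append the operation name). The `make_output_name` helper is a hypothetical illustration, not something this PR adds:

```Python
# Hypothetical helper illustrating the TODO above; not part of this PR.
def make_output_name(datasets, operation_name: str) -> str:
    # Hash the joined input dataset names so the same operation can be applied
    # multiple times in a single pipeline without name collisions.
    input_name = "-".join(dataset.name for dataset in datasets)
    input_hash = abs(hash(input_name))
    return f"{input_hash}_{operation_name}"
```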
Thanks Robbe! Left a few small comments
resources: t.Optional[Resources] = None,
cache: t.Optional[bool] = True,
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
Can we add the docstrings back here and to all other user-facing methods?
Already started doing this locally 👍 Will push soon.
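For illustration, a hedged sketch of what such a docstring could look like, assuming these parameters live on the user-facing `apply()` method; only the four parameter names and defaults come from the diff above, the signature and wording are assumptions:

```Python
import typing as t

# Hypothetical signature and docstring; only resources, cache, cluster_type and
# client_kwargs are taken from the diff above, everything else is assumed.
def apply(
    self,
    ref,
    *,
    arguments: t.Optional[dict] = None,
    resources: t.Optional["Resources"] = None,
    cache: t.Optional[bool] = True,
    cluster_type: t.Optional[str] = "default",
    client_kwargs: t.Optional[dict] = None,
) -> "Dataset":
    """Apply a component to this dataset and return the resulting dataset.

    Args:
        ref: Name of a registry component or path to a local component directory.
        arguments: Arguments to pass to the component.
        resources: Compute resources to assign to the component.
        cache: Whether to cache the executed component. Defaults to True.
        cluster_type: Type of cluster to run the component on. Defaults to "default".
        client_kwargs: Keyword arguments passed to the cluster client.
    """
```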
}
return Dataset(pipeline=self, operation=operation)

def read(
Aren't we missing the schema here? Same for the `write` function.
I left this out for now, will add this in a separate PR. This just changes the interface, but everything still works like before. At this stage, you still need to overwrite the component spec for generic components.
Makes sense. I can imagine it will require more fundamental changes, so it might indeed be best to leave it to a separate PR.
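Purely as an illustration of the direction discussed here, a later PR could let the schema be passed directly when reading or writing. The `schema` parameter below is hypothetical and does not exist in this PR, where generic components still require overwriting the component spec:

```Python
# Hypothetical future usage, not part of this PR.
dataset = pipeline.read(
    "load_data",
    arguments={...},
    # schema={"image": "binary", "caption": "string"},  # hypothetical parameter
)
```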
Force-pushed from 5bb1725 to d43a474.
Merged commit 1048271 into feature/dataset-apply-interface.
This PR is the first one of multiple PRs to replace #665. This PR only focuses on implementing the new pipeline interface, without adding any new functionality.

The new interface applies operations to intermediate datasets instead of adding operations to a pipeline, as shown below. It's a superficial change, since only the interface is changed. All underlying behavior is still the same.

The new interface fits nicely with our data format design and we'll be able to leverage it for interactive development in the future. We can calculate the schema for each intermediate dataset so the user can inspect it. Or with eager execution, we could execute a single operation and allow the user to explore the data using the dataset.

I still need to update the README generation, but I'll do that as a separate PR. It becomes a bit more complex since we now need to discriminate between read, transform, and write components to generate the example code.

**Old interface**

```Python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

load_op = ComponentOp(
    component_dir="load_data",
    arguments={...},
)
caption_op = ComponentOp.from_registry(
    name="caption_images",
    arguments={...},
)
embed_op = ComponentOp(
    component_dir="embed_text",
    arguments={...},
)
write_op = ComponentOp.from_registry(
    name="write_to_hf_hub",
    arguments={...},
)

pipeline.add_op(load_op)
pipeline.add_op(caption_op, dependencies=[load_op])
pipeline.add_op(embed_op, dependencies=[caption_op])
pipeline.add_op(write_op, dependencies=[embed_op])
```

**New interface**

```Python
pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    "load_data",
    arguments={...},
)
dataset = dataset.apply(
    "caption_images",
    arguments={...},
)
dataset = dataset.apply(
    "embed_text",
    arguments={...},
)
dataset.write(
    "write_to_hf_hub",
    arguments={...},
)
```
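To make the "superficial change" concrete, here is a minimal sketch of how the new interface can wrap the existing behavior: each `apply()` registers an operation on the pipeline, much like `add_op` with explicit dependencies did, and returns a `Dataset` that remembers the operation it came from. Apart from the `Dataset(pipeline=..., operation=...)` constructor visible in the diff, all names below (`_build_operation`, `_register`) are assumptions, not the actual implementation:

```Python
# Minimal sketch, not the actual implementation; _build_operation and _register
# are hypothetical helpers standing in for what add_op(...) did before.
class Dataset:
    def __init__(self, *, pipeline, operation):
        self.pipeline = pipeline
        self.operation = operation

    def apply(self, ref, *, arguments=None, **kwargs) -> "Dataset":
        # Register a new operation with this dataset's operation as its dependency,
        # then return a new Dataset pointing at that operation.
        operation = self.pipeline._build_operation(ref, arguments=arguments, **kwargs)
        self.pipeline._register(operation, dependencies=[self.operation])
        return Dataset(pipeline=self.pipeline, operation=operation)

    def write(self, ref, *, arguments=None, **kwargs) -> None:
        # Same as apply(), but a write is a sink, so nothing is returned.
        operation = self.pipeline._build_operation(ref, arguments=arguments, **kwargs)
        self.pipeline._register(operation, dependencies=[self.operation])
```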