Implementation of new pipeline interface #665
Conversation
Thanks Matthias! I think this is headed in the right direction :)
Left some comments
        schema=schema)

        self.add_op(component_op)
        return self
General question

Do we need to return the class instance? So instead of this:

```Python
pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    name="load_images",
    schema={
        "image": type.binary  # or pa.binary()
    },
)
```

we would have:

```Python
pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

pipeline.read(
    name="load_images",
    schema={
        "image": type.binary  # or pa.binary()
    },
)
```

A bit similar to the way we used to add operations to the pipeline. I think both are valid, but I'm not sure which one is more intuitive.
The idea behind returning the class instance was to be able to call `apply` on a dataset afterwards.

Returning the class instance gives the user the most flexibility. The user can decide which variant they want to implement. Either:

```Python
dataset = pipeline.read(...)
dataset = dataset.apply(...)
```

or

```Python
pipeline.read(...)
dataset.apply(...)
```

or even:

```Python
pipeline.read(...).apply(...).apply(...)
```
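For illustration, a minimal sketch of how `read`/`apply` could return dataset objects to enable this chaining. The class and method bodies below are hypothetical, not the actual implementation:

```Python
import typing as t


class Dataset:
    """Hypothetical wrapper tying a pipeline to the operation that produced this dataset."""

    def __init__(self, pipeline: "Pipeline", last_op) -> None:
        self.pipeline = pipeline
        self.last_op = last_op

    def apply(self, component_op) -> "Dataset":
        # Register the new operation with the previous one as its dependency,
        # then return a new Dataset so calls can be chained.
        self.pipeline.add_op(component_op, dependencies=[self.last_op])
        return Dataset(self.pipeline, component_op)


class Pipeline:
    def __init__(self) -> None:
        self.ops: t.List[tuple] = []

    def add_op(self, component_op, dependencies=None) -> None:
        self.ops.append((component_op, dependencies or []))

    def read(self, component_op) -> Dataset:
        # Reading returns a Dataset, so the user can keep a reference or chain directly.
        self.add_op(component_op)
        return Dataset(self, component_op)
```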
ok that makes sense :)
Also, the `Dataset` class can provide an interface for more interactive data exploration.
src/fondant/pipeline/pipeline.py
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
consumes: t.Optional[t.Dict[str, str]] = None,
I would return an error if both `consumes` and `produces` are None.
I would use the default values of the component specs if they are None.
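As a sketch of what that could look like (the helper name and spec attribute are hypothetical, not the actual API):

```Python
def resolve_consumes(consumes, component_spec):
    # Hypothetical helper: fall back to the component spec's default mapping
    # when the user doesn't pass `consumes` explicitly, and fail early if
    # neither the caller nor the spec defines anything.
    if consumes is not None:
        return consumes
    spec_consumes = getattr(component_spec, "consumes", None)
    if spec_consumes is None:
        msg = "Either `consumes` must be passed or the component spec must define it."
        raise ValueError(msg)
    return spec_consumes
```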
src/fondant/pipeline/pipeline.py
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
schema: t.Optional[t.Dict[str, str]] = None) -> "Pipeline":
The schema here is mandatory I suppose
src/fondant/pipeline/pipeline.py
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
consumes: t.Optional[t.Dict[str, str]] = None,
schema: t.Optional[t.Dict[str, str]] = None) -> "Pipeline":
Same comment as above, is the schema here just needed for renaming or does it also specify which fields to write?
The schema should specify which fields, and their corresponding types, will be written.
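For example (hypothetical field names, using pyarrow types):

```Python
import pyarrow as pa

# Hypothetical write schema: field name -> type, listing exactly which fields
# (and with which types) the write component should persist.
write_schema = {
    "image": pa.binary(),
    "caption": pa.string(),
}
```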
I would add docstrings to all the read/write/apply functions since they're user facing.
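A possible shape for such a docstring (the signature and wording are illustrative only, not the actual method):

```Python
class Pipeline:
    def read(self, name, *, arguments=None, schema=None) -> "Dataset":
        """Read data using a load component and return the resulting dataset.

        Args:
            name: Name of a reusable component from the registry, or the path to
                a custom component directory.
            arguments: Arguments to pass to the component.
            schema: Mapping of produced field names to types,
                e.g. {"image": pa.binary()}.
        """
```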
First PR related to the data structure redesign. Implements the following:
- New manifest structure (including validation and evolution)
- New ComponentSpec structure (including validation)
- Removes `Subsets` and `Index`

Not all tests are running successfully, but these are already quite a few changes. Therefore, I've created this PR on the feature branch `feature/redesign-dataset-format-and-interface` to have quicker feedback loops.

---------

Co-authored-by: Robbe Sneyders <[email protected]>
Co-authored-by: Philippe Moussalli <[email protected]>
Refactor component package as part of #643

---------

Co-authored-by: Robbe Sneyders <[email protected]>
Co-authored-by: Philippe Moussalli <[email protected]>
This PR applies the usage of the new data format:
- fixes all tests
- updates component specifications and component code
- removes subset field usage in `pipeline.py`

---------

Co-authored-by: Robbe Sneyders <[email protected]>
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
schema: t.Dict[str, str],
Shouldn't this schema be `Dict[str, Type]`? I would add an example of how this would look in the docstring so users know what to input, mainly because it's different from the `write` schema (which would also benefit from having examples in its description). Could you also move it to the top, since it's mandatory?
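For instance, a read schema could look like this (hypothetical field names, using pyarrow types as the values):

```Python
import pyarrow as pa

# Hypothetical read schema: field name -> pyarrow type describing the data
# the load component produces.
read_schema = {
    "image": pa.binary(),
    "width": pa.int32(),
    "height": pa.int32(),
}
```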
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
resources: t.Optional[Resources] = None,
consumes: t.Optional[t.Dict[str, str]] = None,
Does the `write` require a `consumes`?
This PR is the first one of multiple PRs to replace #665. This PR only focuses on implementing the new pipeline interface, without adding any new functionality. The new interface applies operations to intermediate datasets instead of adding operations to a pipeline, as shown below. It's a superficial change, since only the interface is changed. All underlying behavior is still the same.

The new interface fits nicely with our data format design and we'll be able to leverage it for interactive development in the future. We can calculate the schema for each intermediate dataset so the user can inspect it. Or with eager execution, we could execute a single operation and allow the user to explore the data using the dataset.

I still need to update the README generation, but I'll do that as a separate PR. It becomes a bit more complex since we now need to discriminate between read, transform, and write components to generate the example code.

**Old interface**

```Python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

load_op = ComponentOp(
    component_dir="load_data",
    arguments={...},
)
caption_op = ComponentOp.from_registry(
    name="caption_images",
    arguments={...},
)
embed_op = ComponentOp(
    component_dir="embed_text",
    arguments={...},
)
write_op = ComponentOp.from_registry(
    name="write_to_hf_hub",
    arguments={...},
)

pipeline.add_op(load_op)
pipeline.add_op(caption_op, dependencies=[load_op])
pipeline.add_op(embed_op, dependencies=[caption_op])
pipeline.add_op(write_op, dependencies=[embed_op])
```

**New interface**

```Python
pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    "load_data",
    arguments={...},
)
dataset = dataset.apply(
    "caption_images",
    arguments={...},
)
dataset = dataset.apply(
    "embed_text",
    arguments={...},
)
dataset.write(
    "write_to_hf_hub",
    arguments={...},
)
```
Closing in favor of #685
Draft implementation of the new pipeline interface:
- `pipeline.py` (introduce the new interface itself)
- `consumes` and `produces`

Note: Didn't fix the tests yet. Added a dummy test just to visualise what the pipeline could look like. First wanted to check if this is going in the right direction.