
Move to datasets & apply interface #683

Merged

Conversation

@RobbeSneyders (Member) commented Nov 28, 2023

This PR is the first of several to replace #665. It focuses only on implementing the new pipeline interface, without adding any new functionality.

The new interface applies operations to intermediate datasets instead of adding operations to a pipeline, as shown below. It's a superficial change: only the interface changes, while all underlying behavior stays the same.

The new interface fits nicely with our data format design and we'll be able to leverage it for interactive development in the future. We can calculate the schema for each intermediate dataset so the user can inspect it. Or with eager execution, we could execute a single operation and allow the user to explore the data using the dataset.
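As a purely hypothetical illustration of that direction, interactive inspection could eventually look something like the sketch below; the `fields` attribute and `head()` method are assumptions, and none of this exists in this PR:

```Python
# Hypothetical future usage, not part of this PR:
dataset = pipeline.read("load_data", arguments={...})
print(dataset.fields)  # assumed attribute: inspect the calculated schema
df = dataset.head()    # assumed method: eagerly execute and explore the data
```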

I still need to update the README generation, but I'll do that as a separate PR. It becomes a bit more complex since we now need to discriminate between read, transform, and write components to generate the example code.

**Old interface**

```Python
from fondant.pipeline import ComponentOp, Pipeline

pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

load_op = ComponentOp(
    component_dir="load_data",
    arguments={...},
)

caption_op = ComponentOp.from_registry(
    name="caption_images",
    arguments={...},
)

embed_op = ComponentOp(
    component_dir="embed_text",
    arguments={...},
)

write_op = ComponentOp.from_registry(
    name="write_to_hf_hub",
    arguments={...},
)

pipeline.add_op(load_op)
pipeline.add_op(caption_op, dependencies=[load_op])
pipeline.add_op(embed_op, dependencies=[caption_op])
pipeline.add_op(write_op, dependencies=[embed_op])
```

**New interface**

```Python
pipeline = Pipeline(
    pipeline_name="my_pipeline",
    pipeline_description="description of my pipeline",
    base_path="/foo/bar",
)

dataset = pipeline.read(
    "load_data",
    arguments={...},
)
dataset = dataset.apply(
    "caption_images",
    arguments={...},
)
dataset = dataset.apply(
    "embed_text",
    arguments={...},
)
dataset.write(
    "write_to_hf_hub",
    arguments={...},
)
```

Comment on lines 324 to 328:

```Python
# TODO: make dataset names unique so the same operation can be applied multiple times in
# a single pipeline
# input_name = "-".join([dataset.name for dataset in datasets])  # noqa: ERA001
# input_hash = abs(hash(input_name))  # noqa: ERA001
# output_name = f"{input_hash}_{operation.name}"  # noqa: ERA001
```
@RobbeSneyders (Member Author):
I still want to include this, but it requires some larger changes to the compiler tests, since we can no longer use `pipeline_configs.component_configs[component_name]` to check the configuration for a specific operation.

@RobbeSneyders (Member Author):
I looked into this, but I think we need to tackle this more thoroughly. For now I reverted the Fondant graph to not include the datasets, and just track dependencies between operations as we did before.
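For reference, a minimal sketch of the naming scheme the TODO describes, assuming `datasets` and `operation` are in scope:

```Python
# Derive a unique output dataset name from the input dataset names and the
# operation name, so the same operation can appear more than once in a pipeline.
# Note: Python's built-in hash() is salted per process for strings, so a stable
# digest (e.g. hashlib) would be needed for names reproducible across runs.
input_name = "-".join(dataset.name for dataset in datasets)
input_hash = abs(hash(input_name))
output_name = f"{input_hash}_{operation.name}"
```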

@PhilippeMoussalli (Contributor) left a comment:
Thanks Robbe! Left a few small comments.

```Python
resources: t.Optional[Resources] = None,
cache: t.Optional[bool] = True,
cluster_type: t.Optional[str] = "default",
client_kwargs: t.Optional[dict] = None,
```
@PhilippeMoussalli (Contributor):
Can we add the docstrings back here and to all other user-facing methods?

@RobbeSneyders (Member Author):
Already started doing this locally 👍 Will push soon.
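For illustration, a docstring for these parameters might look like the sketch below; the surrounding signature (the `ref` and `arguments` parameters) and the exact wording are assumptions, not what was actually pushed:

```Python
import typing as t

class Dataset:
    def apply(
        self,
        ref: str,  # assumed parameter name for the component to apply
        *,
        arguments: t.Optional[dict] = None,
        resources: t.Optional["Resources"] = None,
        cache: t.Optional[bool] = True,
        cluster_type: t.Optional[str] = "default",
        client_kwargs: t.Optional[dict] = None,
    ) -> "Dataset":
        """Apply a component to this dataset and return the resulting dataset.

        Args:
            ref: Name or directory of the component to apply.
            arguments: Arguments to pass to the component.
            resources: Compute resources to assign to the underlying operation.
            cache: Set to False to disable caching for this operation.
            cluster_type: Type of Dask cluster to run the component on.
            client_kwargs: Keyword arguments passed to the Dask client.
        """
```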

```Python
}
return Dataset(pipeline=self, operation=operation)

def read(
```
@PhilippeMoussalli (Contributor):
Aren't we missing the schema here? Same for the write function.

@RobbeSneyders (Member Author):
I left this out for now; I'll add it in a separate PR. This PR just changes the interface, but everything still works like before. At this stage, you still need to overwrite the component spec for generic components.

@PhilippeMoussalli (Contributor):
Makes sense. I can imagine it will require more fundamental changes, so it's indeed best to leave it for a separate PR.
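Purely as an illustration of what a schema argument for generic components could look like once it lands (the `produces` parameter name and the pyarrow types here are assumptions, not part of this PR):

```Python
import pyarrow as pa

# Hypothetical future call: declare the schema a generic read component
# produces, instead of overwriting its component spec by hand.
dataset = pipeline.read(
    "load_data",
    arguments={...},
    produces={            # assumed parameter name
        "image": pa.binary(),
        "text": pa.string(),
    },
)
```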

@RobbeSneyders force-pushed the feature/pipeline-interface branch from 5bb1725 to d43a474 on November 28, 2023, 17:36
@RobbeSneyders changed the base branch from main to feature/dataset-apply-interface on November 28, 2023, 17:47
@RobbeSneyders merged commit 1048271 into feature/dataset-apply-interface on November 28, 2023 (4 checks passed)
@RobbeSneyders deleted the feature/pipeline-interface branch on November 28, 2023, 17:47
RobbeSneyders added a commit that referenced this pull request on Dec 7, 2023
RobbeSneyders added a commit that referenced this pull request on Dec 7, 2023