Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic read write component #214

Merged
merged 48 commits into from
Jun 21, 2023
Merged

Conversation

PhilippeMoussalli
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli commented Jun 15, 2023

Draft PR for implementing generic read/write components.

Note: the changes from feature/local-starcoder were merged to this branch to make testing easier. Merge this PR after the local startcoder PR.

Things to do:

  • Modify from_registry method to enable passing path to custom spec
  • Figure out how to present spec template to user. Should we keep the image/args and just add a todo in the produces and consumes (open for suggestions)
  • Switch over all pipelines to use the generic load/write and remove the custom ones
  • Add components to affected pipelines to add missing metadata (e.g. width/height)
  • Add documentation

Base automatically changed from enable-optional-component-arguments to main June 15, 2023 14:37
@PhilippeMoussalli PhilippeMoussalli changed the title Generic read write component [Draft] Generic read write component Jun 15, 2023
Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PhilippeMoussalli. This should make it a lot easier for users to build a pipeline without having to build any custom component!


args:
dataset_name:
description: Name of dataset on the hub
type: str
column_name_mapping:
description: column to map the original column names of the input dataset to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description: column to map the original column names of the input dataset to
description: Mapping of the read hub column names to the produced fondant column names.

components/load_from_hf_hub/fondant_component.yaml Outdated Show resolved Hide resolved
Comment on lines -56 to -61
dask_df["images_width"] = dask_df["images_data"].map(
extract_width, meta=("images_width", int)
)
dask_df["images_height"] = dask_df["images_data"].map(
extract_height, meta=("images_height", int)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we could still do for image columns right? Although then it needs to match the provided component spec as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure about that one, it might be that the original dataset has this metadata and we're assuming that the user requires it.

I was thinking about implementing another component that generates image metadata to go around this, it can be based on a conditional arguments (e.g. estimate_width: True/False, ..).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, downside is that it requires loading the image data into the component. We might be able to do this by only loading the first x bytes, but not sure how this works with different image formats and if this can be done performant with parquet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found this library: https://github.com/shibukawa/imagesize_py
Tested it out and it seems to work pretty well

components/write_to_hf_hub/fondant_component.yaml Outdated Show resolved Hide resolved
components/write_to_hf_hub/fondant_component.yaml Outdated Show resolved Hide resolved
@PhilippeMoussalli PhilippeMoussalli changed the base branch from main to feature/local-starcoder June 19, 2023 15:52
@PhilippeMoussalli PhilippeMoussalli changed the title [Draft] Generic read write component [WIP] Generic read write component Jun 19, 2023
Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @PhilippeMoussalli.

Let's use the PandasTransformComponent for all new transform components from now on. And some more smaller comments 🙂

components/image_resolution_extraction/src/main.py Outdated Show resolved Hide resolved
components/image_resolution_extraction/src/main.py Outdated Show resolved Hide resolved
components/image_resolution_extraction/src/main.py Outdated Show resolved Hide resolved
components/image_resolution_extraction/src/main.py Outdated Show resolved Hide resolved
components/load_from_hf_hub/fondant_component.yaml Outdated Show resolved Hide resolved
examples/pipelines/finetune_stable_diffusion/pipeline.py Outdated Show resolved Hide resolved
examples/pipelines/starcoder/pipeline.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be in Georges' PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as a suggestion

@RobbeSneyders RobbeSneyders dismissed their stale review June 20, 2023 08:46

Re-reviewed new changes

@RobbeSneyders RobbeSneyders linked an issue Jun 20, 2023 that may be closed by this pull request
@PhilippeMoussalli PhilippeMoussalli changed the title [WIP] Generic read write component Generic read write component Jun 20, 2023
@GeorgesLorre GeorgesLorre force-pushed the feature/local-starcoder branch from 6dd333e to 30db0d8 Compare June 20, 2023 12:45
@@ -0,0 +1,52 @@
"""This component filters images of the dataset based on image size (minimum height and width)."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring to be updated

Comment on lines +13 to +19
fields:
width:
type: int16
height:
type: int16
data:
type: binary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For components that only add columns, is there a need to specify existing columns in the produces section?

cc @RobbeSneyders

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory not, but I'm not sure if it works in practice already if you leave them out.

Comment on lines 19 to 20
description: A list containing the original hub image column names. Used to format the image
from HF hub format to a byte string
Copy link
Contributor

@NielsRogge NielsRogge Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description: A list containing the original hub image column names. Used to format the image
from HF hub format to a byte string
description: Optional argument, a list containing the original image column names in case the dataset on the hub contains them. Used to format the image from HF hub format to a byte string.

from HF hub format to a byte string
type: list
default: None
n_rows_to_load:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
n_rows_to_load:
num_rows_to_load:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, he just switched this from nb_rows_to_load on my request 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're both clear ;P I'll leave it as is for now

columns={"image": "images_data", "text": "captions_data"}
)
# 2) Make sure images are bytes instead of dicts
if image_column_names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if image_column_names:
if image_column_names is not None:

This is usually clearer

)

# 3) Rename columns
dask_df = dask_df.rename(columns=column_name_mapping)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't create hierarchical columns right?

Is this necessary given that we now use them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the columns are still stored as {subset}_{field} in parquet. They are only transformed to hierarchical columns in the Pandas component.

Base automatically changed from feature/local-starcoder to main June 20, 2023 12:56
# Map image column to hf data format
feature_encoder = datasets.Image(decode=True)

if image_column_names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if image_column_names:
if image_column_names is not None:

)

# Map column names to hf data format
if column_name_mapping:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if column_name_mapping:
if column_name_mapping is not None:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the mapping is empty, we don't need to do this either.

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @PhilippeMoussalli!

@RobbeSneyders RobbeSneyders merged commit 26dfad8 into main Jun 21, 2023
@RobbeSneyders RobbeSneyders deleted the generic_read_write_component branch June 21, 2023 07:16
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Draft PR for implementing generic read/write components. 

Note: the changes from `feature/local-starcoder` were merged to this
branch to make testing easier. Merge this PR after the local startcoder
PR.

Things to do: 

- [x] Modify `from_registry` method to enable passing path to custom
spec
- [x] Figure out how to present spec template to user. Should we keep
the image/args and just add a todo in the `produces` and `consumes`
(open for suggestions)
- [x] Switch over all pipelines to use the generic load/write and remove
the custom ones
- [x] Add components to affected pipelines to add missing metadata (e.g.
width/height)
- [x] Add documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Modify components that read/write to HF hub to be generic
5 participants