Enable mapping of field names with consumes argument for lightweight components #785

Closed
Tracked by #558
RobbeSneyders opened this issue Jan 16, 2024 · 0 comments · Fixed by #789

RobbeSneyders commented Jan 16, 2024

The following code:

dataset = dataset.apply(
    "chunk_text",
    arguments={"chunk_size": 10, "chunk_overlap": 2},
    consumes={"text": "text_data"},
)


@lightweight_component(
    base_image="python:3.8",
    extra_requires=[
        # TODO: remove once we have a default base image
        "fondant[component]@git+https://github.com/ml6team/fondant@3f3d8ad",
    ],
)
class CalculateChunkLength(PandasTransformComponent):
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe["chunk_length"] = dataframe["chunk"].apply(len)
        return dataframe


_ = dataset.apply(
    ref=CalculateChunkLength,
    consumes={"chunk": "text"},
    produces={"chunk_length": pa.int32()},
)

leads to the following error:

fondant.core.exceptions.InvalidPipelineDefinition: Received a string value for key chunk in the `consumes` argument passed to the operation, but chunk is not defined in the `consumes` section of the component spec.

Raised at:

raise InvalidPipelineDefinition(msg)

@RobbeSneyders RobbeSneyders added the Core Core framework label Jan 17, 2024
@RobbeSneyders RobbeSneyders self-assigned this Jan 17, 2024
@RobbeSneyders RobbeSneyders moved this from Backlog to Ready for development in Fondant development Jan 17, 2024
@RobbeSneyders RobbeSneyders moved this from Ready for development to In Progress in Fondant development Jan 17, 2024
@RobbeSneyders RobbeSneyders moved this from In Progress to Validation in Fondant development Jan 18, 2024
PhilippeMoussalli added a commit that referenced this issue Jan 30, 2024
…#789)

Fixes #785 

Opening this as a draft PR since it's not yet clear to me what the
desired behavior is.

I'll be using the "inner" / "outer" terminology which we already use in
our `OperationSpec` class to explain. "Inner" schemas are the schemas
that the Python component consumes / produces. "Outer" schemas are the
schemas that the `DataIO` layer consumes / produces.

For Docker components, the logic works as follows:
1. The `consumes` section in the component spec is the "inner" schema
2. We leverage the `consumes` argument of the `apply` method to
calculate the "outer" schema from the "inner" schema.
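As a rough illustration of step 2, the sketch below renames "inner" field names to "outer" ones using a `consumes` mapping. It uses plain dicts rather than Fondant's `OperationSpec`, and it assumes (based on the error message in this issue) that the mapping keys are component-side names and the string values are dataset-side names. The helper `outer_from_inner` and the string type labels are hypothetical, not Fondant's API.

```python
from typing import Dict


def outer_from_inner(inner: Dict[str, str], consumes: Dict[str, str]) -> Dict[str, str]:
    """Rename inner (component-side) fields to their outer (dataset-side) names.

    Hypothetical helper: schemas are plain {field_name: type_label} dicts.
    """
    unknown = set(consumes) - set(inner)
    if unknown:
        # Similar in spirit to the error reported in this issue,
        # but not Fondant's actual validation code.
        raise ValueError(
            f"{sorted(unknown)} not defined in the inner consumes schema"
        )
    # Fields without a mapping keep their name.
    return {consumes.get(name, name): dtype for name, dtype in inner.items()}


# The inner schema would come from the component spec's `consumes` section.
inner_schema = {"text": "string"}
outer_schema = outer_from_inner(inner_schema, {"text": "text_data"})
print(outer_schema)
```

With a component spec available, a key such as `chunk` that is missing from the inner schema is rejected, which matches the situation described in the issue.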

For lightweight Python components, we do not have a component spec to
start from. So what I currently implemented is this:
1. We start from the dataset schema and apply the `consumes` argument
in reverse to calculate the "inner" schema.
2. We leverage the `consumes` argument of the `apply` method to
calculate the "outer" schema from the "inner" schema.
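The two steps above can be sketched with plain dicts. This is only an illustration under assumptions: the helpers `inner_from_dataset` and `outer_from_inner`, the `unused_field` column, and the string type labels are made up for the example and are not part of Fondant.

```python
from typing import Dict


def inner_from_dataset(dataset_schema: Dict[str, str], consumes: Dict[str, str]) -> Dict[str, str]:
    """Step 1: apply the `consumes` mapping in reverse to the dataset schema."""
    # `consumes` maps inner name -> outer name, so invert it first.
    outer_to_inner = {outer: inner for inner, outer in consumes.items()}
    return {outer_to_inner.get(name, name): dtype for name, dtype in dataset_schema.items()}


def outer_from_inner(inner: Dict[str, str], consumes: Dict[str, str]) -> Dict[str, str]:
    """Step 2: apply the `consumes` mapping forward to the inner schema."""
    return {consumes.get(name, name): dtype for name, dtype in inner.items()}


# Hypothetical dataset schema: the component only needs `text`.
dataset_schema = {"text": "string", "unused_field": "string"}
consumes = {"chunk": "text"}

inner = inner_from_dataset(dataset_schema, consumes)
outer = outer_from_inner(inner, consumes)
print(inner)
print(outer)
```

Because the calculation starts from the dataset schema, the round trip returns every dataset field, including `unused_field`, which the component never reads.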

This works, but has one big downside. Since we start from the dataset
schema, the calculated "inner" / "outer" consumes contain all the fields
in the dataset. In other words, the lack of a component spec removes the
ability to select which columns from the dataset to load. Since this is
an important part of our optimization, I think we need to find a way
around this.

My best idea at this time is to expand the `lightweight_component`
decorator to add support for selecting the consumed fields. But I'm
curious to hear if anyone has other ideas.
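One way to picture the decorator idea is the sketch below. It is purely hypothetical: `sketch_lightweight_component` is a stand-in and not Fondant's `lightweight_component`, the `consumes` parameter and the `declared_consumes` attribute are invented for the example, and schemas are plain dicts. The point is only that a declared "inner" schema would let the framework select columns again, with the dataset schema as a fallback.

```python
from typing import Dict, Optional


def sketch_lightweight_component(consumes: Optional[Dict[str, str]] = None):
    """Hypothetical decorator that records which fields a component consumes."""
    def wrap(cls):
        cls.declared_consumes = consumes
        return cls
    return wrap


def resolve_inner(component_cls, dataset_schema: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Prefer the declared inner schema; otherwise derive it from the dataset."""
    declared = getattr(component_cls, "declared_consumes", None)
    if declared is not None:
        return dict(declared)
    # Fallback: reverse the inner -> outer mapping over the full dataset schema.
    outer_to_inner = {outer: inner for inner, outer in mapping.items()}
    return {outer_to_inner.get(name, name): dtype for name, dtype in dataset_schema.items()}


@sketch_lightweight_component(consumes={"chunk": "string"})
class CalculateChunkLength:
    pass


class NoDeclaration:
    pass


dataset_schema = {"text": "string", "unused_field": "string"}
mapping = {"chunk": "text"}

selected = resolve_inner(CalculateChunkLength, dataset_schema, mapping)
fallback = resolve_inner(NoDeclaration, dataset_schema, mapping)
print(selected)
print(fallback)
```

In this sketch only the declared `chunk` field ends up in the inner schema, while the undecorated class still pulls in every dataset field.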

---------

Co-authored-by: Philippe Moussalli
Co-authored-by: Georges Lorré
@github-project-automation github-project-automation bot moved this from Validation to Done in Fondant development Jan 30, 2024