dataset = dataset.apply(
    "chunk_text",
    arguments={"chunk_size": 10, "chunk_overlap": 2},
    consumes={"text": "text_data"},
)


@lightweight_component(
    base_image="python:3.8",
    extra_requires=[
        # TODO: remove once we have a default base image
        "fondant[component]@git+https://github.com/ml6team/fondant@3f3d8ad",
    ],
)
class CalculateChunkLength(PandasTransformComponent):
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe["chunk_length"] = dataframe["chunk"].apply(len)
        return dataframe


_ = dataset.apply(
    ref=CalculateChunkLength,
    consumes={"chunk": "text"},
    produces={"chunk_length": pa.int32()},
)
leads to the following error:
fondant.core.exceptions.InvalidPipelineDefinition: Received a string value for key chunk in the `consumes` argument passed to the operation, but chunk is not defined in the `consumes` section of the component spec.
…#789)
Fixes #785
Opening this as a draft PR since it's not yet clear to me what the
desired behavior is.
I'll be using the "inner" / "outer" terminology, which we already use in
our `OperationSpec` class, to explain. "Inner" schemas are the schemas
that the Python component consumes / produces. "Outer" schemas are the
schemas that the `DataIO` layer consumes / produces.
For docker components, the logic works as follows:
1. The `consumes` section in the component spec is the "inner" schema
2. We leverage the `consumes` argument of the `apply` method to
calculate the "outer" schema from the "inner" schema.
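
As a rough illustration of the two steps above, here is a minimal sketch that uses plain dictionaries rather than Fondant's actual `OperationSpec` code. The function name `outer_consumes`, the string type labels, and the `ValueError` are assumptions made for this example only. It maps each "inner" field name to the dataset column named in the `consumes` argument, and rejects keys that the "inner" schema does not define, which is the situation the error in the issue reports.

```python
from typing import Dict


def outer_consumes(inner: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Rename inner field names to dataset column names.

    `inner` stands in for the `consumes` section of a component spec,
    `mapping` for the `consumes` argument of `apply` (inner name -> column name).
    """
    unknown = [name for name in mapping if name not in inner]
    if unknown:
        # Stand-in for the InvalidPipelineDefinition error reported in the issue.
        raise ValueError(f"{unknown} not defined in the inner consumes schema")
    # Fields without an entry in the mapping keep their own name.
    return {mapping.get(name, name): dtype for name, dtype in inner.items()}


docker_inner = {"text": "string"}
docker_outer = outer_consumes(docker_inner, {"text": "text_data"})
```

With an empty "inner" schema, which is roughly what a lightweight component without a spec amounts to in this sketch, `outer_consumes({}, {"chunk": "text"})` raises.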
For lightweight python components, we do not have a component spec to
start from. So what I currently implemented is this:
1. We start from the dataset schema and apply the `consumes` argument
in reverse to calculate the "inner" schema.
2. We leverage the `consumes` argument of the `apply` method to
calculate the "outer" schema from the "inner" schema.
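
The reverse step for lightweight components can be sketched in the same dictionary-based style. This is an illustration under assumptions, not the code in this PR: `inner_from_dataset`, the string type labels, and the `source` column are hypothetical.

```python
from typing import Dict


def inner_from_dataset(dataset_schema: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Derive an inner schema from the dataset schema.

    `mapping` is the `consumes` argument of `apply` (inner name -> column name),
    so it is inverted before renaming the dataset columns.
    """
    inverse = {column: inner for inner, column in mapping.items()}
    # Columns that are not mentioned in the mapping keep their own name.
    return {inverse.get(column, column): dtype for column, dtype in dataset_schema.items()}


# Hypothetical dataset schema with one column the component never uses.
dataset_schema = {"text": "string", "source": "string"}
inner = inner_from_dataset(dataset_schema, {"chunk": "text"})
```

In this sketch `inner` holds `chunk` as well as `source`, because every dataset column is carried over.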
This works, but has one big downside. Since we start from the dataset
schema, the calculated "inner" / "outer" consumes contain all the fields
in the dataset. In other words, the lack of a component spec removes the
ability to select which columns from the dataset to load. Since this is
an important part of our optimization, I think we need to find a way
around this.
My best idea at this time is to expand the `lightweight_component`
decorator to add support for this. But curious to hear if anyone has
other ideas.
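
To make the decorator idea a bit more concrete, here is one possible shape for it. It is a hypothetical sketch and not the existing `lightweight_component` API: the `lightweight_component_sketch` decorator, its `consumes` parameter, the `declared_consumes` attribute, and the string type labels are all invented for illustration.

```python
from typing import Callable, Dict, Optional


def lightweight_component_sketch(
    consumes: Optional[Dict[str, str]] = None,
) -> Callable[[type], type]:
    """Hypothetical decorator that records the fields a component reads."""

    def wrap(cls: type) -> type:
        cls.declared_consumes = consumes
        return cls

    return wrap


def inner_consumes(
    component: type, dataset_schema: Dict[str, str], mapping: Dict[str, str]
) -> Dict[str, str]:
    """Prefer the declared inner schema; otherwise infer it from the dataset."""
    declared = getattr(component, "declared_consumes", None)
    if declared is not None:
        return dict(declared)
    # Fallback: reverse the mapping over the full dataset schema (all columns).
    inverse = {column: inner for inner, column in mapping.items()}
    return {inverse.get(column, column): dtype for column, dtype in dataset_schema.items()}


@lightweight_component_sketch(consumes={"chunk": "string"})
class WithDeclaration:
    pass


@lightweight_component_sketch()
class WithoutDeclaration:
    pass


schema = {"text": "string", "source": "string"}
selected = inner_consumes(WithDeclaration, schema, {"chunk": "text"})
everything = inner_consumes(WithoutDeclaration, schema, {"chunk": "text"})
```

Here `selected` only holds `chunk`, so only the `text` column would need to be loaded, while `everything` still carries the unused `source` column.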
---------
Co-authored-by: Philippe Moussalli <[email protected]>
Co-authored-by: Georges Lorré <[email protected]>
Raised at: fondant/src/fondant/core/component_spec.py, line 488 in abcd36f