Start from dataset schema for lightweight python component `consumes` #789

RobbeSneyders · 2024-01-17T13:29:35Z

Fixes #785

Opening this as a draft PR since it's not yet clear to me what the desired behavior is.

I'll be using the "inner" / "outer" terminology which we already use in our OperationSpec class to explain. "inner" schemas are the schemas that the Python component consumes / produces. "outer" schemas are the schemas that the DataIO layer consumes / produces.

For docker components, the logic works as follows:

The consumes section in the component spec is the "inner" schema
We leverage the consumes argument of the apply method to calculate the "outer" schema from the "inner" schema.

For lightweight python components, we do not have a component spec to start from. So what I currently implemented is this:

We start from the dataset schema and reverse alter it with the consumes argument to calculate the "inner" schema.
We leverage the consumes argument of the apply method to calculate the "outer" schema from the "inner" schema.

This works, but has one big downside. Since we start from the dataset schema, the calculated "inner" / "outer" consumes contain all the fields in the dataset. In other words, the lack of a component spec removes the ability to select which columns from the dataset to load. Since this is an important part of our optimization, I think we need to find a way around this.
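To make the downside concrete, here is a minimal sketch of the "reverse alter" step. It is not the Fondant implementation: schemas are modeled as plain dicts of field name to type name, and `inner_schema_from_dataset` is a hypothetical helper. It assumes the `consumes` argument of `apply` maps inner (component) names to outer (dataset) names.

```python
def inner_schema_from_dataset(dataset_schema, apply_consumes=None):
    """Derive the "inner" consumes schema from the dataset ("outer") schema.

    `apply_consumes` maps inner field names to dataset field names,
    e.g. {"chunk": "text"}. Hypothetical helper, for illustration only.
    """
    inner = dict(dataset_schema)  # start from *all* dataset fields
    for inner_name, outer_name in (apply_consumes or {}).items():
        # Rename the dataset field to the name the component expects.
        inner[inner_name] = inner.pop(outer_name)
    return inner


dataset_schema = {"text": "string", "url": "string", "width": "int32"}
inner = inner_schema_from_dataset(dataset_schema, {"chunk": "text"})
print(inner)
```

Even if the component only needs `chunk`, the fields `url` and `width` still end up in the result, which is the column selection problem described above.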

My best idea at this time is to expand the lightweight_component decorator to add support for this. But curious to hear if anyone has other ideas.

RobbeSneyders · 2024-01-17T13:30:07Z

src/fondant/pipeline/pipeline.py

+            consumes_spec = {k: v.type.to_json() for k, v in self.fields.items()}
+            if consumes:
+                for k, v in consumes.items():
+                    consumes_spec[k] = consumes_spec[v]
+


This needs more checks and might have to be moved to a better place. Just wanted to get a PoC working.

PhilippeMoussalli · 2024-01-17T14:41:01Z

My best idea at this time is to expand the lightweight_component decorator to add support for this. But curious to hear if anyone has other ideas.

would you then add a consumes section to the decorator? Is there a way to perhaps infer it directly from the component script?
We know the name of the dataframe variable and all consumed fields should be explicitly called

For example

  def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
      dataframe["a"] = dataframe["a"].map(lambda x: x + self.n)
      dataframe["b"] = dataframe["b"].map(lambda x: x + self.n)
      return dataframe

We can parse the script text and infer that both fields a and b should be consumed. There might be some edge cases that I didn't consider.

RobbeSneyders · 2024-01-17T16:38:41Z

would you then add a consumes section to the decorator?

That was my idea, yes. It should be optional though, keeping the standard case simple.

Is there a way to perhaps infer it directly from the component script? We know the name of the dataframe variable and all consumed fields should be explicitly called

I don't think this is feasible. If you just look at our own components, it's clear that there's a huge number of cases to consider. Fields could be accessed in other functions, within map_partitions, apply, ..., within a loop iterating over the rows, ...
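A quick sketch of what such inference could look like, and where it falls short. This is only an illustration using the standard library `ast` module (Python 3.9+ is assumed, since it relies on `Subscript.slice` being the key expression itself). `infer_consumed_fields` is a hypothetical helper, not part of Fondant.

```python
import ast
import textwrap


def infer_consumed_fields(source, df_name="dataframe"):
    """Collect string keys read via `<df_name>["..."]` in the given source."""
    fields = set()
    for node in ast.walk(ast.parse(textwrap.dedent(source))):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.ctx, ast.Load)  # reads only, not assignments
            and isinstance(node.value, ast.Name)
            and node.value.id == df_name
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
        ):
            fields.add(node.slice.value)
    return fields


# The simple case: fields are read directly with a literal key.
SIMPLE = '''
    def transform(self, dataframe):
        dataframe["a"] = dataframe["a"].map(lambda x: x + self.n)
        dataframe["b"] = dataframe["b"].map(lambda x: x + self.n)
        return dataframe
'''

# Harder cases: row-wise access and a field name stored in an attribute.
INDIRECT = '''
    def transform(self, dataframe):
        dataframe["out"] = dataframe.apply(lambda row: row["c"] + 1, axis=1)
        return dataframe[[self.field, "out"]]
'''

print(infer_consumed_fields(SIMPLE))
print(infer_consumed_fields(INDIRECT))
```

The first snippet yields both fields, but the second yields nothing, even though it reads `c` and whatever `self.field` points to. Covering such cases would need far more than a pattern match on subscripts.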

RobbeSneyders · 2024-01-18T09:41:28Z

would you then add a consumes section to the decorator?

That was my idea, yes. It should be optional though, keeping the standard case simple.

If so, it could look like this for the component in the sample pipeline:

@lightweight_component(consumes={"chunk": pa.string()})
class CalculateChunkLength(PandasTransformComponent):
    def __init__(self, feature_name: str, **kwargs):
        self.feature_name = feature_name

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe[self.feature_name] = dataframe["chunk"].apply(len)
        return dataframe


_ = dataset.apply(
    ref=CalculateChunkLength,
    consumes={"chunk": "text"},
    produces={"chunk_length": pa.int32()},
    arguments={"feature_name": "chunk_length"},
)

It might not have to contain the whole schema, since we can get that from the dataset schema. It could just be a list of the fields to consume:

@lightweight_component(consumes=["chunk"])
class CalculateChunkLength(PandasTransformComponent):
    def __init__(self, feature_name: str, **kwargs):
        self.feature_name = feature_name

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe[self.feature_name] = dataframe["chunk"].apply(len)
        return dataframe


_ = dataset.apply(
    ref=CalculateChunkLength,
    consumes={"chunk": "text"},
    produces={"chunk_length": pa.int32()},
    arguments={"feature_name": "chunk_length"},
)

We would just lose the ability to validate the types between these components.

Not sure what would be most straightforward for the user.
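To illustrate the trade-off between the two decorator forms, here is a rough sketch. `select_consumes` is hypothetical, types are plain strings rather than pyarrow types, and field names are assumed to be already remapped to the names the component uses.

```python
def select_consumes(dataset_schema, decorator_consumes):
    """Pick the fields to consume from the dataset schema.

    A dict ({name: type}) allows a type check against the dataset schema;
    a list ([name, ...]) only selects fields and skips that check.
    """
    if isinstance(decorator_consumes, dict):
        for name, expected in decorator_consumes.items():
            actual = dataset_schema[name]
            if actual != expected:
                msg = f"Field `{name}` has type {actual}, expected {expected}"
                raise TypeError(msg)
    return {name: dataset_schema[name] for name in decorator_consumes}


schema = {"chunk": "int32", "url": "string"}

# List form: a type mismatch goes unnoticed when the pipeline is defined.
selected = select_consumes(schema, ["chunk"])

# Dict form: the same mismatch is caught early.
try:
    select_consumes(schema, {"chunk": "string"})
    caught = False
except TypeError:
    caught = True
```

The list form is shorter to write, while the dict form keeps the early type validation between components.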

mrchtr · 2024-01-18T13:28:18Z

The decorator feels like the right place for this. Thinking about the user journey here, people will probably start by implementing PythonComponents and at some point try to share them. I think the decorator could become the counterpart to the fondant_component.yaml. Maybe this would make it really easy to build a reusable component: use the decorator information to generate a component spec, use the image as the base image, and use the Python code as the component main.

This would help us as well to explain the concepts easily. Decorator is the counterpart to the component_spec, source code the counterpart to the component functionality.

But I would indeed keep this optional. When people start implementing things, it is unlikely that they will immediately test with huge datasets. I guess we don't have to aim for efficient execution in the development phase, when people are using small datasets.

Do you see any disadvantages in using the decorator for it?

If so, it could look like this for the component in the sample pipeline:

_ = dataset.apply(
    ref=CalculateChunkLength,
    consumes={"chunk": "text"},
    produces={"chunk_length": pa.int32()},
    arguments={"feature_name": "chunk_length"},
)

Not sure if this is needed. When people use the decorator as a component spec, they might start naming the consumes and produces fields in the decorator itself so that they fit the available column names. In that case, I think consumes and produces aren't needed in .apply.

RobbeSneyders · 2024-01-18T15:12:32Z

Not sure if this is needed. When people use the decorator as a component spec, they might start naming the consumes and produces fields in the decorator itself so that they fit the available column names. In that case, I think consumes and produces aren't needed in .apply.

If this were the case, I would rather keep it in the .apply method, since it is supported there for reusable components as well. I think changing this for the lightweight components would be additional complexity.

But I agree with your other points:

Let the decorator work as a component spec "light". That way, it's easy to convert later, and usage of reusable and lightweight components is similar.
Make everything in the decorator optional so the default case is simple, but there's flexibility to achieve more advanced cases.

PhilippeMoussalli · 2024-01-23T08:05:21Z

Not sure if this is needed. When people use the decorator as a component spec, they might start naming the consumes and produces fields in the decorator itself so that they fit the available column names. In that case, I think consumes and produces aren't needed in .apply.

If this were the case, I would rather keep it in the .apply method, since it is supported there for reusable components as well. I think changing this for the lightweight components would be additional complexity.

But I agree with your other points:

Let the decorator work as a component spec "light". That way, it's easy to convert later, and usage of reusable and lightweight components is similar.

Make everything in the decorator optional so the default case is simple, but there's flexibility to achieve more advanced cases.

Updated the PR based on the discussions above.

By default, all the fields specified in the dataset schema will be consumed unless otherwise specified.
To consume only specific fields, the user can now pass a list of fields in the consumes argument of the decorator. Additionally, they can pass "generic" to specify that a component is generic.

mrchtr

Thanks @PhilippeMoussalli. Left a few small comments. Still want to test the changes.

mrchtr · 2024-01-23T08:48:44Z

tests/pipeline/test_python_component.py

    }

-    @lightweight_component()
+    @lightweight_component(consumes="generic")


I would prefer to have an empty consumes instead of passing "generic" here.

For me it would be fine to consume the whole dataset. It would make the interface simpler to use, at the cost of less efficient component execution. We should keep the base interface as simple as possible. Pipeline improvements will probably follow later in the development cycle.

I think we don't want to use the term "generic component". What is the issue when we pass None instead of the string?

I think the issue is there are still 3 general options that should be supported without the possibility to mix some of them together.

consumes is None -> non-generic component without specified fields -> consume all the fields in the dataset schema. This has no equivalent in the component spec.

consumes is specified as a list -> non-generic component with specified fields -> start from the fields in the dataset schema and filter based on the fields specified in the list. Equivalent to having specific fields in the component spec.

consumes == generic -> generic component that allows defining dynamic fields in the apply produces/consumes -> the fields to consume are the ones specified in the apply section. Equivalent to setting additionalProperties to true in the component spec.

The only solution would be to somehow mix the 1st and 3rd option but this would require us to change the component spec to support both dynamic and specified fields which is not something we currently support
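The three options can be sketched as a single dispatch. This is a simplified illustration rather than the code in this PR: `resolve_consumes` is a hypothetical name, schemas are plain dicts with type names as strings, and for the generic case the apply `consumes` is assumed to map field names to types.

```python
def resolve_consumes(dataset_schema, decorator_consumes=None, apply_consumes=None):
    """Return the consumes schema for the three cases listed above."""
    if decorator_consumes is None:
        # Option 1: consume every field in the dataset schema.
        return dict(dataset_schema)
    if decorator_consumes == "generic":
        # Option 3: the fields come from the `consumes` passed to `apply`.
        return dict(apply_consumes or {})
    # Option 2: keep only the fields listed in the decorator.
    return {name: dataset_schema[name] for name in decorator_consumes}


schema = {"x": "int32", "y": "int32", "z": "int32"}

all_fields = resolve_consumes(schema)
listed = resolve_consumes(schema, ["x"])
generic = resolve_consumes(schema, "generic", {"x": "int32", "y": "int32"})
```

Each call takes exactly one branch, which is why the options cannot be mixed without changing what the component spec supports.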

@mrchtr updated based on the feedback, consumes is None now simply means to consume all fields from previous components. We lose the concept of additionalProperties for the consumes section in python based components. However I don't think it would be required since we're inferring the fields based on the dataset schema

This is more evident here

@lightweight_component(
    base_image="python:3.8-slim-buster",
    extra_requires=["pandas", "dask"],
)
class CreateData(DaskLoadComponent):
    def load(self) -> dd.DataFrame:
        ....

dataset = pipeline.read(
    ref=CreateData,
    produces={"x": pa.int32(), "y": pa.int32(), "z": pa.int32()},
)

# dataset schema has x,y,z

@lightweight_component
class AddN(PandasTransformComponent):
    def __init__(self, n: int, **kwargs):
        self.n = n

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe["x"] = dataframe["x"].map(lambda x: x + self.n)
        return dataframe

_ = dataset.apply(
    ref=AddN,
    produces={"x": pa.int32(), "y": pa.int32(), "z": pa.int32()},
    consumes=None,  # This now has to be defined as None since we can't define dynamic fields but we can already infer the schema based on the dataset
    arguments={"n": 1},
)

I think both options are valuable with small tradeoffs.

The previous one would not require any changes to how consumes is passed to the apply function, but has a slightly more complex interface for the lightweight component.

This approach offers more flexibility in the lightweight component, but a slightly different way of defining consumes compared to reusable components.

Happy to hear other takes on this @RobbeSneyders @GeorgesLorre

Agree with not supporting consumes=generic. The additionalProperties: true is only needed because the user cannot easily change the schema of a reusable component. But they can easily change the schema of a lightweight one. So there's no need for it here.

Lightweight Python components can still be implemented in a generic way without it. It just means that the implementation of the component depends on the consumes argument it receives.

src/fondant/pipeline/lightweight_component.py

RobbeSneyders

Thanks @PhilippeMoussalli!

Left some comments on the structure of the code.

examples/sample_pipeline/pipeline.py

RobbeSneyders · 2024-01-25T09:43:53Z

src/fondant/pipeline/lightweight_component.py

@@ -110,6 +126,41 @@ def validate_abstract_methods_are_implemented(cls):
                    msg,
                )

+        def modify_consumes_spec(apply_consumes, consumes_spec):


Isn't this functionality already available in the OperationSpec class? I would try to keep the implementation in line with the custom components:

Build the consumes specification based on the decorator consumes

Leverage both the consumes specification and consumes apply argument in the OperationSpec

so the flow currently goes like this
-> get consumes_spec (remap names if necessary and filter) -> pass it to the consumes section of the component spec (actual consumed fields) -> build the OperationOp

The reason why we're modifying things in the decorator is that the component spec should already have the mapped name. In reusable components this is usually the case, but here it's a bit different.

Let's take an example where we have a dataset with a field x and the next component accepts a field a.

The component spec for the reusable component would already have the field a in its schema. For lightweight components, we start with the dataset schema (x), which needs to be remapped to a before being passed to the component spec and then the componentOp.

To replicate the issue, you can comment out the mapping and filter functions (modify_consumes_spec and filter_consumes_spec) as follows and run this test:

@classmethod
def get_consumes_spec(
    cls,
    dataset_fields: t.Mapping[str, Field],
    apply_consumes: t.Optional[t.Dict[str, t.Union[str, pa.DataType]]],
):
    python_component_consumes = cls.consumes()

    # Get consumes spec from the dataset
    consumes_spec = {k: v.type.to_dict() for k, v in dataset_fields.items()}

    # Modify naming based on the consumes argument in the 'apply' method
    # consumes_spec = cls.modify_consumes_spec(apply_consumes, consumes_spec)

    # Filter for values that are not in the user defined consumes list
    # consumes_spec = cls.filter_consumes_spec(
    #     python_component_consumes,
    #     consumes_spec,
    # )

    return consumes_spec

E fondant.core.exceptions.InvalidPipelineDefinition: Received a string value for key `a` in the `consumes` argument passed to the operation, but `a` is not defined in the `consumes` section of the component spec.

The flow I suggest is the following:

# First build the equivalent of the component spec
consumes_spec = create_consumes_spec(dataset_fields, python_component_consumes)
component_spec = ComponentSpec(consumes_spec, ...)

# Then apply the same logic as for containerized components
operation_spec = OperationSpec(component_spec, apply_consumes)

I think the issue will still persist:

The component spec should always contain the names of the columns that will be consumed by the component

This was not an issue before since we have it explicitly written in a yaml

Now we are starting from the dataset schema/fields which may have different names than the ones expected by the component, thus we need to dynamically infer the component spec.

The dataset_fields and python_component_consumes (which just contains a list of columns to consume) alone are not enough to get the ground truth component spec. For that, we need to remap it using the apply consumes before passing it to the OperationSpec, in order to get to the actual component spec expected by the component.

For example, starting from a dataset_fields of x,y and assuming python_component_consumes is [x], we cannot infer the component spec for the next component which may expect z -> leads to errors.

I think that's the main reason why we need to include some custom logic beforehand
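A small sketch of why the remap has to happen before the filter. It uses plain dicts instead of Fondant's Field and spec classes, `remap_names` and `filter_fields` are hypothetical stand-ins for the mapping and filter functions, and the decorator list is assumed to hold the names the component itself uses.

```python
def remap_names(consumes_spec, apply_consumes):
    """Rename dataset fields to the names the component expects."""
    remapped = dict(consumes_spec)
    for component_name, dataset_name in (apply_consumes or {}).items():
        remapped[component_name] = remapped.pop(dataset_name)
    return remapped


def filter_fields(consumes_spec, decorator_consumes):
    """Keep only the fields listed in the decorator (all fields if None)."""
    if decorator_consumes is None:
        return dict(consumes_spec)
    return {name: consumes_spec[name] for name in decorator_consumes}


dataset_fields = {"x": "int32", "y": "int32"}
apply_consumes = {"a": "x"}  # component field `a` reads dataset field `x`
decorator_consumes = ["a"]  # the component only consumes `a`

# Remap first, then filter: the resulting spec uses the component's names.
spec = filter_fields(remap_names(dataset_fields, apply_consumes), decorator_consumes)

# Skipping the remap leaves `a` undefined in the spec.
try:
    filter_fields(dataset_fields, decorator_consumes)
    remap_needed = False
except KeyError:
    remap_needed = True
```

Without the remap, `a` cannot be found among the dataset fields, which mirrors the InvalidPipelineDefinition error shown above.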

src/fondant/pipeline/lightweight_component.py

RobbeSneyders · 2024-01-25T09:51:45Z

src/fondant/pipeline/lightweight_component.py

@@ -121,6 +172,32 @@ class PythonComponentOp(cls, PythonComponent):
            def image(cls) -> Image:
                return image

+            @classmethod


You can make this a class property by combining the classmethod and property decorators.

hmm it doesn't seem correct, should I apply getters and setters

It should be reversed, but seems like it only works for Python 3.9 and 3.10 (docs).

Just making it a class attribute could work as well:

class BaseClass:
    consumes: ConsumesType

class Class(BaseClass):
    consumes = consumes_  # cannot be the same name

hmm I still don't quite follow what should be done here, aren't the class methods needed for the decorators? what's the need for attributes in this example?
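For reference, a minimal sketch of the class attribute variant in the context of a decorator. The decorator below is a simplified stand-in for `lightweight_component`, and `ConsumesType` is an assumed alias. Neither reflects the actual Fondant code.

```python
import typing as t

ConsumesType = t.Optional[t.List[str]]  # assumed alias, for illustration


class BaseClass:
    consumes: ConsumesType = None


def lightweight_component(consumes: ConsumesType = None):
    # Renamed copy: inside the class body below, `consumes = consumes`
    # would not resolve to this function's argument.
    consumes_ = consumes

    def wrapper(cls):
        class PythonComponentOp(cls, BaseClass):
            consumes = consumes_

        return PythonComponentOp

    return wrapper


@lightweight_component(consumes=["chunk"])
class CalculateChunkLength:
    pass


print(CalculateChunkLength.consumes)
```

The value is then readable on the class itself without calling a method, so there is no need to stack `classmethod` and `property`.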

RobbeSneyders · 2024-01-25T09:53:46Z

src/fondant/pipeline/pipeline.py

+    def from_ref(
+        cls,
+        ref: t.Any,
+        fields: t.Optional[t.Mapping[str, Field]] = None,


I would prefer to keep this fields argument out of here since this is specific to the lightweight Python components. Can we move this to the PythonComponent class instead?

I don't see a straightforward way of doing this, unless we somehow pass the fields to the BaseComponent class, since this is what the PythonComponent starts from, but I'm also not sure how feasible that is. Any other suggestions?

My original implementation did this in the Dataset class. So I would assume we can just call a method on the PythonComponent class at that point?

Ok, this might make less sense after the refactoring on main since my first commits. If we can address my comment above, it's fine for me to keep it like this for now. Would be good to add the argument to the docstring though.


PhilippeMoussalli · 2024-01-30T11:30:53Z

tests/pipeline/test_python_component.py

    }
    pipeline._validate_pipeline_definition(run_id="dummy-run-id")

    DockerCompiler().compile(pipeline)


+def test_consumes_mapping_all_fields(tmp_path_factory, load_pipeline):


@RobbeSneyders this test is still failing. We are facing the same issue as discussed before when loading all the fields (consumed spec needs to be remapped based on the apply consumes before passing to the componentOp). Any other way of tackling this?

The rest of the cases (generic and specific consumes below) seem to work well

The more I think about it, the more I feel like it makes less sense to just pass in the default dataset schema. It doesn't seem to fit in nicely with the current paradigm we have

I see two options:

We accept the current behavior: you can't remap field names in the apply consumes if the component doesn't explicitly define consumes in the decorator.

As you mention, include the remapping when calculating the default consumes. I think this is less of an issue than before, since this is no longer happening if the consumes is specified explicitly in the decorator.

Any other ideas?

If not, I would probably vote for the second one mentioned above.

Indeed, the second option seems to be the most logical for now, will update it

RobbeSneyders

Thanks @PhilippeMoussalli.

Looks good to me, but I can't approve since this was originally my PR 😅

mrchtr

Thanks @PhilippeMoussalli. Found some minor things.

mrchtr · 2024-01-30T14:31:16Z

src/fondant/pipeline/lightweight_component.py

+        pass
+
+    @classmethod
+    def modify_consumes_spec(cls, apply_consumes, consumes_spec):


nitpick: Can we add typing here too.

mrchtr · 2024-01-30T14:33:48Z

src/fondant/pipeline/lightweight_component.py

+        return consumes_spec
+
+    @classmethod
+    def get_consumes_spec(


I think a docstring would be good.

mrchtr · 2024-01-30T14:36:36Z

src/fondant/pipeline/lightweight_component.py

+            # Get consumes spec from the dataset
+            spec_consumes = {k: v.type.to_dict() for k, v in dataset_fields.items()}
+
+            spec_consumes = cls.modify_consumes_spec(apply_consumes, spec_consumes)


Nitpick: I would use consistent naming, so either consumes_spec or spec_consumes.
Maybe we can use inside the modify_consumes_spec something like dataset_consumes to make even more clear.

mrchtr · 2024-01-30T14:43:43Z

src/fondant/pipeline/lightweight_component.py

+                else:
+                    msg = (
+                        f"Invalid data type for field `{k}` in the `apply_consumes` "
+                        f"argument. Only string and pa.DataType are allowed."


If I see it correctly, we are not checking if the type is pa.DataType?

Good catch. Only strings are allowed when no consumes is passed at the decorator level.
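A sketch of what the stricter check could look like for the default case. It is only an assumption of the intended behavior: `remap_default_consumes` is a hypothetical name, schemas are plain dicts, and the choice of TypeError is illustrative.

```python
def remap_default_consumes(apply_consumes, dataset_consumes):
    """Remap dataset field names when the decorator defines no consumes."""
    remapped = dict(dataset_consumes)
    for k, v in apply_consumes.items():
        if not isinstance(v, str):
            # Without decorator-level consumes, a value can only be the name
            # of an existing dataset field, so anything else is rejected.
            msg = (
                f"Invalid data type for field `{k}` in the `apply_consumes` "
                "argument. Only strings are allowed."
            )
            raise TypeError(msg)
        remapped[k] = remapped.pop(v)
    return remapped


dataset_consumes = {"x": "int32", "y": "int32"}
remapped = remap_default_consumes({"a": "x"}, dataset_consumes)

try:
    remap_default_consumes({"a": 42}, dataset_consumes)
    rejected = False
except TypeError:
    rejected = True
```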

Update lightweight docs based on #789

RobbeSneyders added 4 commits January 17, 2024 12:01

Integrate argument inference

7a6b78a

Add compilation to python component test

c007244

Add argument inference to integration test

e52a5c4

Start from dataset schema for python component consumes

161f214

RobbeSneyders requested review from mrchtr, GeorgesLorre and PhilippeMoussalli January 17, 2024 13:29

RobbeSneyders commented Jan 17, 2024

Base automatically changed from feature/integrate-arg-inference to main January 18, 2024 16:37

PhilippeMoussalli added 4 commits January 22, 2024 18:14

add option to define consumes in mapping

66a9103

add option to define consumes and generic in mapping

2e10af1

Merge branch 'feature/python-consumes-mapping-3' into feature/python-…

e8d763f

…consumes-mapping

small fixes

6619b3a

PhilippeMoussalli force-pushed the feature/python-consumes-mapping branch 2 times, most recently from a2be107 to 0d2ba12 Compare January 23, 2024 07:25

make lightweight consumes generic by default

d898e4a

PhilippeMoussalli force-pushed the feature/python-consumes-mapping branch from 0d2ba12 to d898e4a Compare January 23, 2024 07:26

PhilippeMoussalli marked this pull request as ready for review January 23, 2024 07:30

PhilippeMoussalli added 2 commits January 23, 2024 08:36

Merge branch 'main' into feature/python-consumes-mapping

2d80a77

revert to desired behaviour

cef482a

mrchtr reviewed Jan 23, 2024

update sample pipeline

8c9d154

PhilippeMoussalli force-pushed the feature/python-consumes-mapping branch from 95a5a72 to 0619c41 Compare January 23, 2024 12:43

update based on feedback

4c97282

PhilippeMoussalli force-pushed the feature/python-consumes-mapping branch from 0619c41 to 4c97282 Compare January 23, 2024 12:58

RobbeSneyders commented Jan 25, 2024

PhilippeMoussalli added 3 commits January 25, 2024 13:54

implement PR feedback

3ab1bae

add docstrings

b59fb8c

update consumes based on new proposal

de5a3c1

RobbeSneyders commented Jan 30, 2024

Merge branch 'main' into feature/python-consumes-mapping

3943c4b

PhilippeMoussalli force-pushed the feature/python-consumes-mapping branch from f16dc2a to 3943c4b Compare January 30, 2024 11:23

Update src/fondant/pipeline/lightweight_component.py

d8e5563

Co-authored-by: Robbe Sneyders <[email protected]>

PhilippeMoussalli force-pushed the feature/python-consumes-mapping branch from 1def2c3 to d8e5563 Compare January 30, 2024 11:26

PhilippeMoussalli reviewed Jan 30, 2024

enable default behavior of passing all dataset fields

85f0994

PhilippeMoussalli mentioned this pull request Jan 30, 2024

Update lightweight docs #827

Merged

RobbeSneyders commented Jan 30, 2024

mrchtr reviewed Jan 30, 2024

implement PR feedback

5b69298

PhilippeMoussalli force-pushed the feature/python-consumes-mapping branch from a904fe5 to 5b69298 Compare January 30, 2024 15:04

mrchtr approved these changes Jan 30, 2024

GeorgesLorre and others added 2 commits January 30, 2024 16:29

Merge branch 'main' into feature/python-consumes-mapping

12c6f37

Merge branch 'main' into feature/python-consumes-mapping

60dc6f6

PhilippeMoussalli merged commit 9070b82 into main Jan 30, 2024
2 of 3 checks passed

PhilippeMoussalli deleted the feature/python-consumes-mapping branch January 30, 2024 16:17

RobbeSneyders pushed a commit that referenced this pull request Jan 30, 2024

Update lightweight docs (#827)

8e0ec82

Update lightweight docs based on #789
