Support applying Lightweight Python components in Pipeline SDK #770

Merged
5 commits merged into main from feature/python-component on Jan 16, 2024

@GeorgesLorre GeorgesLorre commented Jan 10, 2024

No description provided.

"additionalProperties": True,
},
}
component_spec = ComponentSpec(spec_dict)
Collaborator Author

This could possibly be refactored by creating an operation_spec directly.

Member

This is currently one of the main jobs of the ComponentOp. If we would do this here, we would need to do it in different places for the docker components as well. I think it makes more sense to do this in the ComponentOp than in the Dataset class.

@@ -181,6 +179,22 @@ def __init__(

self.resources = resources or Resources()

@classmethod
def from_component_yaml(cls, path, **kwargs):
Collaborator Author

this will be the entrypoint for reusable and custom components (the ones that the python_component does not cover)



@dataclass
class Image:
Collaborator Author

This needs way more functionality to actually make the images with extra dependencies and the script available

Member

Isn't that the job of the runners?

Collaborator Author

Yes but I guess we'll have shared functionality between runners that could be abstracted here


return PythonComponent

return wrapper
Collaborator Author

It works but I'm not completely convinced this is the way forward. Essentially we subclass the Component and add an image property with the details provided in the decorator arguments.

Member

The other possibility would be to make the original component an attribute of the PythonComponent. It seems to be a trade-off between deep inheritance and having the PythonComponent act like a regular component.

Since I don't see any specific downsides to the deep inheritance in this case, I would be inclined to follow your subclass proposal here.
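For readers following the thread, the subclass approach discussed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the stand-in `Image` and `PythonComponent` classes, the `_Wrapped` name, and the toy `CreateData` component are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Image:
    # Stand-in for the Image dataclass quoted in this review.
    base_image: str = "fondant:latest"
    extra_requires: Optional[List[str]] = None


class PythonComponent:
    """Stand-in marker class; image() is supplied by the decorator."""

    @classmethod
    def image(cls) -> Image:
        raise NotImplementedError


def python_component(base_image: str = "fondant:latest", extra_requires=None):
    """Class decorator that subclasses the decorated class and attaches an Image."""

    def wrapper(cls):
        image_config = Image(base_image=base_image, extra_requires=extra_requires)

        # The generated class inherits from the user's class, so it still
        # behaves like a regular component, and from the marker class.
        class _Wrapped(cls, PythonComponent):
            @classmethod
            def image(cls) -> Image:
                return image_config

        # Keep the user's class name and docstring on the generated subclass.
        _Wrapped.__name__ = cls.__name__
        _Wrapped.__qualname__ = cls.__qualname__
        _Wrapped.__doc__ = cls.__doc__
        return _Wrapped

    return wrapper


@python_component(base_image="python:3.8-slim-buster", extra_requires=["pandas"])
class CreateData:
    """Toy component used only for this sketch."""

    def load(self):
        return [1, 2, 3]
```

The alternative mentioned above (composition) would store the original class as an attribute of the wrapper instead of inheriting from it, at the cost of the wrapper no longer passing `isinstance`/`issubclass` checks against the original component.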

@GeorgesLorre
Collaborator Author

I can recommend this talk on python decorators: https://www.youtube.com/watch?v=MjHpMCIvwsY


self.name = name
self.image = image
self.component_spec = component_spec
Collaborator Author

I need to check further but maybe we can do without the component_spec here and use the operation_spec along with the image from this point on.

Member

I would propose to unpack the component_spec already, and accept the following arguments here:

  • name
  • image
  • consumes
  • produces
  • arguments

That would also make it easier to test in Python code I believe.

Problem then is that we have consumes, produces, and arguments twice, with different meaning. Not sure yet how to solve this in the cleanest way.

Member

See my comment below on a possible solution where we still accept a component_spec here, but create the ComponentSpec based on arguments instead of a dict.

Member

@RobbeSneyders RobbeSneyders left a comment

Thanks @GeorgesLorre, starting to look good!

pipeline = Pipeline(name="dummy-pipeline", base_path="./data")

@python_component(
base_image="python:3.8-slim-buster",
Member

Here we'll need to provide a base docker image with Fondant installed, right?

Collaborator Author

@GeorgesLorre GeorgesLorre Jan 11, 2024

yes we need a sane default which we ideally configure in 1 place



Comment on lines 389 to 399
spec_dict = {
    "name": name,
    "description": "This is an example component",
    "image": "example_component:latest",
    "produces": {
        "additionalProperties": True,
    },
    "consumes": {
        "additionalProperties": True,
    },
}
Member

I'm not a fan of creating a dict here.

Possible solution, which would also solve my comment on the ComponentOp init arguments above, is that we refactor the ComponentSpec class to take separate init arguments instead:

class ComponentSpec:

    def __init__(self, name, image, consumes, produces, arguments):
        ...

    @classmethod
    def from_dict(cls, spec):
        name, image, consumes, produces, arguments = unpack(spec)
        return cls(name, image, consumes, produces, arguments)
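A runnable variant of the proposal above might look like the sketch below. It is only an illustration of the idea: the keyword-only optional arguments, the `description` field, and the dictionary keys read in `from_dict` (in particular `"arguments"`) are assumptions, and the `unpack` helper is replaced here by plain dictionary lookups.

```python
from typing import Any, Dict, Optional


class ComponentSpec:
    """Sketch of a spec built from separate arguments instead of a dict."""

    def __init__(
        self,
        name: str,
        image: str,
        *,
        description: Optional[str] = None,
        consumes: Optional[Dict[str, Any]] = None,
        produces: Optional[Dict[str, Any]] = None,
        arguments: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.name = name
        self.image = image
        self.description = description
        # Default to empty mappings so callers never have to handle None.
        self.consumes = consumes or {}
        self.produces = produces or {}
        self.arguments = arguments or {}

    @classmethod
    def from_dict(cls, spec: Dict[str, Any]) -> "ComponentSpec":
        # Alternate constructor for specs loaded from YAML or JSON.
        # The key names used here are assumptions for this sketch.
        return cls(
            spec["name"],
            spec["image"],
            description=spec.get("description"),
            consumes=spec.get("consumes"),
            produces=spec.get("produces"),
            arguments=spec.get("arguments"),
        )


# Building the spec from arguments, without an intermediate dict.
direct = ComponentSpec(
    "example_component",
    "example_component:latest",
    consumes={"additionalProperties": True},
    produces={"additionalProperties": True},
)

# Building the same spec from a dict, e.g. one parsed from a YAML file.
from_dict = ComponentSpec.from_dict(
    {
        "name": "example_component",
        "image": "example_component:latest",
        "consumes": {"additionalProperties": True},
        "produces": {"additionalProperties": True},
    }
)
```

With this shape, code such as the Python component path can call the constructor directly, while the YAML path goes through `from_dict`.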


Member

This is testing the pipeline SDK, so should probably move to tests/pipeline

Member

Actually, this is the case for the python_component and Image as well. I think these should be part of the pipeline SDK.

@GeorgesLorre GeorgesLorre force-pushed the feature/python-component branch from 6c23966 to 308e59d Compare January 11, 2024 22:16
Contributor

@mrchtr mrchtr left a comment

I like it! It looks really promising, and I think it is great for usability and quick experiments.

The decorator gives us some additional power that we should use, imo. I would try to add useful validation and hints to guarantee a low entry barrier for users.

We can probably tackle this in follow-up PRs :)

def __init__(self, **kwargs):
pass

def load(self) -> dd.DataFrame:
Contributor

Probably too detailed for the first draft, but I think we should already start thinking about such edge cases. I guess it would make sense to implement a validation of the class. Looking at the interface, it is possible to use any class that inherits from BaseComponent as ref. Even if we restrict the interface to Load, Transform, Write (see my comment above), a user might run into issues if they don't implement the correct function.

Practically speaking, if I create, for instance, the following class:

class CreateData(...):
    def some_function(self):
        ...

without implementing a load method, it would be nice to get a validation/exception message when I try to execute the pipeline, stating that the required load function wasn't implemented.
I think we could add this validation to the decorator implementation.

class Image:
    base_image: t.Optional[str] = "fondant:latest"
    extra_requires: t.Optional[t.List[str]] = None
    script: t.Optional[str] = None
Contributor

I guess the script is needed for the sagemaker runner?

Member

For every runner, since the script is not baked into the image. In the docker & KfP runner, we'll probably include the script in the entrypoint instead of uploading it separately though.

def wrapper(cls):
image = Image(
base_image=base_image,
extra_requires=extra_requires,
Contributor

The script parameter is not used here. Is this correct?

Member

@@ -325,7 +339,7 @@ def register_operation(

def read(
self,
name_or_path: t.Union[str, Path],
ref: t.Union[str, Path, t.Type["BaseComponent"]],
Contributor

I think we can restrict the types here. Read should only allow LoadComponent, apply a XTransformComponent and write a DaskWriteComponent.
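One way the suggested restriction could be enforced at call time is sketched below. The class names are stand-ins defined only for this example (they are not imported from Fondant), and `validate_ref` is a hypothetical helper; string and path references are skipped because they point at YAML-based components.

```python
import inspect
from pathlib import Path
from typing import Type


class BaseComponent:
    """Stand-in for the common component base class."""


class LoadComponent(BaseComponent):
    """Stand-in for a component that implements load()."""


class WriteComponent(BaseComponent):
    """Stand-in for a component that implements write()."""


def validate_ref(ref, expected: Type[BaseComponent], method: str) -> None:
    """Raise TypeError if a class ref is not a subclass of the expected type."""
    # Names and paths refer to YAML-based components and are not checked here.
    if isinstance(ref, (str, Path)):
        return
    if not (inspect.isclass(ref) and issubclass(ref, expected)):
        raise TypeError(
            f"{method}() expects a subclass of {expected.__name__}, got {ref!r}"
        )


class MyLoader(LoadComponent):
    def load(self):
        return []


class MyWriter(WriteComponent):
    def write(self, dataframe):
        pass


validate_ref(MyLoader, LoadComponent, "read")  # passes silently
validate_ref("some_reusable_component", LoadComponent, "read")  # skipped
```

Passing `MyWriter` to the `read` check would raise a `TypeError` naming the expected base class, which gives the user feedback before the pipeline runs.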

},
index=pd.Index(["a", "b", "c"], name="id"),
)
return dd.from_pandas(df, npartitions=1)
Contributor

I would try to add a validation for the return type as well. If a user doesn't return a dask or pandas dataframe in a custom component, it will not work. It's not related to your code; we already have the same requirement for the custom components. But here we are lowering the entry barrier, and I can imagine that people will no longer read the documentation in detail.

So I think it might be good if we could catch somehow if it is a wrong return type. I guess it should be possible within the decorator implementation as well.
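A runtime check of the returned object, as opposed to a check of the annotation, could be sketched like this. `validate_return_type` is a hypothetical helper, and `FakeDataFrame` is a stand-in so the example has no dependencies; in practice the expected types would presumably be the dask and pandas dataframe classes.

```python
import functools


def validate_return_type(*expected_types):
    """Wrap a method so that a wrong return type fails with a clear message."""

    def decorator(func):
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            result = func(*args, **kwargs)
            if not isinstance(result, expected_types):
                names = ", ".join(t.__name__ for t in expected_types)
                raise TypeError(
                    f"{func.__qualname__} returned {type(result).__name__}; "
                    f"expected one of: {names}"
                )
            return result

        return wrapped

    return decorator


class FakeDataFrame:
    """Stand-in for a dask or pandas dataframe class."""


class GoodLoader:
    @validate_return_type(FakeDataFrame)
    def load(self):
        return FakeDataFrame()


class BadLoader:
    @validate_return_type(FakeDataFrame)
    def load(self):
        return "not a dataframe"
```

Calling `GoodLoader().load()` returns the object unchanged, while `BadLoader().load()` raises a `TypeError`. A class decorator could apply this wrapper to `load` automatically, so users would not need to add it themselves.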

@mrchtr
Contributor

mrchtr commented Jan 12, 2024

The decorator gives us some additional power that we should use, imo. I would try to add useful validation and hints to guarantee a low entry barrier for users.

To follow up on this: I've played around a bit with the decorator and validations. Maybe it helps.

from typing import TypeVar, Callable

import pandas as pd

T = TypeVar('T')
REQUIRED_METHODS = {"load": [pd.DataFrame]}

def LightWeightComponent(cls):
    class Wrapped(cls):
        def __init__(self, *args, **kwargs):
            for method, types in REQUIRED_METHODS.items():
                attr = self._valid_method_exists(method)
                for _type in types:
                    self._validate_return_type(_type, attr)

        def _valid_method_exists(self, name):
            if not hasattr(self, name):
                raise AttributeError(
                    f"{self.__class__.__name__} Function is missing: {name}")
            else:
                return self.__getattribute__(name)

        def _validate_return_type(self, expected_type: type, func: Callable[..., T]):
            return_type = func.__annotations__["return"]
            if not return_type == expected_type:
                raise AttributeError(
                    f"{return_type} is wrong return type in: {func}. Expected {expected_type}")


    return Wrapped

@LightWeightComponent
class BaseComponent:
    def load(self) -> str:
        pass

@LightWeightComponent
class SubClass(BaseComponent):
    def load(self) -> pd.DataFrame:
        pass

@LightWeightComponent
class SecondSubClass(BaseComponent):
    def load(self) -> str:  # Raises an error during initialisation
        return "0"

if __name__ == '__main__':
    SubClass()
    SecondSubClass()

@RobbeSneyders
Member

+1 on the validation, would really help to give immediate feedback instead of only at runtime. Would keep it out of this PR though, so we can merge it and work in parallel on the remaining work.

return cls(component_spec_dict)

@classmethod
def from_args(
Member

I'm still in favor of making this the __init__ method, and storing the different parts of the spec as different attributes instead of converting to a spec.

@mrchtr
Contributor

mrchtr commented Jan 15, 2024

Would keep it out of this PR though, so we can merge it and work in parallel on the remaining work.

Yes indeed. I guess we could even collect some more validations. These were just the first edge cases that came to mind.

Member

@RobbeSneyders RobbeSneyders left a comment

Thanks @GeorgesLorre.

Only some small comments remaining. Looks good overall, looking forward to merging this once the tests are fixed 🚀

@GeorgesLorre GeorgesLorre force-pushed the feature/python-component branch from e16e6b1 to e9c53c6 Compare January 16, 2024 16:02
@GeorgesLorre GeorgesLorre marked this pull request as ready for review January 16, 2024 16:03
@GeorgesLorre
Collaborator Author

Would keep it out of this PR though, so we can merge it and work in parallel on the remaining work.

Yes indeed. I guess we could even collect some more validations. These were just the first edge cases that came to mind.

Thx for the great ideas, @mrchtr! Will take this up in subsequent PRs!

@GeorgesLorre GeorgesLorre force-pushed the feature/python-component branch from 118e658 to e137632 Compare January 16, 2024 16:16
@GeorgesLorre GeorgesLorre changed the title Poc for python components support Python components support Jan 16, 2024
Member

@RobbeSneyders RobbeSneyders left a comment

Thanks @GeorgesLorre!

Some minor comments, but let's tackle those in a follow-up PR.


class PythonComponent:
    @classmethod
    def image(cls) -> Image:
Member

Any reason for this not to be a (class)property?
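For context on the question above: Python has no built-in class property, and stacking `@classmethod` on top of `@property` was deprecated in Python 3.11 and removed in 3.13, so a small custom descriptor is one common way to get that behaviour. The sketch below is an illustration only; the `classproperty` name and the `_base_image` attribute are assumptions, not part of this PR.

```python
class classproperty:
    """Minimal read-only property that is evaluated on the class."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner=None):
        # Called for both Class.attr (instance is None) and instance.attr.
        if owner is None:
            owner = type(instance)
        return self.fget(owner)


class PythonComponent:
    _base_image = "fondant:latest"

    @classproperty
    def image(cls):
        return cls._base_image


class CustomComponent(PythonComponent):
    _base_image = "python:3.8-slim-buster"
```

With this descriptor, `PythonComponent.image` and `CustomComponent.image` can be read without parentheses. The trade-off is an extra helper to maintain, whereas a plain classmethod such as `image()` needs nothing beyond the standard decorators.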


component_spec = ComponentSpec(
name,
image.base_image, # TODO: revisit
Member

What needs to be revisited about this?

Comment on lines +390 to +426
if inspect.isclass(ref) and issubclass(ref, PythonComponent):
    name = ref.__name__
    image = ref.image()
    description = ref.__doc__ or "python component"

    component_spec = ComponentSpec(
        name,
        image.base_image,  # TODO: revisit
        description=description,
        consumes={"additionalProperties": True},
        produces={"additionalProperties": True},
    )

    operation = ComponentOp(
        name,
        image,
        component_spec,
        produces=produces,
        arguments=arguments,
        input_partition_rows=input_partition_rows,
        resources=resources,
        cache=cache,
        cluster_type=cluster_type,
        client_kwargs=client_kwargs,
    )

else:
    operation = ComponentOp.from_component_yaml(
        ref,
        produces=produces,
        arguments=arguments,
        input_partition_rows=input_partition_rows,
        resources=resources,
        cache=cache,
        cluster_type=cluster_type,
        client_kwargs=client_kwargs,
    )
Member

This code is almost identically repeated three times. Would be good if we can abstract this away.
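One possible shape for that abstraction is a single helper that owns the branch and is called from each place that currently repeats it (presumably the read, apply, and write methods). The sketch below uses simplified stand-in classes and a hypothetical `build_operation` function; it leaves out the ComponentSpec construction for brevity and is not the refactoring that was eventually done.

```python
import inspect
from dataclasses import dataclass


@dataclass
class Image:
    # Stand-in for the Image dataclass from this PR.
    base_image: str = "fondant:latest"


class PythonComponent:
    """Stand-in marker class for lightweight Python components."""

    @classmethod
    def image(cls) -> Image:
        return Image()


class ComponentOp:
    """Stand-in that only records how it was constructed."""

    def __init__(self, name, image, **kwargs):
        self.name = name
        self.image = image
        self.kwargs = kwargs

    @classmethod
    def from_component_yaml(cls, path, **kwargs):
        return cls(str(path), None, **kwargs)


def build_operation(ref, **kwargs) -> ComponentOp:
    """Hypothetical helper holding the branch shared by all call sites."""
    if inspect.isclass(ref) and issubclass(ref, PythonComponent):
        # Lightweight Python component: derive name and image from the class.
        return ComponentOp(ref.__name__, ref.image(), **kwargs)
    # Otherwise treat ref as a component name or a path to a component directory.
    return ComponentOp.from_component_yaml(ref, **kwargs)


class MyComponent(PythonComponent):
    """Toy component used only for this sketch."""


python_op = build_operation(MyComponent, cache=False)
yaml_op = build_operation("some_reusable_component", cache=True)
```

Each call site would then forward its keyword arguments to the helper, so the argument list is spelled out once rather than in every branch.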


@dataclass
class Image:
base_image: str = "fondant:latest"
Member

We should link this to the installed version of Fondant once we add the CI/CD for this.
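A sketch of how the default tag could follow the installed package version, using the standard library's `importlib.metadata`. The `default_base_image` helper, the `fondant` repository name, and the fallback to `latest` are assumptions for illustration; the actual image name would depend on the CI/CD setup mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def default_base_image(package: str = "fondant", repository: str = "fondant") -> str:
    """Return '<repository>:<installed version>', or ':latest' if not installed."""
    try:
        tag = version(package)
    except PackageNotFoundError:
        # Assumed fallback, e.g. when running from a source checkout.
        tag = "latest"
    return f"{repository}:{tag}"
```

The result could then serve as the default for `Image.base_image`, which would also keep the default configured in one place.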

Comment on lines +239 to +242
if self._is_custom_component(path):
    component_dir = Path(path)
else:
    component_dir = self._get_registry_path(str(path))
Member

I guess this change can be reverted? Although I would still prefer it if we can find a way to remove the component_dir argument from the __init__ method.

@RobbeSneyders RobbeSneyders linked an issue Jan 16, 2024 that may be closed by this pull request
@RobbeSneyders RobbeSneyders changed the title Python components support Support applying Lightweight Python components in Pipeline SDK Jan 16, 2024
@RobbeSneyders RobbeSneyders merged commit abcd36f into main Jan 16, 2024
8 of 9 checks passed