-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support applying Lightweight Python components in Pipeline SDK #770
Conversation
src/fondant/pipeline/pipeline.py
Outdated
"additionalProperties": True, | ||
}, | ||
} | ||
component_spec = ComponentSpec(spec_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could possibly be refactored by creating and operation_spec directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently one of the main jobs of the ComponentOp
. If we would do this here, we would need to do it in different places for the docker components as well. I think it makes more sense to do this in the ComponentOp
than in the Dataset
class.
@@ -181,6 +179,22 @@ def __init__( | |||
|
|||
self.resources = resources or Resources() | |||
|
|||
@classmethod | |||
def from_component_yaml(cls, path, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will be the entrypoint for reusable and custom components (the ones that the python_component does not cover)
src/fondant/component/image.py
Outdated
|
||
|
||
@dataclass | ||
class Image: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs way more functionality to actually make the images with extra dependencies and the script available
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that the job of the runners?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but I guess we'll have shared functionality between runners that could be abstracted here
src/fondant/component/component.py
Outdated
|
||
return PythonComponent | ||
|
||
return wrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works but I'm not completely convinced this is the way forward. Essentially we subclass the Component and add an image property with the details provided in the decorator arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other possibility would be to make the original component an attribute of the PythonComponent
. It seems to be a trade-off between deep inheritance and having the PythonComponent
act like a regular component.
Since I don't see any specific downsides to the deep inheritance in this case, I would be inclined to follow your subclass proposal here.
I can recommend this talk on python decorators: https://www.youtube.com/watch?v=MjHpMCIvwsY |
|
||
self.name = name | ||
self.image = image | ||
self.component_spec = component_spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to check further but maybe we can do without the component_spec here and use the operation_spec along with the image from this point on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose to unpack the component_spec
already, and accept the following arguments here:
- name
- image
- consumes
- produces
- arguments
That would also make it easier to test in Python code I believe.
Problem then is that we have consumes
, produces
, and arguments
twice, with different meaning. Not sure yet how to solve this in the cleanest way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below on a possible solution where we still accept a component_spec
here, but create the ComponentSpec
based on arguments instead of a dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeorgesLorre, starting to look good!
pipeline = Pipeline(name="dummy-pipeline", base_path="./data") | ||
|
||
@python_component( | ||
base_image="python:3.8-slim-buster", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we'll need to provide a base docker image with Fondant
installed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes we need a sane default which we ideally configure in 1 place
src/fondant/component/image.py
Outdated
|
||
|
||
@dataclass | ||
class Image: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that the job of the runners?
src/fondant/component/component.py
Outdated
|
||
return PythonComponent | ||
|
||
return wrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other possibility would be to make the original component an attribute of the PythonComponent
. It seems to be a trade-off between deep inheritance and having the PythonComponent
act like a regular component.
Since I don't see any specific downsides to the deep inheritance in this case, I would be inclined to follow your subclass proposal here.
|
||
self.name = name | ||
self.image = image | ||
self.component_spec = component_spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose to unpack the component_spec
already, and accept the following arguments here:
- name
- image
- consumes
- produces
- arguments
That would also make it easier to test in Python code I believe.
Problem then is that we have consumes
, produces
, and arguments
twice, with different meaning. Not sure yet how to solve this in the cleanest way.
src/fondant/pipeline/pipeline.py
Outdated
"additionalProperties": True, | ||
}, | ||
} | ||
component_spec = ComponentSpec(spec_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently one of the main jobs of the ComponentOp
. If we would do this here, we would need to do it in different places for the docker components as well. I think it makes more sense to do this in the ComponentOp
than in the Dataset
class.
src/fondant/pipeline/pipeline.py
Outdated
spec_dict = { | ||
"name": name, | ||
"description": "This is an example component", | ||
"image": "example_component:latest", | ||
"produces": { | ||
"additionalProperties": True, | ||
}, | ||
"consumes": { | ||
"additionalProperties": True, | ||
}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of creating a dict here.
Possible solution, which would also solve my comment on the ComponentOp
init arguments above, is that we refactor the ComponentSpec
class to take separate init arguments instead:
class ComponentSpec:
def __init__(self, name, image, consumes, produces, arguments):
...
@classmethod
def from_dict(cls, spec):
name, image, consumes, produces, arguments = unpack(spec)
return cls(name, image, consumes, produces, arguments)
|
||
self.name = name | ||
self.image = image | ||
self.component_spec = component_spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below on a possible solution where we still accept a component_spec
here, but create the ComponentSpec
based on arguments instead of a dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is testing the pipeline SDK, so should probably move to tests/pipeline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is the case for the python_component
and Image
as well. I think these should be part of the pipeline SDK.
6c23966
to
308e59d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! Looks really promising and I think adds it is great for the usability and quick experiments.
The decorator give use some additional power that we should use imo. I would try to add useful validation and hints to guarantee a low entry barrier for users.
We can probably tackle this in follow PRs :)
def __init__(self, **kwargs): | ||
pass | ||
|
||
def load(self) -> dd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably to detailed for the first draft but I think we should already consider thinking about such edge cases. I guess it would make sense to implement a validation of the class. Looking at the interface it is possible to use any class that inherits from the BaseComponent
as ref. Even if we restrict the interface to Load
, Transform
, Write
(see my comment above) a user might run into issue if he doesn't implement the correct function.
Practical speaking, if I create for instance following class:
class CreateData(...):
def some_function():
...
without the implementation of a load
method it would be nice get a validation/exception message when I try to execute the pipeline, that the required load
function wasn't implemented.
I think we could add this validation to the decorator implementation.
class Image: | ||
base_image: t.Optional[str] = "fondant:latest" | ||
extra_requires: t.Optional[t.List[str]] = None | ||
script: t.Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the script is needed for the sagemaker runner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For every runner, since the script is not baked into the image. In the docker & KfP runner, we'll probably include the script in the entrypoint instead of uploading it separately though.
def wrapper(cls): | ||
image = Image( | ||
base_image=base_image, | ||
extra_requires=extra_requires, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script parameter is not used here. Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it on my branch:
feature/python-component...feature/python-in-docker
src/fondant/pipeline/pipeline.py
Outdated
@@ -325,7 +339,7 @@ def register_operation( | |||
|
|||
def read( | |||
self, | |||
name_or_path: t.Union[str, Path], | |||
ref: t.Union[str, Path, t.Type["BaseComponent"]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can restrict the types here. Read should only allow LoadComponent
, apply a XTransformComponent
and write a DaskWriteComponent
.
}, | ||
index=pd.Index(["a", "b", "c"], name="id"), | ||
) | ||
return dd.from_pandas(df, npartitions=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would try to add a validation for the return type as well. If a user don't returns a dask or pandas dataframe in the custom components, it will not work. It's not related to your code, we have the same requirement already for the custom components. But a here we are lowering the entry barrier and I can imagine that people will not read the documentation in detail anymore.
So I think it might be good if we could catch somehow if it is a wrong return type. I guess it should be possible within the decorator implementation as well.
To follow up on this - I've played a bit around with the decorator and validations. Maybe it helps. from typing import TypeVar, Callable
import pandas as pd
T = TypeVar('T')
REQUIRED_METHODS = {"load": [pd.DataFrame]}
def LightWeightComponent(cls):
class Wrapped(cls):
def __init__(self, *args, **kwargs):
for method, types in REQUIRED_METHODS.items():
attr = self._valid_method_exists(method)
for _type in types:
self._validate_return_type(_type, attr)
def _valid_method_exists(self, name):
if not hasattr(self, name):
raise AttributeError(
f"{self.__class__.__name__} Function is missing: {name}")
else:
return self.__getattribute__(name)
def _validate_return_type(self, expected_type: type, func: Callable[..., T]):
return_type = func.__annotations__["return"]
if not return_type == expected_type:
raise AttributeError(
f"{return_type} is wrong return type in: {func}. Expected {expected_type}")
return Wrapped
@LightWeightComponent
class BaseComponent:
def load(self) -> str:
pass
@LightWeightComponent
class SubClass(BaseComponent):
def load(self) -> pd.DataFrame:
pass
@LightWeightComponent
class SecondSubClass(BaseComponent):
def load(self) -> str: # Through error during initialisation
return "0"
if __name__ == '__main__':
SubClass()
SecondSubClass() |
+1 on the validation, would really help to give immediate feedback instead of only at runtime. Would keep it out of this PR though, so we can merge it and work in parallel on the remaining work. |
src/fondant/core/component_spec.py
Outdated
return cls(component_spec_dict) | ||
|
||
@classmethod | ||
def from_args( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still in favor of making this the __init__
method, and storing the different parts of the spec as different attributes instead of converting to a spec.
Yes indeed. I guess we could even collect some more validations. This were just the first edge cases which came into my mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeorgesLorre.
Only some small comments remaining. Looks good overall, looking forward to merge this once the tests are fixed 🚀
e16e6b1
to
e9c53c6
Compare
Thx for the great ideas @mrchtr Will take this up in subsequent PR's! |
118e658
to
e137632
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeorgesLorre!
Some minor comments, but let's tackle those in a follow-up PR.
|
||
class PythonComponent: | ||
@classmethod | ||
def image(cls) -> Image: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for this not to be a (class)property?
|
||
component_spec = ComponentSpec( | ||
name, | ||
image.base_image, # TODO: revisit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What needs to be revisited about this?
if inspect.isclass(ref) and issubclass(ref, PythonComponent): | ||
name = ref.__name__ | ||
image = ref.image() | ||
description = ref.__doc__ or "python component" | ||
|
||
component_spec = ComponentSpec( | ||
name, | ||
image.base_image, # TODO: revisit | ||
description=description, | ||
consumes={"additionalProperties": True}, | ||
produces={"additionalProperties": True}, | ||
) | ||
|
||
operation = ComponentOp( | ||
name, | ||
image, | ||
component_spec, | ||
produces=produces, | ||
arguments=arguments, | ||
input_partition_rows=input_partition_rows, | ||
resources=resources, | ||
cache=cache, | ||
cluster_type=cluster_type, | ||
client_kwargs=client_kwargs, | ||
) | ||
|
||
else: | ||
operation = ComponentOp.from_component_yaml( | ||
ref, | ||
produces=produces, | ||
arguments=arguments, | ||
input_partition_rows=input_partition_rows, | ||
resources=resources, | ||
cache=cache, | ||
cluster_type=cluster_type, | ||
client_kwargs=client_kwargs, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is almost identically repeated three times. Would be good if we can abstract this away.
|
||
@dataclass | ||
class Image: | ||
base_image: str = "fondant:latest" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should link this to the installed version of Fondant once we add the CI/CD for this.
if self._is_custom_component(path): | ||
component_dir = Path(path) | ||
else: | ||
component_dir = self._get_registry_path(str(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this change can be reverted? Although I would sill prefer it we can find a way to remove the component_dir
argument from the __init__
method.
No description provided.