Support applying Lightweight Python components in Pipeline SDK #750

Closed
Tracked by #558
RobbeSneyders opened this issue Jan 2, 2024 · 5 comments · Fixed by #770
Labels: Core (Core framework)

@RobbeSneyders
Member

RobbeSneyders commented Jan 2, 2024

We want to be able to apply Lightweight Python components as part of a pipeline just like we do with docker components.

from fondant.component import PandasTransformComponent
from fondant.pipeline import Pipeline

class MyComponent(PandasTransformComponent):

    def __init__(...):
        ...

    def transform(self, dataframe):
        ...

pipeline = Pipeline(...)

dataset = pipeline.read(...)

dataset.apply(
    MyComponent,
    consumes={},
    produces={},
    arguments={},
)

Lightweight Python components will require some additional arguments compared to docker components. I can think of two already:

  • image: docker image to run the code in.
  • dependencies: additional Python dependencies to install in the container before executing.

I see two options:

  • We extend the apply (and read and write) methods on the dataset / pipeline to support these arguments. Since they are not relevant for docker components, we might then want to split this into two separate apply methods for clarity.
  • We add a decorator on the Lightweight Python component which defines these options:
    from fondant.pipeline import LightWeightComponent
    
    @LightWeightComponent(image=..., dependencies=[...])
    class MyComponent(PandasTransformComponent):
        ...

I think I would prefer the second option so we can keep a single apply interface.
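A minimal sketch of how such a decorator could work, assuming it only needs to record its options on the class so that `apply` can read them later. The names `lightweight_component`, `LightweightConfig`, and `lightweight_config`, as well as the default image, are hypothetical and not part of Fondant.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LightweightConfig:
    """Hypothetical container for the options passed to the decorator."""

    image: str
    dependencies: List[str] = field(default_factory=list)


def lightweight_component(
    image: str = "python:3.10",  # assumed default, not Fondant's actual default
    dependencies: Optional[List[str]] = None,
):
    """Hypothetical class decorator that attaches image and dependencies to a class."""

    def wrapper(cls):
        # Store the options on the class itself; the class is returned unchanged otherwise.
        cls.lightweight_config = LightweightConfig(
            image=image,
            dependencies=list(dependencies or []),
        )
        return cls

    return wrapper


@lightweight_component(image="python:3.11", dependencies=["pandas"])
class MyComponent:
    def transform(self, dataframe):
        return dataframe
```

Because the decorator returns the original class, the single `apply` interface can accept it as-is and look up `MyComponent.lightweight_config` when building the component.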

@GeorgesLorre
Collaborator

GeorgesLorre commented Jan 3, 2024

I also think a decorator will be the cleanest and clearest way of achieving this. This means that an .apply() can take a string (reusable component), path (local custom component) or a decorated class.
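As a rough illustration of the three kinds of reference, `apply` could dispatch on the type of its first argument. This is only a sketch: the function name `classify_reference` and the rule "an existing directory is a local component" are assumptions, not Fondant's actual logic.

```python
import os
from pathlib import Path
from typing import Union


def classify_reference(ref: Union[str, Path, type]) -> str:
    """Return which kind of component a reference passed to `apply` points to."""
    if isinstance(ref, type):
        # A class object: treat it as a Lightweight Python component.
        return "lightweight"
    if isinstance(ref, Path) or os.path.isdir(ref):
        # An explicit path, or a string naming an existing directory: local custom component.
        return "local"
    # Any other string is assumed to be the name of a reusable component.
    return "reusable"
```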

We may also need to think about the eager execution interface.

This is what I have now (calling .execute()):

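import dask.dataframe as dd
import pyarrow as pa
from fondant.component import DaskLoadComponent
from fondant.pipeline import Pipeline

pipeline = Pipeline(name="foo", description="bar", base_path="/foobar")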

class LoadFromParquett(DaskLoadComponent):
    def __init__(self, *_, **__):
        pass
    def load(self) -> dd.DataFrame:
        dask_df = dd.read_parquet("./foobar/sample.parquet", columns=["x", "y"])
        return dask_df

dataset1 = pipeline.execute(
    component=LoadFromParquett,
    produces={"x": pa.int32(), "y": pa.int32()},
)

But it might be nicer to keep .apply() and call .execute() on top of it:

dataset1 = pipeline.apply(
    LoadFromParquett,
    produces={"x": pa.int32(), "y": pa.int32()},
)

dataset1.execute(override_df=a_df)

That way we keep the pipeline definition as is and can do iterative development by executing individual components while building the pipeline.

@RobbeSneyders
Member Author

> I also think a decorator will be the cleanest and clearest way of achieving this. This means that an .apply() can take a string (reusable component), path (local custom component) or a decorated class.

+1

> but it might be nicer to keep .apply() and calling .execute() on top of it

Agree with keeping .apply(). I was thinking more of an environment variable or argument to enable eager execution, but there are multiple options.

  • Environment variable: I like that the code is unchanged and can easily just be run non-eager as well.
  • Argument: I guess this argument should go on the pipeline. Downside is that the code is changed, but it is clear and explicit.
  • execute(): Similar to Dask, requires code changes. I guess we then also need to figure out which part of the graph needs to be executed (Dask executes from the start). What if a previous component was not executed?

One other thing is that if we want to support Eager execution on different runners, we need to know the runner up front. In Apache Beam for instance, you pass the runner to the pipeline when instantiating.
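The environment variable and argument options could also be combined, with the explicit argument taking precedence. The sketch below assumes a hypothetical variable name `FONDANT_EAGER` and a hypothetical helper `eager_enabled`; neither exists in Fondant.

```python
import os
from typing import Mapping, Optional

EAGER_ENV_VAR = "FONDANT_EAGER"  # hypothetical variable name


def eager_enabled(
    explicit: Optional[bool] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide whether to run eagerly: an explicit argument wins over the environment."""
    if explicit is not None:
        return explicit
    env = os.environ if environ is None else environ
    # Accept a few common truthy spellings; anything else means non-eager.
    return env.get(EAGER_ENV_VAR, "").strip().lower() in {"1", "true", "yes"}
```

With this shape, the pipeline code stays unchanged when only the environment variable is used, while an explicit argument remains available for cases where being explicit matters more.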

@RobbeSneyders
Member Author

I think this ticket can be limited to loading a Python component into a ComponentOp class.

If we look at the ComponentOp class, it requires two types of arguments to be instantiated:

  • A reference to the component
  • The keyword arguments from the apply function

Since we keep the same apply interface, we can just use the keyword arguments directly as well. What we need to implement is the reference to the component for Python components. For dockerized components, this reference is only used to get the component spec.

So this ticket needs to be able to translate a Python component into a component spec, which consists of the following necessary elements:

  • A name
    we can get the name from the component class.

  • A docker image
    If this is not a pre-built docker image, it actually consists of three parts:

    • The base image
      Provided by the user via the decorator as mentioned above, with a default provided by Fondant.
    • The dependencies
      Provided by the user via the decorator as mentioned above.
    • The script to execute
      The Python implementation of the component converted to a self-contained script (KfP example)

    We might want to introduce an abstraction in our code base which can contain either a pre-built image, or these different parts. The runners then need to support executing the different implementations of this abstraction.

  • Consumes / produces
    In the long term, we can try to infer this as much as possible (Validate consumes and infer produces for Lightweight Python components #752), but I think we can start with a default of additionalProperties: true and expect the user to override this with the consumes and produces keywords on the apply method.

  • Arguments
    Inferring the arguments should be doable based on the signature of the component (Infer the arguments based on component __init__ arguments #751).
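Two of the elements above can be sketched with the standard library alone: an abstraction that holds either a pre-built image or the three separate parts, and argument inference from the `__init__` signature. All class and function names below are hypothetical, and plain dictionaries stand in for whatever argument type the component spec actually uses.

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PrebuiltImage:
    """Hypothetical: an image that already contains the component code."""

    reference: str


@dataclass
class LightweightImage:
    """Hypothetical: the three parts a runner would combine at execution time."""

    base_image: str
    dependencies: List[str] = field(default_factory=list)
    script: str = ""


def infer_arguments(component_cls: type) -> Dict[str, Dict[str, Any]]:
    """Map each named `__init__` parameter to its annotation and default."""
    arguments: Dict[str, Dict[str, Any]] = {}
    signature = inspect.signature(component_cls.__init__)
    for name, param in signature.parameters.items():
        # Skip `self` and catch-all parameters such as *args / **kwargs.
        if name == "self" or param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        ):
            continue
        has_default = param.default is not inspect.Parameter.empty
        arguments[name] = {
            "type": None if param.annotation is inspect.Parameter.empty else param.annotation,
            "default": param.default if has_default else None,
            "optional": has_default,
        }
    return arguments


class ExampleComponent:
    def __init__(self, batch_size: int, prefix: str = "img", **kwargs):
        self.batch_size = batch_size
        self.prefix = prefix
```

Calling `infer_arguments(ExampleComponent)` yields one entry for `batch_size` (required) and one for `prefix` (optional, default `"img"`), while `**kwargs` is ignored.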

@GeorgesLorre
Collaborator

Yes! This is very similar to what I wrote for the xmas demo:

https://github.com/RobbeSneyders/fondant-xmas/pull/3/files#diff-65a733e81e8bc30179d9f957e51a2c6e9e45bc2a4101d54d6d5072f98433a69aR703

Minus the docker image (my demo has eager execution)

RobbeSneyders added a commit that referenced this issue Jan 9, 2024
Fixes #751 

This PR introduces functionality to infer the arguments from a
`Component` class. The result is a dictionary with the argument names as
keys, and `Argument` instances as values, which is the format of
[`component_spec.args`.](https://github.com/ml6team/fondant/blob/8e828441eec8ff91074e5c8ccf16fe405b719594/src/fondant/core/component_spec.py#L193)

We can leverage this behavior for Lightweight Python components as
described in #750.

Did some TDD here, let me know if I missed any cases.
@GeorgesLorre
Collaborator

Image

@github-project-automation github-project-automation bot moved this from Ready for development to Done in Fondant development Jan 16, 2024