Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update components to implement the PandasTransformComponent #203

Closed
RobbeSneyders opened this issue Jun 14, 2023 · 1 comment · Fixed by #302
Closed

Update components to implement the PandasTransformComponent #203

RobbeSneyders opened this issue Jun 14, 2023 · 1 comment · Fixed by #302
Assignees
Labels
Components Implementation of components

Comments

@RobbeSneyders
Copy link
Member

As follow up of #200, we should update our reusable transform components to leverage the PandasTransformComponent

@RobbeSneyders RobbeSneyders converted this from a draft issue Jun 14, 2023
@RobbeSneyders RobbeSneyders added the Components Implementation of components label Jun 14, 2023
@RobbeSneyders RobbeSneyders mentioned this issue Jun 14, 2023
3 tasks
RobbeSneyders added a commit that referenced this issue Jun 14, 2023
Fixes #183 

There's some todo's left, we should
- [ ] Look into the redefinition of the divisions after we clear them.
Now that we take this out of the hands of the user, we should define
which strategy we want to follow here. (#205)
- [ ] Move to `hierarchical columns`. Pandas can work with hierarchical
columns, which would be a lot nicer as a user interface. I want to check
if I can make this work with Dask, and otherwise move the translation
from underscored names to hierarchical columns and back at the level of
the `PandasTransformComponent` (#204)
- [ ] Update the reusable components to leverage the
`PandasTransformComponent` (#203)
@RobbeSneyders RobbeSneyders self-assigned this Jun 16, 2023
@RobbeSneyders RobbeSneyders moved this from Ready for development to In Progress in Fondant development Jun 16, 2023
RobbeSneyders added a commit that referenced this issue Jun 20, 2023
…omponent (#219)

Contributes to #203 

This PR updates the reusable components of the stable diffusion /
controlnet pipeline to use the new Pandas interface. I submit these
separately since I've already tested them.

Some simplifications are clear:
- No more `map_partitions` needed for batching
- No more `meta` needed for `apply` or similar methods

Since we now set the `meta` of the `map_partitions` in the
`PandasTransformComponent` based on the component spec, the returned
dataframes are validated quite strictly. It does not allow any
additional columns.

The Pandas implementation can probably be improved further as I'm not
the biggest Pandas pro.
@RobbeSneyders
Copy link
Member Author

#219 did this for all relevant components in the Controlnet and Stable Diffusion pipelines. We still need to do this for the others.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Fondant development Jul 18, 2023
satishjasthi pushed a commit to satishjasthi/fondant that referenced this issue Jul 21, 2023
This PR follows up on the PoC presented in ml6team#268

---

Fixes ml6team#257 

It splits the implementation and execution of components, this has some
advantages:

- Pandas components can use `__init__` instead of setup, which is
probably more familiar to users
- Other components can use `__init__` as well instead of receiving all
arguments to their transform or equivalent method, aligning
implementation of different component types
- Component implementation and execution should be easier to test
separately

I borrowed the executor terminology from KfP.

---

Fixes ml6team#203 

Since I had to update all the components, I also switched some of them
to subclass `PandasTransformComponent` instead of
`DaskTransformComponent`.

---

These changes open some opportunities for additional improvements, but I
propose to tackle those as separate PRs as this PR is already quite huge
due to all the changes to the components.

- [ ] ml6team#300
- [ ] ml6team#301
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Fixes #183 

There's some todo's left, we should
- [ ] Look into the redefinition of the divisions after we clear them.
Now that we take this out of the hands of the user, we should define
which strategy we want to follow here. (#205)
- [ ] Move to `hierarchical columns`. Pandas can work with hierarchical
columns, which would be a lot nicer as a user interface. I want to check
if I can make this work with Dask, and otherwise move the translation
from underscored names to hierarchical columns and back at the level of
the `PandasTransformComponent` (#204)
- [ ] Update the reusable components to leverage the
`PandasTransformComponent` (#203)
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
…omponent (#219)

Contributes to #203 

This PR updates the reusable components of the stable diffusion /
controlnet pipeline to use the new Pandas interface. I submit these
separately since I've already tested them.

Some simplifications are clear:
- No more `map_partitions` needed for batching
- No more `meta` needed for `apply` or similar methods

Since we now set the `meta` of the `map_partitions` in the
`PandasTransformComponent` based on the component spec, the returned
dataframes are validated quite strictly. It does not allow any
additional columns.

The Pandas implementation can probably be improved further as I'm not
the biggest Pandas pro.
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
This PR follows up on the PoC presented in #268

---

Fixes #257 

It splits the implementation and execution of components, this has some
advantages:

- Pandas components can use `__init__` instead of setup, which is
probably more familiar to users
- Other components can use `__init__` as well instead of receiving all
arguments to their transform or equivalent method, aligning
implementation of different component types
- Component implementation and execution should be easier to test
separately

I borrowed the executor terminology from KfP.

---

Fixes #203 

Since I had to update all the components, I also switched some of them
to subclass `PandasTransformComponent` instead of
`DaskTransformComponent`.

---

These changes open some opportunities for additional improvements, but I
propose to tackle those as separate PRs as this PR is already quite huge
due to all the changes to the components.

- [ ] #300
- [ ] #301
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Components Implementation of components
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant