
Request: Ability to pass large pandas dataframes between pipeline components (without creating artifacts) #725

Open
joeswashington opened this issue Sep 2, 2021 · 4 comments

Comments

@joeswashington
We would like the ability to pass the result of a pandas dataframe operation from one pipeline component to another without having to create an input/output artifact.

As it stands, we have to write the dataframe to a CSV file in one component and read it back in the other component, which is slow.
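
For context, here is roughly what the current workaround looks like with KFP v1 lightweight Python components; the component functions, base image, and toy dataframe below are illustrative, not the reporter's actual code:

```python
from kfp import dsl
from kfp.components import InputPath, OutputPath, create_component_from_func


def produce_table(output_csv_path: OutputPath('CSV')):
    """Write the dataframe to the path KFP provides; it becomes an output artifact."""
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3]})          # stand-in for the real transformation
    df.to_csv(output_csv_path, index=False)


def consume_table(input_csv_path: InputPath('CSV')):
    """Read the upstream component's artifact back into a dataframe."""
    import pandas as pd
    df = pd.read_csv(input_csv_path)
    print(df.describe())


produce_op = create_component_from_func(
    produce_table, base_image='python:3.9', packages_to_install=['pandas'])
consume_op = create_component_from_func(
    consume_table, base_image='python:3.9', packages_to_install=['pandas'])


@dsl.pipeline(name='dataframe-passing')
def dataframe_pipeline():
    producer = produce_op()
    # KFP strips the '_path' suffix, so the output is exposed as 'output_csv'.
    consume_op(producer.outputs['output_csv'])
```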

@pugangxa
Contributor

pugangxa commented Sep 9, 2021

What do you mean by passing the results of a pandas dataframe? If the data only lives inside Python, I think you should include both steps in the same component.
Tekton supports passing data with results or workspaces, and KFP supports passing artifacts; this is the standard way of sharing data between components, so maybe consider how to split your logic along those lines.
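
A rough sketch of the "same component" suggestion, with an illustrative function body: when both steps live in one lightweight component, the intermediate dataframe stays in process memory and nothing is serialized between tasks.

```python
from kfp.components import create_component_from_func


def prepare_and_train():
    """Run the dataframe transformation and the downstream step in one container."""
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3]})   # stand-in for the real data
    df['b'] = df['a'] * 2                 # intermediate result, never written to disk
    print(df['b'].sum())                  # downstream logic runs in the same process


prepare_and_train_op = create_component_from_func(
    prepare_and_train, base_image='python:3.9', packages_to_install=['pandas'])
```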

@Tomcli
Member

Tomcli commented Sep 9, 2021

For @joeswashington's use case, we would probably need to invent a new custom task controller that does something similar to Spark, where the output of a pipeline task can be kept in the Spark driver's memory. This kind of use case is usually addressed in the Spark community rather than in Tekton, so I would recommend running all the dataframe processing on a Spark cluster and using a KFP-Tekton component as the Spark client.
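
A hedged sketch of what that recommendation could look like: a lightweight component acting as a Spark client against an existing cluster, so the dataframe stays in the cluster's memory. The `master_url`, base image, and toy transformation are placeholders, and a real deployment would also need network access from the pod to the Spark master.

```python
from kfp.components import create_component_from_func


def spark_transform(master_url: str):
    """Connect to an existing Spark cluster and do the dataframe work there."""
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .master(master_url)                    # e.g. 'spark://spark-master:7077'
             .appName('kfp-tekton-spark-client')
             .getOrCreate())
    df = spark.range(0, 1000)                       # stand-in for the real dataframe
    df = df.withColumn('id_squared', df.id * df.id)
    print(df.agg({'id_squared': 'sum'}).collect())  # only the small result leaves Spark
    spark.stop()


spark_client_op = create_component_from_func(
    spark_transform, base_image='python:3.9', packages_to_install=['pyspark'])
```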

@Ark-kun

Ark-kun commented Nov 7, 2021

@joeswashington Are you sure your request is feasible?

The producer and consumer tasks probably run on different machines, so the producer needs to send the data out over the network and the consumer container needs to receive it from the network. The producer and consumer also run at different times (the consumer task only starts after the producer task finishes), so the data needs to be stored somewhere. The intermediate data storage is also important for cache reuse: you don't want to run the same data processing or training multiple times.

So, it looks like it's inevitable that the produced data is uploaded somewhere and downloaded when it needs to be consumed. You cannot really have a distributed system without passing data over network.

P.S. KFP has a way to seamlessly switch all data-passing to a Kubernetes volume, but we do not really see people using that feature. Kubernetes volumes are also accessed over the network...
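
For reference, the volume-based data passing mentioned above is configured in the upstream KFP v1 SDK roughly as below. The PVC name is a placeholder, and whether the kfp-tekton compiler honors `data_passing_method` is not confirmed here; this reuses the components from the earlier sketch.

```python
from kfp import compiler, dsl
from kfp.dsl import data_passing_methods
from kubernetes.client.models import V1Volume, V1PersistentVolumeClaimVolumeSource


@dsl.pipeline(name='dataframe-passing-over-volume')
def volume_pipeline():
    producer = produce_op()                          # components from the earlier sketch
    consume_op(producer.outputs['output_csv'])


conf = dsl.PipelineConf()
conf.data_passing_method = data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name='data',
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            claim_name='data-volume'),               # placeholder: a pre-created PVC
    ),
    path_prefix='artifact_data/',
)

compiler.Compiler().compile(volume_pipeline, 'pipeline.yaml', pipeline_conf=conf)
```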

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
