
Request: Ability to pass large pandas dataframes between pipeline components (without creating artifacts) #725

Open
joeswashington opened this issue Sep 2, 2021 · 4 comments

Comments

@joeswashington
We would like the ability to pass the result of a pandas dataframe operation from one pipeline component to another without having to create an input/output artifact.

As it stands, we have to write the dataframe to a CSV file in one component and read it back in the other component, which is slow.
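
For context, here is roughly what the current workaround looks like with KFP v1 lightweight Python components; the component functions, base image, and toy dataframe below are illustrative, not the reporter's actual code:

```python
from kfp import dsl
from kfp.components import InputPath, OutputPath, create_component_from_func


def produce_table(output_csv_path: OutputPath('CSV')):
    """Write the dataframe to the path KFP provides; it becomes an output artifact."""
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3]})          # stand-in for the real transformation
    df.to_csv(output_csv_path, index=False)


def consume_table(input_csv_path: InputPath('CSV')):
    """Read the upstream component's artifact back into a dataframe."""
    import pandas as pd
    df = pd.read_csv(input_csv_path)
    print(df.describe())


produce_op = create_component_from_func(
    produce_table, base_image='python:3.9', packages_to_install=['pandas'])
consume_op = create_component_from_func(
    consume_table, base_image='python:3.9', packages_to_install=['pandas'])


@dsl.pipeline(name='dataframe-passing')
def dataframe_pipeline():
    producer = produce_op()
    # KFP strips the '_path' suffix, so the output is exposed as 'output_csv'.
    consume_op(producer.outputs['output_csv'])
```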

@pugangxa
Contributor

pugangxa commented Sep 9, 2021

What do you mean by passing the results of a pandas dataframe? If the data only lives inside Python, I think you should include both steps in the same component.
Tekton supports passing data with results or workspaces, and KFP supports passing artifacts; this is the standard way of sharing data between components, so maybe consider how to split your logic along those lines.
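
A rough sketch of the "same component" suggestion, with an illustrative function body: when both steps live in one lightweight component, the intermediate dataframe stays in process memory and nothing is serialized between tasks.

```python
from kfp.components import create_component_from_func


def prepare_and_train():
    """Run the dataframe transformation and the downstream step in one container."""
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3]})   # stand-in for the real data
    df['b'] = df['a'] * 2                 # intermediate result, never written to disk
    print(df['b'].sum())                  # downstream logic runs in the same process


prepare_and_train_op = create_component_from_func(
    prepare_and_train, base_image='python:3.9', packages_to_install=['pandas'])
```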

@Tomcli
Member

Tomcli commented Sep 9, 2021

For @joeswashington's use case, we would probably need to invent a new custom task controller that does something similar to Spark, where the output of a pipeline task can be kept in the Spark driver's memory. This kind of use case is usually addressed in the Spark community rather than in Tekton, so I would recommend running all the dataframe processing on a Spark cluster and using a KFP-Tekton component as the Spark client.
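
A hedged sketch of what that recommendation could look like: a lightweight component acting as a Spark client against an existing cluster, so the dataframe stays in the cluster's memory. The `master_url`, base image, and toy transformation are placeholders, and a real deployment would also need network access from the pod to the Spark master.

```python
from kfp.components import create_component_from_func


def spark_transform(master_url: str):
    """Connect to an existing Spark cluster and do the dataframe work there."""
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .master(master_url)                    # e.g. 'spark://spark-master:7077'
             .appName('kfp-tekton-spark-client')
             .getOrCreate())
    df = spark.range(0, 1000)                       # stand-in for the real dataframe
    df = df.withColumn('id_squared', df.id * df.id)
    print(df.agg({'id_squared': 'sum'}).collect())  # only the small result leaves Spark
    spark.stop()


spark_client_op = create_component_from_func(
    spark_transform, base_image='python:3.9', packages_to_install=['pyspark'])
```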

@Ark-kun

Ark-kun commented Nov 7, 2021

@joeswashington Are you sure your request is feasible?

The producer and consumer tasks probably run on different machines, so the producer needs to send the data out over the network and the consumer container needs to receive it from the network. The producer and consumer also run at different times (the consumer task only starts after the producer task finishes), so the data needs to be stored somewhere. The intermediate data storage is also important for cache reuse: you don't want to run the same data processing or training multiple times.

So, it looks like it's inevitable that the produced data is uploaded somewhere and downloaded when it needs to be consumed. You cannot really have a distributed system without passing data over network.

P.S. KFP has a way to seamlessly switch all data-passing to a Kubernetes volume, but we do not really see people using that feature. Kubernetes volumes are also accessed over the network...
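
For reference, the volume-based data passing mentioned above is configured in the upstream KFP v1 SDK roughly as below. The PVC name is a placeholder, and whether the kfp-tekton compiler honors `data_passing_method` is not confirmed here; this reuses the components from the earlier sketch.

```python
from kfp import compiler, dsl
from kfp.dsl import data_passing_methods
from kubernetes.client.models import V1Volume, V1PersistentVolumeClaimVolumeSource


@dsl.pipeline(name='dataframe-passing-over-volume')
def volume_pipeline():
    producer = produce_op()                          # components from the earlier sketch
    consume_op(producer.outputs['output_csv'])


conf = dsl.PipelineConf()
conf.data_passing_method = data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name='data',
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            claim_name='data-volume'),               # placeholder: a pre-created PVC
    ),
    path_prefix='artifact_data/',
)

compiler.Compiler().compile(volume_pipeline, 'pipeline.yaml', pipeline_conf=conf)
```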

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
