Request: Ability to pass large pandas dataframes between pipeline components (without creating artifacts) #725

We would like to have the ability to pass the results of a pandas dataframe operation from one pipeline component to another without having to create an input/output artifact. As it stands, we have to write a CSV file in one component and read it back in the other, which is slow.
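For context, a minimal sketch of the CSV-artifact workaround described above, assuming the KFP v1 Python SDK; the function, component, and pipeline names are illustrative:

```python
from kfp import dsl
from kfp.components import InputPath, OutputPath, create_component_from_func

def produce_table(table_path: OutputPath('CSV')):
    # Stand-in for the real dataframe operation; KFP uploads the file
    # written to table_path as an output artifact.
    import pandas as pd
    pd.DataFrame({'a': [1, 2, 3]}).to_csv(table_path, index=False)

def consume_table(table_path: InputPath('CSV')):
    # KFP downloads the producer's artifact to table_path before this runs.
    import pandas as pd
    print(pd.read_csv(table_path).describe())

produce_op = create_component_from_func(
    produce_table, base_image='python:3.9', packages_to_install=['pandas'])
consume_op = create_component_from_func(
    consume_table, base_image='python:3.9', packages_to_install=['pandas'])

@dsl.pipeline(name='dataframe-passing-demo')
def demo_pipeline():
    # KFP strips the '_path' suffix, so the output is named 'table'.
    produced = produce_op()
    consume_op(produced.outputs['table'])
```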
Comments
What do you mean by passing the results of a pandas dataframe? If it's just internal to Python, I think you should include both steps in the same component.
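A minimal sketch of that suggestion, assuming the KFP v1 SDK: both steps run inside one component, so the dataframe stays in process memory and nothing is serialized. The function body is illustrative.

```python
from kfp.components import create_component_from_func

def process_in_one_step():
    # Both "components" collapsed into one function: the dataframe is
    # passed in memory between the two steps instead of through an artifact.
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2, 3]})   # stand-in for the first step
    print(df['a'].sum())                  # stand-in for the second step

process_op = create_component_from_func(
    process_in_one_step, base_image='python:3.9', packages_to_install=['pandas'])
```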
For @joeswashington's use case, we would probably need to invent a new custom task controller that does something similar to Spark, where the output of a pipeline task can be stored in the Spark driver's memory. This kind of use case is usually addressed in the Spark community rather than in Tekton, so I would recommend running all the dataframe processing on a Spark cluster and using a KFP-Tekton component as the Spark client.
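A hypothetical sketch of that recommendation: the component acts only as the Spark client, and the dataframes themselves stay inside the Spark cluster. The master URL and storage paths are assumptions, not real endpoints.

```python
from kfp.components import create_component_from_func

def run_spark_job(master_url: str):
    # The component is only the Spark driver/client; all dataframe
    # processing happens on the Spark cluster.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .master(master_url)                  # e.g. 'spark://host:7077' (assumed)
             .appName('kfp-tekton-spark-client')
             .getOrCreate())
    df = spark.read.parquet('s3a://bucket/input')                    # hypothetical input
    df.groupBy('key').count().write.parquet('s3a://bucket/output')   # hypothetical output
    spark.stop()

spark_client_op = create_component_from_func(
    run_spark_job, base_image='python:3.9', packages_to_install=['pyspark'])
```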
@joeswashington Are you sure your request is feasible? The producer and consumer tasks probably run on different machines, so the producer needs to send the data over the network and the consumer container needs to receive it from the network. The producer and consumer also run at different times (the consumer task only starts after the producer task finishes), so the data needs to be stored somewhere in the meantime. Intermediate data storage is also important for cache reuse: you don't want to repeat the same data processing or training multiple times. So it looks inevitable that the produced data is uploaded somewhere and downloaded when it needs to be consumed; you cannot really have a distributed system without passing data over the network.

P.S. KFP has a way to seamlessly switch all data passing to a Kubernetes volume, but we do not really see people using that feature. Kubernetes volumes are also accessed over the network...
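For reference, a sketch of how the volume-based data passing mentioned in the P.S. is configured in the KFP v1 SDK; the PVC name and path prefix here are assumptions:

```python
from kubernetes.client.models import V1PersistentVolumeClaimVolumeSource, V1Volume
from kfp.dsl import PipelineConf, data_passing_methods

pipeline_conf = PipelineConf()
# Route all artifact data through a Kubernetes volume instead of the
# default artifact store; the volume is still accessed over the network.
pipeline_conf.data_passing_method = data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name='data-volume',
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            claim_name='data-pvc'),  # hypothetical PVC
    ),
    path_prefix='artifact_data/',
)
```

The conf is then handed to the compiler, e.g. `kfp.compiler.Compiler().compile(pipeline_func, 'pipeline.yaml', pipeline_conf=pipeline_conf)`.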
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.