Optimizing pipeline workflows with persistent data storage methods #337

Open
hombit opened this issue Dec 22, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@hombit
Contributor

hombit commented Dec 22, 2023

In many practical cases, the pipeline has a (multi-)funnel structure: a large catalog is pre-filtered first, and the resulting dataset is analyzed further in subsequent stages. The first stage can be time-consuming (consider dozens of hours on a distributed cluster for ZTF DR), while the second stage may contribute less to the overall processing time. However, researchers may want to run the second stage multiple times with different parameters or steps, and they would not want to wait for the entire pipeline to run again. In that case, the researcher would like to store the results of the first pre-processing stage, sometimes in memory (e.g., to reuse in subsequent Jupyter cells) and sometimes on disk (e.g., to run a separate script).

Currently, we lack tools for such use cases, both for memory and disk scenarios.

  1. Ensemble.persist() would be useful to run the pipeline up to a given point, so that users could run the rest of the pipeline faster.
  2. Ensemble.to_parquet() would be useful to save all ensemble frames to disk.
  3. The previous method would require a counterpart that loads the serialized ensemble from the disk, including ColumnMapper and all associated result frames.
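To make the three points above concrete, here is a minimal, standard-library-only sketch of the intended behavior. Everything in it is hypothetical and is not the TAPE API: `ToyEnsemble`, its `then`, `persist`, `save`, and `load` methods, and the `manifest.json` layout are invented for illustration. JSON files stand in for Parquet so that the example runs without extra dependencies.

```python
import json
import tempfile
from pathlib import Path


class ToyEnsemble:
    """Hypothetical stand-in for an ensemble: named frames plus a column mapping."""

    def __init__(self, frames, column_map, pending=()):
        self.frames = frames            # {"frame name": [row dicts]}
        self.column_map = column_map    # plays the role of ColumnMapper metadata
        self._pending = list(pending)   # deferred steps, each maps frames -> frames

    def then(self, step):
        """Queue a step lazily; nothing is computed yet."""
        return ToyEnsemble(self.frames, self.column_map, self._pending + [step])

    def persist(self):
        """Run the queued steps once and keep the materialized frames in memory."""
        frames = self.frames
        for step in self._pending:
            frames = step(frames)
        return ToyEnsemble(frames, self.column_map)

    def save(self, path):
        """Write one file per frame, plus a manifest holding the metadata."""
        root = Path(path)
        (root / "frames").mkdir(parents=True, exist_ok=True)
        frames = self.persist().frames  # no-op if nothing is pending
        for name, rows in frames.items():
            (root / "frames" / f"{name}.json").write_text(json.dumps(rows))
        manifest = {"frames": sorted(frames), "column_map": self.column_map}
        (root / "manifest.json").write_text(json.dumps(manifest))

    @classmethod
    def load(cls, path):
        """Counterpart of save(): restore all frames and the column mapping."""
        root = Path(path)
        manifest = json.loads((root / "manifest.json").read_text())
        frames = {
            name: json.loads((root / "frames" / f"{name}.json").read_text())
            for name in manifest["frames"]
        }
        return cls(frames, manifest["column_map"])


calls = {"prefilter": 0}  # counts how often the expensive stage actually runs


def prefilter(frames):
    """Stand-in for the slow first stage: keep objects with at least 2 observations."""
    calls["prefilter"] += 1
    keep = {row["id"] for row in frames["object"] if row["nobs"] >= 2}
    return {
        "object": [r for r in frames["object"] if r["id"] in keep],
        "source": [r for r in frames["source"] if r["id"] in keep],
    }


raw = ToyEnsemble(
    frames={
        "object": [{"id": 1, "nobs": 3}, {"id": 2, "nobs": 1}],
        "source": [
            {"id": 1, "flux": 1.5},
            {"id": 1, "flux": 2.5},
            {"id": 1, "flux": 2.0},
            {"id": 2, "flux": 9.0},
        ],
    },
    column_map={"id_col": "id", "flux_col": "flux"},
)

# In-memory case: the expensive stage runs here, once.
stage1 = raw.then(prefilter).persist()

# The second stage can be repeated with different parameters without rerunning stage 1.
bright_counts = {
    threshold: sum(1 for r in stage1.frames["source"] if r["flux"] > threshold)
    for threshold in (1.0, 2.0)
}

# On-disk case: save everything, then restore it as a separate script would.
with tempfile.TemporaryDirectory() as tmp:
    stage1.save(tmp)
    restored = ToyEnsemble.load(tmp)
```

The manifest is what lets the loader rebuild the ensemble without guessing which files belong to it or how columns are mapped. A real implementation would presumably need the same kind of record for ColumnMapper and for the result frames.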

I'd like to use this issue to discuss the problem; let us open other, more technical issues for the implementations if needed.

@nevencaplar
Member

Ensemble.to_parquet() is captured in 151. We will prioritize that first; let us keep this ticket to capture proposed Ensemble.persist() work.

@hombit
Contributor Author

hombit commented Dec 26, 2023

Thank you, @nevencaplar, I commented there about result frames. What do you think about result frames loading and about metadata (de)serialization?

nevencaplar added the enhancement label Dec 26, 2023
@hombit
Contributor Author

hombit commented Dec 28, 2023

Related to #312

@nevencaplar
Member

> Thank you, @nevencaplar, I commented there about result frames. What do you think about result frames loading and about metadata (de)serialization?

I am putting it on the agenda for the next meeting. I know you indicated you may not be there, but I believe @dougbrn and @wilsonbb will have ideas about complexity.
