In many practical cases, the pipeline will have a (multi-)funnel structure: initially, a large catalog is pre-filtered, and in subsequent stages, the resulting dataset is analyzed further. The first stage can be time-consuming (think dozens of hours on a distributed cluster for ZTF DR), while the second stage may contribute much less to the overall processing time. However, researchers may want to run the second stage multiple times, with different parameters or steps, and they would not want to wait for the entire pipeline to run again. At this point, the researcher would like to store the results of the first pre-processing stage, sometimes in memory (e.g., to reuse in subsequent Jupyter cells), sometimes on disk (e.g., to run a separate script).
Currently, we lack tools for such use cases, both for memory and disk scenarios.
Ensemble.persist() would be useful to run the pipeline up to a given point, so that users could run the rest of the pipeline faster.
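Here's a rough sketch of the memory scenario using plain Dask, which the ensemble frames wrap; `Ensemble.persist()` could presumably delegate to something like this. The paths, column names, and filter condition below are just illustrative assumptions:

```python
# A minimal sketch of the in-memory scenario, assuming Ensemble.persist()
# would follow Dask's persist() semantics. Paths, column names, and the
# filter are illustrative, not a real ZTF DR schema.
import dask.dataframe as dd

# Stage 1: expensive pre-filtering of a large catalog (lazy so far).
catalog = dd.read_parquet("ztf_dr_sources/")  # hypothetical path
filtered = catalog[catalog["nobs"] > 50]      # hypothetical filter

# Materialize stage 1 once and keep the result in (distributed) memory.
filtered = filtered.persist()

# Stage 2 can now be re-run cheaply with different parameters,
# without repeating the pre-filtering.
result_a = filtered.groupby("object_id")["mag"].std().compute()
result_b = filtered.groupby("object_id")["mag"].mean().compute()
```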
Ensemble.to_parquet() would be useful to save all ensemble frames to disk.
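For the disk scenario, a minimal sketch of what `Ensemble.to_parquet()` might do; it assumes the ensemble exposes its frames as a name → Dask DataFrame mapping (`ens.frames`) and a `ColumnMapper` with a dict representation (`ens.column_mapper.map`). All of these names and the side-car `metadata.json` layout are assumptions for discussion, not an existing API:

```python
# Hypothetical sketch: write every frame of the ensemble (source, object,
# and result frames) as one Parquet dataset per frame, plus side-car
# metadata. ens.frames and ens.column_mapper.map are assumed attributes.
import json
import os

def ensemble_to_parquet(ens, path):
    os.makedirs(path, exist_ok=True)
    # One Parquet dataset per frame.
    for name, frame in ens.frames.items():
        frame.to_parquet(os.path.join(path, name))
    # Side-car metadata: frame names plus the ColumnMapper mapping,
    # so a loader can fully reconstruct the ensemble.
    meta = {"frames": list(ens.frames), "column_mapper": ens.column_mapper.map}
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump(meta, f)
```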
This method would require a counterpart that loads the serialized ensemble from disk, including the ColumnMapper and all associated result frames.
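And the hypothetical counterpart, reversing the sketch above; again, the function name and the metadata layout are assumptions:

```python
# Hypothetical loader: reverse of the ensemble_to_parquet() sketch above.
import json
import os
import dask.dataframe as dd

def ensemble_from_parquet(path):
    with open(os.path.join(path, "metadata.json")) as f:
        meta = json.load(f)
    # Re-open every frame that ensemble_to_parquet() wrote.
    frames = {
        name: dd.read_parquet(os.path.join(path, name))
        for name in meta["frames"]
    }
    # A real implementation would rebuild the ColumnMapper from
    # meta["column_mapper"] and attach the frames to a fresh Ensemble;
    # returning the raw pieces keeps this sketch library-agnostic.
    return frames, meta["column_mapper"]
```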
I'd like to keep this issue for discussing the problem; let's open separate, more technical issues for the implementations, if needed.
Thank you, @nevencaplar, I commented there about result frames. What do you think about loading result frames and about metadata (de)serialization?
I am putting it on the agenda for the next meeting. I know you indicated you may not be there, but I believe @dougbrn and @wilsonbb will have ideas about complexity.