This repository has been archived by the owner on Apr 25, 2023. It is now read-only.

enhanced parallelization #33

Open
bryantChhun opened this issue Jun 30, 2021 · 1 comment
bryantChhun commented Jun 30, 2021

problem

Each module (run_patch.py, run_segmentation.py, etc.) launches a process to parallelize across sites, but we often have a large dataset for a given site (high Z, T, or C count) and cannot parallelize within a site, for example across timepoints.

possible solutions

Each module uses its own worker class and the Python multiprocessing library to spawn new processes. If we instead use a process pool (either concurrent.futures.ProcessPoolExecutor or multiprocessing.Pool), we can pass a list of parameters to the executor, which will distribute the work across processes on its own.

Questions

This touches on two questions:

  • What data structure will we use? Do we want a single array (currently .npy)? If so, it needs to support concurrent writing. Or should patches be written to individual files?
  • At exactly what level do we parallelize the data? Should it be possible at the finest level (Z, T, C)?
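One way to sidestep the concurrent-write question is the "individual files per patch" option: if each worker owns a distinct output path keyed by its (site, T, C, Z) coordinates, no two processes ever write the same file and no locking is needed. The path scheme below is a hypothetical sketch, not the pipeline's actual naming convention:

```python
import tempfile
from pathlib import Path

# Hypothetical one-file-per-patch layout: the (site, t, c, z) key maps to a
# unique path, so parallel workers never contend for the same file.
def patch_path(root: Path, site: str, t: int, c: int, z: int) -> Path:
    return root / site / f"t{t:03d}_c{c:03d}_z{z:03d}.npy"

root = Path(tempfile.mkdtemp())
p = patch_path(root, "A1", t=0, c=1, z=12)
p.parent.mkdir(parents=True, exist_ok=True)
p.write_bytes(b"placeholder patch data")  # a real worker would use np.save
print(p.name)  # t000_c001_z012.npy
```

The trade-off is many small files versus one large array; a chunked format like .zarr splits the difference, since separate chunks can be written concurrently without coordination.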
bryantChhun commented
notes
We need to tackle the data format problem first (start a new issue).

  • for each stage of the pipeline, we should better define the input and output data types (file format, dimensionality, file name)
  • probability maps and 2k × 2k images can stay as large .zarr files

  1. data consistency
  2. parallelization
  3. zarr caching

can we avoid data duplication? Are there intermediate stages that can avoid data duplication?
