This repository has been archived by the owner on Apr 25, 2023. It is now read-only.

enhanced parallelization #33

Open
bryantChhun opened this issue Jun 30, 2021 · 1 comment
bryantChhun commented Jun 30, 2021

problem

Each module (run_patch.py, run_segmentation.py, etc.) launches a process to parallelize across sites, but we often have a large dataset for a given site (high Z, T, or C count) and cannot parallelize within a site, for example across timepoints.

possible solutions

Each module uses its own worker class and the Python multiprocessing library to spawn new processes. If we instead use a process pool (either concurrent.futures.ProcessPoolExecutor or multiprocessing.Pool), we can pass a list of parameters to the executor, which will distribute the work across processes on its own.

Questions

This touches on two questions:

  • What data structure will we use? Do we want a single array (currently .npy)? If so, it needs to support concurrent writing. Or should patches be written to individual files?
  • At exactly what level do we parallelize the data? Should it be possible at the finest level (Z, T, C)?
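One way to sidestep the concurrent-write question is the "individual files per patch" option: if each worker owns a distinct output path keyed by its (site, T, C, Z) coordinates, no two processes ever write the same file and no locking is needed. The path scheme below is a hypothetical sketch, not the pipeline's actual naming convention:

```python
import tempfile
from pathlib import Path

# Hypothetical one-file-per-patch layout: the (site, t, c, z) key maps to a
# unique path, so parallel workers never contend for the same file.
def patch_path(root: Path, site: str, t: int, c: int, z: int) -> Path:
    return root / site / f"t{t:03d}_c{c:03d}_z{z:03d}.npy"

root = Path(tempfile.mkdtemp())
p = patch_path(root, "A1", t=0, c=1, z=12)
p.parent.mkdir(parents=True, exist_ok=True)
p.write_bytes(b"placeholder patch data")  # a real worker would use np.save
print(p.name)  # t000_c001_z012.npy
```

The trade-off is many small files versus one large array; a chunked format like .zarr splits the difference, since separate chunks can be written concurrently without coordination.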
bryantChhun commented
notes
We need to tackle the data format problem first (start a new issue).

  • for each stage of the pipeline, we should better define the input and output data types (file format, dimensionality, file name)
  • probability maps and 2k × 2k images can stay as large .zarr files

  1. data consistency
  2. parallelization
  3. zarr caching

can we avoid data duplication? Are there intermediate stages that can avoid data duplication?
