
Richardson Lucy Parallelization V2 #274

Open · wants to merge 50 commits into develop

Conversation

@avalluvan

Based on the feedback I received on version 1 of the RL parallelization, I have implemented a new setup.

RichardsonLucy.py

  • Introduction of a new object comm that handles all MPI communication if an MPI descriptor is passed as an argument during initialization
  • No changes to the existing structure. Features such as the acceleration parameter, background normalization, etc. reuse existing code
  • One synchronization barrier at initialization (summed exposure map) and three barriers at every iteration (end of Estep: expectation; end of Mstep: delta_model; and a post-processing broadcast of the updated model) to support message passing (see the sketch after this list)
  • Introduction of "slice" variables to work with the partial vectors generated at each node
  • The final output is identical to before
  • Works in both serial and parallel modes
  • Serial mode (and single-node parallel mode, which is effectively serial code) can be accessed via a Jupyter notebook (including the existing tutorial). For parallel mode, use RLparallelscript.py
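
Below is a minimal mpi4py sketch of the synchronization pattern described above, not the actual cosipy implementation: the variable names, array sizes, and the choice of Allreduce (rather than, say, gathers) are all illustrative.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Barrier 1 (initialization): combine per-node slices of the exposure map.
    local_exposure = np.ones(3072)                  # this node's partial exposure map
    summed_exposure = np.empty_like(local_exposure)
    comm.Allreduce(local_exposure, summed_exposure, op=MPI.SUM)

    model = np.full(3072, 1e-4)                     # placeholder initial model map
    for iteration in range(10):
        # Barrier 2 (end of Estep): combine partial expected-count vectors.
        local_expectation = np.ones(1000)           # stand-in for this node's E-step result
        expectation = np.empty_like(local_expectation)
        comm.Allreduce(local_expectation, expectation, op=MPI.SUM)

        # Barrier 3 (end of Mstep): combine partial delta_model slices.
        local_delta = np.zeros_like(model)          # stand-in for this node's M-step result
        delta_model = np.empty_like(local_delta)
        comm.Allreduce(local_delta, delta_model, op=MPI.SUM)

        # Barrier 4 (post-processing): MASTER applies the update and
        # broadcasts the updated model to all nodes.
        if comm.Get_rank() == 0:
            model = model + delta_model
        comm.Bcast(model, root=0)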

DataInterfaceWithParallelSupport.py

  • An iterative update to the data interface
  • Instead of passing the data directly, one passes the event, background, and image response filenames
  • The comm object must also be passed
  • Limitation: I am working towards expanding this to handle the FullDetectorResponse file. NUMROWS and NUMCOLS are currently fixed, but will use FullDetectorResponse.npix in the future
  • The dataset object returned by this new module works exactly the same way as the DataInterfaceDC2 module. Pass it to image_deconvolution through ImageDeconvolution.set_dataset([dataset])
  • Limitation: Many ad-hoc fixes for sliced data formats. These could be avoided with specific features from histpy.Histogram, if they exist. Multiple instances of object reconstruction were required
  • Ideally, two response files should be stored (the usual and the transposed version) to maximize the speedup from column-major data reads
  • Functions like calc_likelihood, calc_expectation, etc. remain the same, except for minor changes from "full" variables to "sliced" variables to accommodate each node's smaller execution size (a row-slicing sketch follows this list)
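
As an illustration of the "sliced" variables, here is a sketch of a contiguous even row split across nodes; NUMROWS and NUMCOLS are the fixed placeholders mentioned above, and reading the actual slice from the response file is elided:

    import numpy as np
    from mpi4py import MPI

    NUMROWS, NUMCOLS = 10000, 3072   # fixed placeholder sizes, per the limitation above

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Contiguous block of data-space rows owned by this node; the last
    # node absorbs any remainder rows.
    rows_per_node = NUMROWS // size
    start = rank * rows_per_node
    stop = NUMROWS if rank == size - 1 else start + rows_per_node

    # The real interface would read these rows from the response file; here
    # the slice is only allocated. Downstream functions (calc_expectation,
    # calc_likelihood, ...) then loop over (stop - start) rows instead of NUMROWS.
    response_slice = np.zeros((stop - start, NUMCOLS))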

RLparallelscript.py

  • A completely refreshed version of RLparallelscript.py
  • Invoked as mpiexec -n <number of processes> python RLparallelscript.py

New files:
  • RichardsonLucyParallel.py
  • RLparallelscript.py

Potentially modified files:
  • dataIF_COSI_DC2.py
  • deconvolution_algorithm_base.py
  • image_deconvolution_data_interface_base.py
  • image_deconvolution.py
  • model_base.py

RichardsonLucySimple.py and RichardsonLucy.py were modified to propagate the config file from the user-facing image_deconvolution object to the respective deconvolution algorithms.
…ed to subsequently overwrite remote.

Merge remote-tracking branch 'refs/remotes/origin/develop' into develop
Can also work with eps-to-Em mapping. Need to generalize
Interpolated scheme in get_point_source_response() tested and works as intended.
…interface and main script to test the implementation
Create new RLparallelscript.py with MPI porting capabilities
Update dataIFWithParallelSupport.py to cull unnecessary for loops
Fixed bugs with summed_exposure_map (needs to be summed across processes) and dict_bkg_norm (was only being updated in the MASTER node)
…pports parallel execution with a simple change to DataIF. Next task is to generalize DataIF
@avalluvan avalluvan added this to the v4.0 - DC4 milestone Dec 13, 2024
@avalluvan avalluvan self-assigned this Dec 13, 2024
@avalluvan (Author)

Currently, the DC2 (existing) and Parallel (new) data interfaces can be used interchangeably for serial code, and they produce the same output. However, the latter must be used for parallel code.

@israelmcmc (Collaborator)

Thanks @avalluvan! I think it's a great improvement with respect to V1. I still need to look at the code in detail, but I read your description and checked the files changed. A few first impressions:

  • About these two limitations:

Limitation: I am working towards expanding this to handling the FullDetectorResponse file. The NUMROWS and NUMCOLS are currently fixed but will use FullDetectorResponse.npix in the future
Limitation: Many ad-hoc fixes for sliced data formats. Could be fixed with specific features from histpy.Histogram if they exist. Multiple instances of object reconstruction was required.

It's good that you opened the PR so we can start the review, but I'd wait for these two limitations to be resolved before merging.

  • Can you please clarify why you needed to modify RichardsonLucy.py? In the example I shared on Richardson Lucy Parallelization #237, I was able to use MPI by changing only the data interface (all barriers were contained in the data interface of that class). I'd prefer to modify as few parts of the code as possible unless really necessary.

  • In a similar spirit, I see that many other file changes seem unrelated or not needed for the RL parallelization, e.g., notebooks, config files, your personal comments in code, and probably some features related to the new response handling. Can you please keep only the changes needed for the features in this PR? That will facilitate the review. I'd be happy to help if you have questions on how to deal with git to achieve this.

…llel

Three instances (all pertaining to saving results) remain in RichardsonLucy class.
@avalluvan (Author)

On point 1, I have updated the code to migrate most parallelization features to dataIFWithParallelSupport.py. There are three instances where the RichardsonLucy class needs to know which node is running it (serial mode / MASTER node, or any other node), all of them pertaining to how results are processed. The results from register_results() could, in principle, be stored on each node; however, if all the nodes were to save them to disk, that may cause unnecessary R/W slowdowns. Instead of passing the comm object to the RichardsonLucy class, do you suggest adding a flag variable? If yes, where should I add it: image_deconvolution.py or RichardsonLucy.py?

I have resolved the merge conflicts in the response handling code. I had added a few comments to parts of the imaging code that took me a while to figure out, to make for easier reading. Do you want me to remove those? The tutorial notebooks were probably modified greatly, and I would not want to commit those changes to the develop branch.

Do you think we should wait until the dataIF code is modified for DC3 and handles FullDetectorResponse objects properly? The current pull request adds a feature for parallel execution on top of the existing DC2 imaging code, and I think it could be merged as an iterative update.

@hiyoneda (Contributor) left a comment

Thanks, @avalluvan. I added some comments on your RL codes directly.

By the way, I noticed that you changed some classes that are probably not related to the RL parallelization itself, for example FullDetectorResponse, SpacecraftFile, and PointSourceResponse. I am concerned that reviewing these different issues simultaneously may easily lead to mistakes. So, is it possible to separate them from this PR? Then we can review this PR more easily.

cosipy/image_deconvolution/RichardsonLucy.py (2 resolved review threads)
# expected count histograms
self.expectation_list = self.calc_expectation_list(model = self.initial_model, dict_bkg_norm = self.dict_bkg_norm)
logger.info("The expected count histograms were calculated with the initial model map.")

Contributor:

Is it possible to keep these lines? To use the updated model for the likelihood calculation, I wanted to perform the expected count calculation at the post-processing and initialization steps and skip it in the Estep.

Author:

I can undo these changes. Do you plan on moving this to Estep() in the future, or removing Estep() altogether?

@@ -66,16 +70,26 @@ def __init__(self, initial_model, dataset, mask, parameter):
else:
os.makedirs(self.save_results_directory)

Contributor:

I understand that RL needs to know if it is performed on the master node and needs this kind of parameter. I would instead suggest preparing two parameters, something like (see the sketch after this list):

  • self.parallel_computation = True / False
  • self.master_node = True / False

I want to prepare a parameter that explicitly tells whether the computation is in parallel or not. I will add some suggestions regarding these changes at other lines.
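
A rough sketch of how these two flags might be inferred, assuming a comm argument that holds an mpi4py communicator in parallel mode and None in serial mode (the helper name is hypothetical):

    def get_mode_flags(comm):
        """Infer (parallel_computation, master_node) from an mpi4py
        communicator, or from None in serial mode."""
        parallel_computation = comm is not None and comm.Get_size() > 1
        master_node = comm is None or comm.Get_rank() == 0
        return parallel_computation, master_node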

Author:

One of the ideas we discussed in a previous meeting was to let the program directly infer whether it was being run in serial or parallel mode. In fact, the suggested flag variables were what I used in the initial V2 pull request code. Do you recommend making this modification, i.e., inferring self.parallel_computation in image_deconvolution.py or in RichardsonLucy.py? The issue with inferring this in the image deconvolution class is: what happens when we have multiple input datasets ([dataset1, dataset2, ...])? Each dataset will have its own "sub_comm" object.

cosipy/image_deconvolution/RichardsonLucy.py (2 resolved review threads)
cosipy/image_deconvolution/allskyimage.py (outdated; resolved)
Author:

I do not understand why this file is showing up in this pull request.

@avalluvan (Author) left a comment

Reviewed all changes. All files except point_source_injector.ipynb are intact.

@avalluvan (Author)

It looks like the unit tests are failing because mpi4py is not part of the package. How do I rectify that?

@israelmcmc israelmcmc closed this Dec 16, 2024
@israelmcmc israelmcmc reopened this Dec 16, 2024
@israelmcmc (Collaborator)

Thanks @avalluvan. I haven't checked all of this yet, but about this:

It looks like the unit tests are failing because mpi4py is not part of the package. How do I rectify that?

In principle, you should update the requirements here:

install_requires = ["histpy",

However, mpi4py is a special case because it needs the MPI backend installed, which I don't think you can do with pip (I used conda). One option is:

  • Since most users won't use mpi4py, we can make it an optional package. If it's not installed, a try/except clause will catch it during the imports. In that case the code can only run serially, which is fine (a minimal sketch of this guard follows this list).
  • The above will make the tests pass, but it won't actually test the parallelization. We should probably have a custom workflow where MPI is installed to test this. It's OK with me if we leave this for future work.
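
A minimal sketch of the optional-import guard suggested in the first bullet; the helper name is hypothetical:

    try:
        from mpi4py import MPI
    except ImportError:   # mpi4py (or its MPI backend) is not installed
        MPI = None

    def get_comm():
        """Return the world communicator, or None to force serial execution."""
        return MPI.COMM_WORLD if MPI is not None else None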
