
Richardson Lucy Parallelization V2 #274

Open · wants to merge 50 commits into develop

Conversation

@avalluvan

Based on the feedback I received on version 1 of the RL parallelization, I have implemented a new setup.

RichardsonLucy.py

  • Introduction of a new object comm that handles all MPI communication if an MPI descriptor is passed as an argument during initialization
  • No changes to the existing structure. Features such as the acceleration parameter, background normalization, etc. reuse existing code
  • One synchronization barrier at initialization (summed exposure map) and three barriers at every iteration (end of Estep: expectation; end of Mstep: delta_model; and a post-processing broadcast of the updated model) to support message passing (see the sketch after this list)
  • Introduction of "slice" variables to work with the partial vectors generated at each node
  • The final output is identical to before
  • Works in both serial and parallel modes
  • Serial mode (and single-node parallel mode, which is effectively serial code) can be accessed via a Jupyter notebook (including the existing tutorial). For parallel mode, use RLparallelscript.py
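
Below is a minimal mpi4py sketch of the synchronization pattern described above, not the actual cosipy implementation: the variable names, array sizes, and the choice of Allreduce (rather than, say, gathers) are all illustrative.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Barrier 1 (initialization): combine per-node slices of the exposure map.
    local_exposure = np.ones(3072)                  # this node's partial exposure map
    summed_exposure = np.empty_like(local_exposure)
    comm.Allreduce(local_exposure, summed_exposure, op=MPI.SUM)

    model = np.full(3072, 1e-4)                     # placeholder initial model map
    for iteration in range(10):
        # Barrier 2 (end of Estep): combine partial expected-count vectors.
        local_expectation = np.ones(1000)           # stand-in for this node's E-step result
        expectation = np.empty_like(local_expectation)
        comm.Allreduce(local_expectation, expectation, op=MPI.SUM)

        # Barrier 3 (end of Mstep): combine partial delta_model slices.
        local_delta = np.zeros_like(model)          # stand-in for this node's M-step result
        delta_model = np.empty_like(local_delta)
        comm.Allreduce(local_delta, delta_model, op=MPI.SUM)

        # Barrier 4 (post-processing): MASTER applies the update and
        # broadcasts the updated model to all nodes.
        if comm.Get_rank() == 0:
            model = model + delta_model
        comm.Bcast(model, root=0)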

DataInterfaceWithParallelSupport.py

  • An iterative update to the data interface
  • Instead of passing the data directly, one passes the event, background, and image response filenames
  • The comm object must also be passed
  • Limitation: I am working towards expanding this to handle the FullDetectorResponse file. NUMROWS and NUMCOLS are currently fixed, but will use FullDetectorResponse.npix in the future
  • The dataset object returned by this new module works exactly the same way as the DataInterfaceDC2 module. Pass it to image_deconvolution through ImageDeconvolution.set_dataset([dataset])
  • Limitation: Many ad-hoc fixes for sliced data formats. These could be avoided with specific features from histpy.Histogram, if they exist. Multiple instances of object reconstruction were required
  • Ideally, two response files should be stored (the usual and the transposed version) to maximize the speedup from column-major data reads
  • Functions like calc_likelihood, calc_expectation, etc. remain the same, except for minor changes from "full" variables to "sliced" variables to accommodate each node's smaller execution size (a row-slicing sketch follows this list)
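
As an illustration of the "sliced" variables, here is a sketch of a contiguous even row split across nodes; NUMROWS and NUMCOLS are the fixed placeholders mentioned above, and reading the actual slice from the response file is elided:

    import numpy as np
    from mpi4py import MPI

    NUMROWS, NUMCOLS = 10000, 3072   # fixed placeholder sizes, per the limitation above

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Contiguous block of data-space rows owned by this node; the last
    # node absorbs any remainder rows.
    rows_per_node = NUMROWS // size
    start = rank * rows_per_node
    stop = NUMROWS if rank == size - 1 else start + rows_per_node

    # The real interface would read these rows from the response file; here
    # the slice is only allocated. Downstream functions (calc_expectation,
    # calc_likelihood, ...) then loop over (stop - start) rows instead of NUMROWS.
    response_slice = np.zeros((stop - start, NUMCOLS))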

RLparallelscript.py

  • A completely refreshed version of RLparallelscript.py
  • Invoked as mpiexec -n <number of processes> python RLparallelscript.py

New files:
  • RichardsonLucyParallel.py
  • RLparallelscript.py

Potentially modified files:
  • dataIF_COSI_DC2.py
  • deconvolution_algorithm_base.py
  • image_deconvolution_data_interface_base.py
  • image_deconvolution.py
  • model_base.py

RichardsonLucySimple.py and RichardsonLucy.py were modified to propagate the config file from the user-facing image_deconvolution object to the respective deconvolution algorithms.
…ed to subsequently overwrite remote.

Merge remote-tracking branch 'refs/remotes/origin/develop' into develop
Can also work with eps-to-Em mapping. Need to generalize
Interpolated scheme in get_point_source_response() tested and works as intended.
…interface and main script to test the implementation
Create new RLparallelscript.py with MPI porting capabilities
Update dataIFWithParallelSupport.py to cull unnecessary for loops
Fixed bugs with summed_exposure_map (needs to be summed across processes) and dict_bkg_norm (was only being updated in the MASTER node)
…pports parallel execution with a simple change to DataIF. Next task is to generalize DataIF
@avalluvan avalluvan added this to the v4.0 - DC4 milestone Dec 13, 2024
@avalluvan avalluvan self-assigned this Dec 13, 2024
@avalluvan (Author)

Currently, the DC2 (existing) and Parallel (new) data interfaces can be used interchangeably for serial code, and they produce the same output. However, the latter must be used for parallel code.

@israelmcmc (Collaborator)

Thanks @avalluvan! I think it's a great improvement with respect to V1. I still need to look at the code in detail, but I read your description and checked the files changed. A few first impressions:

  • About these two limitations:

Limitation: I am working towards expanding this to handling the FullDetectorResponse file. The NUMROWS and NUMCOLS are currently fixed but will use FullDetectorResponse.npix in the future
Limitation: Many ad-hoc fixes for sliced data formats. Could be fixed with specific features from histpy.Histogram if they exist. Multiple instances of object reconstruction was required.

It's good that you opened the PR so we can start the review, but I'd wait for these two limitations to be resolved before merging.

  • Can you please clarify why you needed to modify RichardsonLucy.py? In the example I shared on Richardson Lucy Parallelization #237, I was able to use MPI by changing only the data interface (all barriers were contained in the data interface of that class). I'd prefer to modify as few parts of the code as possible unless really necessary.

  • In a similar spirit, I see that many other file changes seem unrelated or not needed for the RL parallelization, e.g., notebooks, config files, your personal comments in code, and probably some features related to the new response handling. Can you please keep only the changes needed for the features in this PR? That will facilitate the review. I'd be happy to help if you have questions on how to deal with git to achieve this.

…llel

Three instances (all pertaining to saving results) remain in RichardsonLucy class.
@avalluvan (Author)

On point 1, I have updated the code to migrate most parallelization features to dataIFWithParallelSupport.py. There are three instances where the RichardsonLucy class needs to know which node is running it (serial mode / MASTER node, or any other node), all of them pertaining to how results are processed. The results from register_results() could, in principle, be stored on each node; however, if all the nodes were to save them to disk, that may cause unnecessary R/W slowdowns. Instead of passing the comm object to the RichardsonLucy class, do you suggest adding a flag variable? If yes, where should I add it: image_deconvolution.py or RichardsonLucy.py?

I have resolved the merge conflicts in the response handling code. I had added a few comments to parts of the imaging code that took me a while to figure out, to make for easier reading. Do you want me to remove those? The tutorial notebooks were probably modified greatly, and I would not want to commit those changes to the develop branch.

Do you think we should wait until the dataIF code is modified for DC3 and handles FullDetectorResponse objects properly? The current pull request adds a feature for parallel execution on top of the existing DC2 imaging code, and I think it could be merged as an iterative update.

@hiyoneda (Contributor) left a comment

Thanks, @avalluvan. I added some comments on your RL codes directly.

By the way, I noticed that you changed some classes that are probably not related to the RL parallelization itself, for example FullDetectorResponse, SpacecraftFile, and PointSourceResponse. I am concerned that reviewing these different issues simultaneously may easily lead to mistakes. So, is it possible to separate them from this PR? Then we can review this PR more easily.

cosipy/image_deconvolution/RichardsonLucy.py (2 resolved review threads)
# expected count histograms
self.expectation_list = self.calc_expectation_list(model = self.initial_model, dict_bkg_norm = self.dict_bkg_norm)
logger.info("The expected count histograms were calculated with the initial model map.")

Contributor:

Is it possible to keep these lines? To use the updated model for the likelihood calculation, I wanted to perform the expected count calculation at the post-processing and initialization steps and skip it in the Estep.

Author:

I can undo these changes. Do you plan on moving this to Estep() in the future, or removing Estep() altogether?

@@ -66,16 +70,26 @@ def __init__(self, initial_model, dataset, mask, parameter):
else:
os.makedirs(self.save_results_directory)

Contributor:

I understand that RL needs to know if it is performed on the master node and needs this kind of parameter. I would instead suggest preparing two parameters, something like (see the sketch after this list):

  • self.parallel_computation = True / False
  • self.master_node = True / False

I want to prepare a parameter that explicitly tells whether the computation is in parallel or not. I will add some suggestions regarding these changes at other lines.
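
A rough sketch of how these two flags might be inferred, assuming a comm argument that holds an mpi4py communicator in parallel mode and None in serial mode (the helper name is hypothetical):

    def get_mode_flags(comm):
        """Infer (parallel_computation, master_node) from an mpi4py
        communicator, or from None in serial mode."""
        parallel_computation = comm is not None and comm.Get_size() > 1
        master_node = comm is None or comm.Get_rank() == 0
        return parallel_computation, master_node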

Author:

One of the ideas we discussed in a previous meeting was to let the program directly infer whether it was being run in serial or parallel mode. In fact, the suggested flag variables were what I used in the initial V2 pull request code. Do you recommend making this modification, i.e., inferring self.parallel_computation in image_deconvolution.py or in RichardsonLucy.py? The issue with inferring this in the image deconvolution class is: what happens when we have multiple input datasets ([dataset1, dataset2, ...])? Each dataset will have its own "sub_comm" object.

cosipy/image_deconvolution/RichardsonLucy.py (2 resolved review threads)
cosipy/image_deconvolution/allskyimage.py (outdated; resolved)
Author:

I do not understand why this file is showing up in this pull request.

@avalluvan (Author) left a comment

Reviewed all changes. All files except point_source_injector.ipynb are intact.

@avalluvan (Author)

It looks like the unit tests are failing because mpi4py is not part of the package. How do I rectify that?

@israelmcmc israelmcmc closed this Dec 16, 2024
@israelmcmc israelmcmc reopened this Dec 16, 2024
@israelmcmc (Collaborator)

Thanks @avalluvan. I haven't checked all of this yet, but about this:

It looks like the unit tests are failing because mpi4py is not part of the package. How do I rectify that?

In principle, you should update the requirements here:

install_requires = ["histpy",

However, mpi4py is a special case because it needs the MPI backend installed, which I don't think you can do with pip (I used conda). One option is:

  • Since most users won't use mpi4py, we can make it an optional package. If it's not installed, a try/except clause will catch it during the imports. In that case the code can only run serially, which is fine (a minimal sketch of this guard follows this list).
  • The above will make the tests pass, but it won't actually test the parallelization. We should probably have a custom workflow where MPI is installed to test this. It's OK with me if we leave this for future work.
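
A minimal sketch of the optional-import guard suggested in the first bullet; the helper name is hypothetical:

    try:
        from mpi4py import MPI
    except ImportError:   # mpi4py (or its MPI backend) is not installed
        MPI = None

    def get_comm():
        """Return the world communicator, or None to force serial execution."""
        return MPI.COMM_WORLD if MPI is not None else None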
