Filter HDUs before loading to SpectrumList #1696

Merged
merged 13 commits into from
Oct 7, 2022

Conversation

duytnguyendtn
Collaborator

@duytnguyendtn duytnguyendtn commented Oct 3, 2022

Description

This PR improves the load-time performance of the Mosviz NIRISS parser. It addresses a specific inefficiency raised by @ojustino during the viz stress test hack hour: redundant loading. Currently, the parser loads all HDUs into Spectrum1D objects via SpectrumList.read(), regardless of whether the user specified those HDUs in the provided catalog. This PR changes the logic to select the relevant SOURCEIDs (and the metadata HDUs) before passing them to specutils, rather than after. In testing, this yields a speedup of roughly 20% in parsing time (49.951 s down to 40.327 s):

Before:

         102033062 function calls (100890412 primitive calls) in 49.951 seconds
   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      492    8.429    0.017   24.057    0.049 link_manager.py:54(discover_links)
    51052    8.066    0.000   10.686    0.000 link_manager.py:50(<listcomp>)
 28467365    3.619    0.000    3.619    0.000 component_link.py:208(get_from_ids)
  5960297    1.305    0.000    1.305    0.000 link_manager.py:82(<listcomp>)
     2600    1.240    0.000    6.484    0.002 misc.py:424(did_you_mean)
  1469000    1.142    0.000    1.950    0.000 difflib.py:651(real_quick_ratio)
11567329/11410347    1.122    0.000    2.500    0.000 {built-in method builtins.len}
  5963687    0.987    0.000    0.989    0.000 {built-in method builtins.max}
  5960297    0.967    0.000    0.967    0.000 component_link.py:235(get_to_id)
   308800    0.839    0.000    1.225    0.000 difflib.py:622(quick_ratio)
     2600    0.806    0.000    4.469    0.002 difflib.py:666(get_close_matches)
   212004    0.647    0.000    2.444    0.000 configuration.py:406(__call__)
436504/109126    0.552    0.000    0.730    0.000 app.py:1105(find_viewer_item)
  3754403    0.517    0.000    0.598    0.000 {built-in method builtins.isinstance}
   412181    0.504    0.000    0.504    0.000 {method 'match' of 're.Pattern' objects}
   262582    0.496    0.000    0.847    0.000 configuration.py:510(get_config)
  3766242    0.484    0.000    0.485    0.000 {method 'get' of 'dict' objects}
  1471600    0.436    0.000    0.436    0.000 difflib.py:196(set_seq1)
  1777800    0.336    0.000    0.336    0.000 difflib.py:39(_calculate_ratio)
  2004015    0.323    0.000    0.323    0.000 {method 'setdefault' of 'dict' objects}
  2887274    0.321    0.000    0.321    0.000 {method 'append' of 'list' objects}
    14604    0.268    0.000    0.273    0.000 header.py:1839(_updateindices)

After:

         80686669 function calls (79799537 primitive calls) in 40.327 seconds
   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      492    9.919    0.020   26.508    0.054 link_manager.py:54(discover_links)
    51052    8.431    0.000   11.246    0.000 link_manager.py:50(<listcomp>)
 29403860    3.767    0.000    3.767    0.000 component_link.py:208(get_from_ids)
  6896792    1.539    0.000    1.539    0.000 link_manager.py:82(<listcomp>)
  6900176    1.187    0.000    1.190    0.000 {built-in method builtins.max}
8964337/8825777    0.903    0.000    2.434    0.000 {built-in method builtins.len}
  6896792    0.876    0.000    0.876    0.000 component_link.py:235(get_to_id)
436504/109126    0.579    0.000    0.763    0.000 app.py:1105(find_viewer_item)
   121658    0.345    0.000    0.514    0.000 configuration.py:510(get_config)
  2391432    0.334    0.000    0.386    0.000 {built-in method builtins.isinstance}
    99676    0.307    0.000    1.283    0.000 configuration.py:406(__call__)
  2186484    0.282    0.000    0.283    0.000 {method 'get' of 'dict' objects}
      520    0.246    0.000    1.281    0.002 misc.py:424(did_you_mean)
   157935    0.241    0.000    0.241    0.000 {method 'match' of 're.Pattern' objects}
   110433    0.237    0.000    0.443    0.000 containers.py:174(_default_getter)
   293800    0.225    0.000    0.386    0.000 difflib.py:651(real_quick_ratio)
     1628    0.177    0.000    0.598    0.000 header.py:340(fromstring)
    61760    0.169    0.000    0.245    0.000 difflib.py:622(quick_ratio)
      520    0.162    0.000    0.884    0.002 difflib.py:666(get_close_matches)

One thing to note: unsurprisingly, the largest time-eater according to the profiling runs above is data linking. I have included the top of each profile above for future reference.
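
For anyone wanting to reproduce a profile in the same "Ordered by: internal time" layout, here is a minimal sketch using the standard-library cProfile and pstats modules. The function `parse_everything` is a hypothetical stand-in for the parser call being measured; it is not part of jdaviz.

```python
import cProfile
import io
import pstats


def parse_everything(n):
    """Hypothetical stand-in for the parser call being profiled."""
    return sum(len(str(i)) for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = parse_everything(10_000)
profiler.disable()

# Sorting by "tottime" yields the "Ordered by: internal time" header.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(20)  # show at most 20 rows
report = stream.getvalue()
print(report)
```

Sorting by `tottime` ranks functions by time spent in their own body, which is why the linking functions appear at the top of both listings even though their callers have larger cumulative times.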

As an additional stretch goal, I also cleaned up some of the NIRISS parsing logic and tests. Chief among these changes, I removed the patch we previously had to force SRCTYPE to EXTENDED, after conversations with @camipacifici concluded that we can now depend on the pipeline setting this keyword.
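
To illustrate the filter-before-load idea from the top of this description, here is a minimal sketch. It is not the code in this PR: `FakeHDU` is a plain dataclass standing in for a FITS HDU (only a name and a header mapping), `filter_hdus` is a hypothetical helper, and the metadata HDU names `PRIMARY` and `ASDF` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHDU:
    """Stand-in for a FITS HDU: just a name and a header mapping."""
    name: str
    header: dict = field(default_factory=dict)


# Assumed names of the metadata HDUs that must always be kept.
METADATA_HDU_NAMES = ("PRIMARY", "ASDF")


def filter_hdus(hdus, wanted_source_ids):
    """Keep metadata HDUs plus HDUs whose SOURCEID is in the catalog."""
    wanted = set(wanted_source_ids)
    return [
        hdu for hdu in hdus
        if hdu.name in METADATA_HDU_NAMES
        or hdu.header.get("SOURCEID") in wanted
    ]


hdus = [
    FakeHDU("PRIMARY"),
    FakeHDU("EXTRACT1D", {"SOURCEID": 1}),
    FakeHDU("EXTRACT1D", {"SOURCEID": 2}),
    FakeHDU("EXTRACT1D", {"SOURCEID": 3}),
    FakeHDU("ASDF"),
]

# Only source 2 is in the (hypothetical) catalog, so only that
# spectrum would be handed on to the expensive loading step.
filtered = filter_hdus(hdus, wanted_source_ids=[2])
kept = [(hdu.name, hdu.header.get("SOURCEID")) for hdu in filtered]
print(kept)
```

The point of the design is that the expensive step (building a spectrum object per HDU) only ever sees the HDUs the catalog asks for, so the saving scales with how many sources the catalog leaves out.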

Change log entry

  • Is a change log needed? If yes, is it added to CHANGES.rst? If you want to avoid merge conflicts,
    list the proposed change log here for review and add to CHANGES.rst before merge. If no, maintainer
    should add a no-changelog-entry-needed label.

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

  • Are two approvals required? Branch protection rule does not check for the second approval. If a second approval is not necessary, please apply the trivial label.
  • Do the proposed changes actually accomplish desired goals? Also manually run the affected example notebooks, if necessary.
  • Do the proposed changes follow the STScI Style Guides?
  • Are tests added/updated as required? If so, do they follow the STScI Style Guides?
  • Are docs added/updated as required? If so, do they follow the STScI Style Guides?
  • Did the CI pass? If not, are the failures related?
  • Is a milestone set?
  • After merge, any internal documentations need updating (e.g., JIRA, Innerspace)?

codecov bot commented Oct 3, 2022

Codecov Report

Base: 87.16% // Head: 87.21% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (7ab47d1) compared to base (4c2a3dc).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1696      +/-   ##
==========================================
+ Coverage   87.16%   87.21%   +0.05%     
==========================================
  Files          95       95              
  Lines        9908     9941      +33     
==========================================
+ Hits         8636     8670      +34     
+ Misses       1272     1271       -1     

Impacted Files                                          Coverage Δ
jdaviz/configs/mosviz/plugins/parsers.py                88.27% <100.00%> (+0.30%) ⬆️
jdaviz/core/template_mixin.py                           92.61% <0.00%> (-0.09%) ⬇️
...igs/default/plugins/subset_plugin/subset_plugin.py   97.91% <0.00%> (+0.02%) ⬆️
...z/configs/default/plugins/line_lists/line_lists.py   74.67% <0.00%> (+0.33%) ⬆️
jdaviz/core/user_api.py                                 88.00% <0.00%> (+0.50%) ⬆️
...default/plugins/gaussian_smooth/gaussian_smooth.py   98.86% <0.00%> (+2.39%) ⬆️

Member

@kecnry kecnry left a comment

I definitely think it makes a lot of sense to filter in advance, nice catch! I know this is no longer the bottleneck, but I have one comment that might be worth trying, to optimize a little more while we're editing this block of code.

jdaviz/configs/mosviz/plugins/parsers.py (outdated, resolved)
@duytnguyendtn duytnguyendtn requested a review from kecnry October 5, 2022 18:50
@pllim pllim added this to the 2.11 milestone Oct 5, 2022
@pllim pllim added bug (Something isn't working) and performance (Performance related) labels Oct 5, 2022
Contributor

@pllim pllim left a comment

How about you also filter out those with the missing SRCTYPE and just not try to load them, instead of falling through to the except block? Or is that impossible to check in advance?

It may also be safer to check whether len(filtered_hdul) == 0 before you try to load them. It is possible that a given file has no qualifying HDU.

jdaviz/configs/mosviz/plugins/parsers.py (outdated, resolved)
@duytnguyendtn
Collaborator Author

> filter out those with the missing SRCTYPE and just not try to load them

So the issue here is the desired behavior: in this scenario, I think we want to error out and not try to load a directory with a missing SRCTYPE. This is because this check is specifically in the 1D parser. If we only skip over the file, then our 1D spectra file list has fewer elements than the 2D spectra list, and there would be a mismatch. To be frank, I don't even know how Mosviz would handle lists of different sizes...

Presumably we could skip the entire target and omit the 2D spectra, but I think handling that is outside the scope of removing the SRCTYPE hack.
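
The two checks discussed in this thread (an empty filtered list, and a missing SRCTYPE) can be sketched as follows. This is an illustration under stated assumptions, not the parser's code: `validate_spectral_headers` is a hypothetical helper, headers are plain dicts standing in for FITS headers, and raising on an empty list is just one possible choice.

```python
def validate_spectral_headers(headers, filename):
    """Raise rather than skip, so the 1D and 2D spectra lists stay aligned.

    `headers` is a list of dict-like headers for the spectral HDUs that
    survived catalog filtering (metadata HDUs are assumed to be excluded).
    """
    if len(headers) == 0:
        # Assumed behavior: treat "nothing to load" as an error.
        raise ValueError(f"{filename}: no HDU matches the catalog")
    for index, header in enumerate(headers):
        if "SRCTYPE" not in header:
            raise KeyError(f"{filename}: spectral HDU {index} has no SRCTYPE")
    return headers


good = [{"SOURCEID": 1, "SRCTYPE": "EXTENDED"}]
checked = validate_spectral_headers(good, "example_x1d.fits")

# Both failure modes surface as exceptions instead of silent skips.
errors = []
for bad in ([], [{"SOURCEID": 7}]):
    try:
        validate_spectral_headers(bad, "example_x1d.fits")
    except (ValueError, KeyError) as exc:
        errors.append(type(exc).__name__)
print(errors)
```

Failing loudly keeps the number of 1D spectra equal to the number of 2D spectra; silently dropping a file would shift every later row out of alignment.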

@duytnguyendtn duytnguyendtn requested a review from pllim October 6, 2022 20:50
Contributor

@pllim pllim left a comment

I am not familiar with the actual data format from the instrument, so I am just doing a general review.

jdaviz/configs/mosviz/tests/test_parsers.py (9 outdated review threads, resolved)
Member

@kecnry kecnry left a comment

LGTM now, thanks! (Definitely would suggest a squash & merge)

Collaborator

@rosteen rosteen left a comment

Only a slight performance increase on my machine with the data I'm testing with, but I can see it being more impactful for larger catalogs (and good to improve the logic regardless). LGTM.

@duytnguyendtn duytnguyendtn merged commit 2dc195b into spacetelescope:main Oct 7, 2022
@duytnguyendtn
Collaborator Author

Thanks for the reviews! To your point, Ricky, I should have specified that this improves performance specifically when the catalog selects only a subset of the sources included in the FITS file. If you're loading all the data, then I wouldn't expect a big improvement either.

@duytnguyendtn duytnguyendtn deleted the mosvizcat branch October 7, 2022 20:41
Labels
bug (Something isn't working), mosviz, performance (Performance related), Ready for final review
4 participants