Filter HDUs before loading to SpectrumList #1696

duytnguyendtn · 2022-10-03T21:03:55Z

Description

This PR improves the load-time performance of the Mosviz NIRISS parser. It addresses a specific inefficiency, redundant loading, that @ojustino raised during the viz stress test hack hour. Currently, the parser loads all HDUs into Spectrum1D objects via SpectrumList.read(), regardless of whether the user specified those HDUs in the provided catalog. This PR changes the logic to select the relevant SOURCEIDs (and metadata HDUs) before passing them to specutils, rather than after. In testing, this speeds up parsing by roughly 20%:

Before: 102033062 function calls (100890412 primitive calls) in 49.951 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      492    8.429    0.017   24.057    0.049 link_manager.py:54(discover_links)
    51052    8.066    0.000   10.686    0.000 link_manager.py:50(<listcomp>)
 28467365    3.619    0.000    3.619    0.000 component_link.py:208(get_from_ids)
  5960297    1.305    0.000    1.305    0.000 link_manager.py:82(<listcomp>)
     2600    1.240    0.000    6.484    0.002 misc.py:424(did_you_mean)
  1469000    1.142    0.000    1.950    0.000 difflib.py:651(real_quick_ratio)
11567329/11410347    1.122    0.000    2.500    0.000 {built-in method builtins.len}
  5963687    0.987    0.000    0.989    0.000 {built-in method builtins.max}
  5960297    0.967    0.000    0.967    0.000 component_link.py:235(get_to_id)
   308800    0.839    0.000    1.225    0.000 difflib.py:622(quick_ratio)
     2600    0.806    0.000    4.469    0.002 difflib.py:666(get_close_matches)
   212004    0.647    0.000    2.444    0.000 configuration.py:406(__call__)
436504/109126    0.552    0.000    0.730    0.000 app.py:1105(find_viewer_item)
  3754403    0.517    0.000    0.598    0.000 {built-in method builtins.isinstance}
   412181    0.504    0.000    0.504    0.000 {method 'match' of 're.Pattern' objects}
   262582    0.496    0.000    0.847    0.000 configuration.py:510(get_config)
  3766242    0.484    0.000    0.485    0.000 {method 'get' of 'dict' objects}
  1471600    0.436    0.000    0.436    0.000 difflib.py:196(set_seq1)
  1777800    0.336    0.000    0.336    0.000 difflib.py:39(_calculate_ratio)
  2004015    0.323    0.000    0.323    0.000 {method 'setdefault' of 'dict' objects}
  2887274    0.321    0.000    0.321    0.000 {method 'append' of 'list' objects}
    14604    0.268    0.000    0.273    0.000 header.py:1839(_updateindices)

After: 80686669 function calls (79799537 primitive calls) in 40.327 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      492    9.919    0.020   26.508    0.054 link_manager.py:54(discover_links)
    51052    8.431    0.000   11.246    0.000 link_manager.py:50(<listcomp>)
 29403860    3.767    0.000    3.767    0.000 component_link.py:208(get_from_ids)
  6896792    1.539    0.000    1.539    0.000 link_manager.py:82(<listcomp>)
  6900176    1.187    0.000    1.190    0.000 {built-in method builtins.max}
8964337/8825777    0.903    0.000    2.434    0.000 {built-in method builtins.len}
  6896792    0.876    0.000    0.876    0.000 component_link.py:235(get_to_id)
436504/109126    0.579    0.000    0.763    0.000 app.py:1105(find_viewer_item)
   121658    0.345    0.000    0.514    0.000 configuration.py:510(get_config)
  2391432    0.334    0.000    0.386    0.000 {built-in method builtins.isinstance}
    99676    0.307    0.000    1.283    0.000 configuration.py:406(__call__)
  2186484    0.282    0.000    0.283    0.000 {method 'get' of 'dict' objects}
      520    0.246    0.000    1.281    0.002 misc.py:424(did_you_mean)
   157935    0.241    0.000    0.241    0.000 {method 'match' of 're.Pattern' objects}
   110433    0.237    0.000    0.443    0.000 containers.py:174(_default_getter)
   293800    0.225    0.000    0.386    0.000 difflib.py:651(real_quick_ratio)
     1628    0.177    0.000    0.598    0.000 header.py:340(fromstring)
    61760    0.169    0.000    0.245    0.000 difflib.py:622(quick_ratio)
      520    0.162    0.000    0.884    0.002 difflib.py:666(get_close_matches)

One thing to note: unsurprisingly, the largest time sink according to the profiles above is data linking. I include the top of each profile above for future reference.
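
For anyone wanting to produce a profile in the same layout, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. `slow_parse` is a hypothetical stand-in for the real parser call, not a function in jdaviz; sorting by `SortKey.TIME` is what yields the "Ordered by: internal time" header.

```python
import cProfile
import io
import pstats


def slow_parse(n=20000):
    """Hypothetical stand-in for the parser call being profiled."""
    return sum(len(str(i)) for i in range(n))


profiler = cProfile.Profile()
profiler.runcall(slow_parse)  # profile a single call of the target function

# Collect the report in a string so it can be saved or pasted into a PR.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats(pstats.SortKey.TIME).print_stats(20)  # top 20 by tottime
report = buffer.getvalue()
print(report)
```

The `print_stats(20)` limit keeps only the top of the stack, which is usually enough to spot the dominant cost.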

As an additional stretch goal, I also cleaned up some of the NIRISS parsing logic and tests. Chief among these changes, I removed the patch we previously had to force SRCTYPE to EXTENDED, after conversations with @camipacifici concluded that we can now depend on the pipeline setting this keyword.
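
The filter-before-load idea described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `FakeHDU`, `filter_hdus`, and `META_HDU_NAMES` are hypothetical names, plain Python objects stand in for FITS HDUs so the example runs without astropy, and the assumption that the metadata HDUs are named PRIMARY and ASDF may not hold for every file.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHDU:
    """Hypothetical stand-in for a FITS HDU: a name plus a header mapping."""
    name: str
    header: dict = field(default_factory=dict)


# Assumption: metadata HDUs that are always kept, whatever the catalog lists.
META_HDU_NAMES = {"PRIMARY", "ASDF"}


def filter_hdus(hdus, wanted_source_ids):
    """Keep metadata HDUs plus HDUs whose SOURCEID appears in the catalog."""
    wanted = set(wanted_source_ids)  # build once for fast membership tests
    return [
        hdu for hdu in hdus
        if hdu.name in META_HDU_NAMES or hdu.header.get("SOURCEID") in wanted
    ]


hdus = [
    FakeHDU("PRIMARY"),
    FakeHDU("EXTRACT1D", {"SOURCEID": 1}),
    FakeHDU("EXTRACT1D", {"SOURCEID": 2}),
    FakeHDU("EXTRACT1D", {"SOURCEID": 3}),
    FakeHDU("ASDF"),
]

# Only source 2 is listed in the (hypothetical) catalog.
filtered = filter_hdus(hdus, wanted_source_ids=[2])
kept = [(hdu.name, hdu.header.get("SOURCEID")) for hdu in filtered]
print(kept)
```

Only the filtered list would then be handed to the expensive loading step, so sources absent from the catalog are never converted into spectra.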

Change log entry

Is a change log needed? If yes, is it added to CHANGES.rst? If you want to avoid merge conflicts,
list the proposed change log here for review and add to CHANGES.rst before merge. If no, maintainer
should add a no-changelog-entry-needed label.

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

Are two approvals required? Branch protection rule does not check for the second approval. If a second approval is not necessary, please apply the trivial label.
Do the proposed changes actually accomplish desired goals? Also manually run the affected example notebooks, if necessary.
Do the proposed changes follow the STScI Style Guides?
Are tests added/updated as required? If so, do they follow the STScI Style Guides?
Are docs added/updated as required? If so, do they follow the STScI Style Guides?
Did the CI pass? If not, are the failures related?
Is a milestone set?
After merge, any internal documentations need updating (e.g., JIRA, Innerspace)?

codecov · 2022-10-03T21:15:42Z

Codecov Report

Base: 87.16% // Head: 87.21% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (7ab47d1) compared to base (4c2a3dc).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1696      +/-   ##
==========================================
+ Coverage   87.16%   87.21%   +0.05%     
==========================================
  Files          95       95              
  Lines        9908     9941      +33     
==========================================
+ Hits         8636     8670      +34     
+ Misses       1272     1271       -1

Impacted Files	Coverage Δ
jdaviz/configs/mosviz/plugins/parsers.py	`88.27% <100.00%> (+0.30%)`	⬆️
jdaviz/core/template_mixin.py	`92.61% <0.00%> (-0.09%)`	⬇️
...igs/default/plugins/subset_plugin/subset_plugin.py	`97.91% <0.00%> (+0.02%)`	⬆️
...z/configs/default/plugins/line_lists/line_lists.py	`74.67% <0.00%> (+0.33%)`	⬆️
jdaviz/core/user_api.py	`88.00% <0.00%> (+0.50%)`	⬆️
...default/plugins/gaussian_smooth/gaussian_smooth.py	`98.86% <0.00%> (+2.39%)`	⬆️


kecnry

Filtering in advance definitely makes a lot of sense, nice catch! I know this is no longer the bottleneck, but I have one comment that might be worth trying in order to optimize a little more while we're editing this block of code.

jdaviz/configs/mosviz/plugins/parsers.py

pllim

How about you also filter out those with the missing SRCTYPE and just not try to load them, instead of going into except? Or is that impossible to check in advance?

Maybe also safer to add a check to see if len(filtered_hdul) == 0 before you try to load them. It is possible that a given file has no qualifying HDU.
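
A minimal sketch of the suggested empty-list guard, under the assumption that the loading step is passed in as a callable. `load_if_any` and `fake_loader` are hypothetical names used only for illustration; they are not part of jdaviz or specutils.

```python
def load_if_any(filtered_hdus, loader):
    """Call the loader only when at least one HDU survived filtering."""
    if len(filtered_hdus) == 0:
        # No qualifying HDU in this file, so skip the loader entirely.
        return []
    return loader(filtered_hdus)


calls = []  # records how many HDUs each loader call received


def fake_loader(hdus):
    """Hypothetical loader that just labels each HDU it receives."""
    calls.append(len(hdus))
    return [f"spectrum-{i}" for i, _ in enumerate(hdus)]


empty_result = load_if_any([], fake_loader)
full_result = load_if_any(["hdu-a", "hdu-b"], fake_loader)
print(empty_result, full_result, calls)
```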

jdaviz/configs/mosviz/plugins/parsers.py

duytnguyendtn · 2022-10-06T17:20:31Z

filter out those with the missing SRCTYPE and just not try to load them

So the issue here is the desired behavior; in this scenario, I think we want to error out and not try to load a directory with missing SRCTYPE. This is because this check is specifically in the 1D parser. If we only skip over the file, then our 1D spectra file list has fewer elements than the 2D spectra list, and there would be a mismatch. To be frank, I don't even know how Mosviz would handle lists of different sizes...

Presumably we could skip the entire target and omit the 2D spectra, but I think handling that is outside the scope of specifically removing the SRCTYPE hack.
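
To illustrate the trade-off, here is a rough sketch of the "error out" behavior using plain dictionaries as stand-ins for FITS headers. `parse_1d_headers` is a hypothetical helper, and raising `KeyError` is an assumption made for the example, not necessarily the exception the parser raises.

```python
def parse_1d_headers(headers):
    """Return (SOURCEID, SRCTYPE) pairs, aborting if any SRCTYPE is missing."""
    spectra = []
    for index, header in enumerate(headers):
        if "SRCTYPE" not in header:
            # Fail loudly rather than skip: skipping would leave the 1D list
            # shorter than the matching 2D list.
            raise KeyError(f"SRCTYPE missing from HDU {index}; aborting load")
        spectra.append((header["SOURCEID"], header["SRCTYPE"]))
    return spectra


good = [{"SOURCEID": 1, "SRCTYPE": "EXTENDED"}, {"SOURCEID": 2, "SRCTYPE": "POINT"}]
bad = [{"SOURCEID": 1, "SRCTYPE": "EXTENDED"}, {"SOURCEID": 2}]

parsed = parse_1d_headers(good)

try:
    parse_1d_headers(bad)
    raised = False
except KeyError:
    raised = True

print(parsed, raised)
```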

jdaviz/configs/mosviz/tests/test_parsers.py

pllim

I am not familiar with the actual data format from the instrument, so I am just doing a general review.

jdaviz/configs/mosviz/tests/test_parsers.py


kecnry

LGTM now, thanks! (Definitely would suggest a squash & merge)

rosteen

Only a slight performance increase on my machine with the data I'm testing with, but I can see it being more impactful for larger catalogs (and good to improve the logic regardless). LGTM.

duytnguyendtn · 2022-10-07T20:40:58Z

Thanks for the reviews! To your point, Ricky, I should have specified that this improves performance specifically when the catalog selects only a subset of the sources included in the FITS file. If you're loading all the data, then I wouldn't expect a big improvement either.

duytnguyendtn added the mosviz label Oct 3, 2022

duytnguyendtn requested review from rosteen, javerbukh, ojustino, pllim and kecnry as code owners October 3, 2022 21:03

kecnry reviewed Oct 4, 2022

jdaviz/configs/mosviz/plugins/parsers.py

duytnguyendtn added 6 commits October 5, 2022 12:22

Filter HDUs before loading to SpectrumList

916b173

Copy over metadata hdus manually

ebdeb1c

Pop metahdus to remove them from loop

e9296d9

Simplify hdu filter and remove SRCTYPE EXTENDED hack

bdd5875

Update NIRISS parser test with new data

99c0a4f

Remove duplicate test

46510a9

duytnguyendtn force-pushed the mosvizcat branch from a0dd504 to 46510a9 Compare October 5, 2022 16:25

duytnguyendtn added 2 commits October 5, 2022 14:34

Add missing srctype unittest

1578e68

Codestyle

c7cebed

duytnguyendtn requested a review from kecnry October 5, 2022 18:50

pllim added this to the 2.11 milestone Oct 5, 2022

pllim added the bug and performance labels Oct 5, 2022

pllim reviewed Oct 5, 2022

jdaviz/configs/mosviz/plugins/parsers.py

Cache source_ids

ebc2cc3

duytnguyendtn added the Ready for final review label Oct 6, 2022

Change clashing variable name

cb40e3d

duytnguyendtn requested a review from pllim October 6, 2022 20:50

pllim reviewed Oct 6, 2022

jdaviz/configs/mosviz/tests/test_parsers.py

pllim reviewed Oct 6, 2022


Rely on autogenerated, unique temp dir for data

7daa149

Co-authored-by: P. L. Lim <[email protected]>

pllim reviewed Oct 7, 2022

jdaviz/configs/mosviz/tests/test_parsers.py

duytnguyendtn and others added 2 commits October 7, 2022 11:06

Changelog

b4384c0

Remove stray warning filter

7ab47d1

Co-authored-by: P. L. Lim <[email protected]>

kecnry approved these changes Oct 7, 2022


rosteen approved these changes Oct 7, 2022


duytnguyendtn merged commit 2dc195b into spacetelescope:main Oct 7, 2022

duytnguyendtn deleted the mosvizcat branch October 7, 2022 20:41
