Filter HDUs before loading to SpectrumList #1696
Conversation
Codecov Report — Base: 87.16% // Head: 87.21% // Increases project coverage by +0.05%.
Additional details and impacted files@@ Coverage Diff @@
## main #1696 +/- ##
==========================================
+ Coverage 87.16% 87.21% +0.05%
==========================================
Files 95 95
Lines 9908 9941 +33
==========================================
+ Hits 8636 8670 +34
+ Misses 1272 1271 -1
Definitely think it makes a lot of sense to filter in advance, nice catch! I know this is no longer the bottleneck, but I have one comment on something that might be worth trying to optimize a little more while we're editing this block of code.
Force-pushed from a0dd504 to 46510a9 (compare)
How about you also filter out those with missing SRCTYPE and just not try to load them, instead of going into the except? Or is that impossible to check in advance?
Maybe also safer to add a check to see if len(filtered_hdul) == 0 before you try to load them. It is possible that a given file has no qualifying HDU.
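A minimal sketch of that suggestion, not the PR's actual code: `FakeHDU` and `filter_hdus_by_source` are hypothetical stand-ins for an astropy HDU (which exposes a `header` mapping) and the parser's filtering step, showing the empty-result guard before anything is handed to specutils.

```python
# Hypothetical sketch: select HDUs by SOURCEID and fail fast when a file
# contains no qualifying HDU, before the expensive load step.
class FakeHDU:
    """Stand-in for an astropy HDU: only the header mapping matters here."""
    def __init__(self, **header):
        self.header = header

def filter_hdus_by_source(hdul, source_ids):
    filtered = [hdu for hdu in hdul
                if hdu.header.get("SOURCEID") in source_ids]
    if len(filtered) == 0:
        # A given file may have no HDU matching the catalog at all.
        raise ValueError("No HDUs match the requested SOURCEIDs")
    return filtered

hdul = [FakeHDU(SOURCEID=1), FakeHDU(SOURCEID=2), FakeHDU()]
print(len(filter_hdus_by_source(hdul, {2})))  # 1
```

The guard turns a confusing downstream specutils failure into an immediate, descriptive error at the filtering step.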
So the issue here is the desired behavior: in this scenario, I think we want to error out and not try to load a directory with missing SRCTYPE, because this check is specifically in the 1D parser. If we only skip over the file, then our 1D spectra file list has fewer elements than the 2D spectra, and there would be a mismatch. To be frank, I don't even know how Mosviz would handle lists of different sizes... Presumably we could skip the entire target and omit the 2D spectra, but I think handling that is outside the scope of specifically removing the redundant loading.
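The error-out behavior described above can be sketched as an up-front validation pass. This is a hypothetical illustration (the names `FakeHDU` and `check_srctype` are not from the PR): raise immediately on a missing keyword rather than silently skipping, so the 1D list never shrinks relative to the 2D list.

```python
# Hypothetical sketch: validate SRCTYPE up front and error out on a miss,
# keeping the 1D and 2D spectra lists the same length.
class FakeHDU:
    """Stand-in for an astropy HDU with a header mapping."""
    def __init__(self, **header):
        self.header = header

def check_srctype(hdus):
    missing = [i for i, hdu in enumerate(hdus)
               if "SRCTYPE" not in hdu.header]
    if missing:
        # Fail loudly: skipping here would desync the 1D and 2D lists.
        raise KeyError(f"SRCTYPE missing from HDU index(es) {missing}")
    return hdus

ok = check_srctype([FakeHDU(SRCTYPE="EXTENDED")])
print(len(ok))  # 1
```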
I am not familiar with the actual data format from the instrument, so I am just doing a general review.
Co-authored-by: P. L. Lim <[email protected]>
LGTM now, thanks! (Definitely would suggest a squash & merge)
Only a slight performance increase on my machine with the data I'm testing with, but I can see it being more impactful for larger catalogs (and good to improve the logic regardless). LGTM.
Thanks for the reviews! To your point, Ricky, I should have specified that this improves performance specifically when the catalog filters on sources that are included in the FITS file; if you're loading all the data, then I wouldn't expect a big improvement either.
Description
This PR improves the load-time performance of the Mosviz NIRISS parser. It addresses a specific inefficiency brought up by @ojustino from the viz stress test hack hour regarding redundant loading. Currently, the parser loads all HDUs into Spectrum1Ds via SpectrumList.read(), regardless of whether the user specified those HDUs in the provided catalog. This PR modifies the logic to instead filter down to the relevant SOURCEIDs (and metadata HDUs) before passing them to specutils, rather than after. In testing, this leads to a speed-up of roughly 20% in parsing time:
Before: 102033062 function calls (100890412 primitive calls) in 49.951 seconds
After: 80686669 function calls (79799537 primitive calls) in 40.327 seconds
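The before/after logic can be sketched with plain-dict stand-ins for HDUs (this is an illustration of the idea, not the PR's actual code; `parse_spectrum` stands in for the Spectrum1D construction done inside SpectrumList.read()):

```python
# Hypothetical sketch: filter on SOURCEID *before* the expensive per-HDU
# parse (new behavior) instead of parsing everything and discarding (old).
def parse_spectrum(hdu):
    # Stand-in for the costly Spectrum1D construction in SpectrumList.read().
    return {"source": hdu["SOURCEID"]}

def load_then_filter(hdus, catalog_ids):
    # Old behavior: parse every HDU, then throw most of the work away.
    spectra = [parse_spectrum(h) for h in hdus]
    return [s for s in spectra if s["source"] in catalog_ids]

def filter_then_load(hdus, catalog_ids):
    # New behavior: parse only the HDUs the catalog actually asks for.
    wanted = [h for h in hdus if h["SOURCEID"] in catalog_ids]
    return [parse_spectrum(h) for h in wanted]

hdus = [{"SOURCEID": i} for i in range(5)]
print(filter_then_load(hdus, {1, 3}))  # [{'source': 1}, {'source': 3}]
```

Both paths return the same spectra; only the number of parse calls differs, which is why the gain shows up when the catalog selects a subset of the file's sources.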
One thing to note: unsurprisingly, the largest time sink according to the profiling tests above is data linking. I provide the top of the profiling stack above for future reference.
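For reproducing measurements like the function-call counts quoted above, the standard-library profiler can be wrapped in a small helper (the name `profile_call` is hypothetical; the cProfile/pstats calls are standard):

```python
# Sketch: profile a call with cProfile and capture the top cumulative-time
# entries, the same kind of report the call counts above come from.
import cProfile
import io
import pstats

def profile_call(func, *args, top=10):
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)  # run func under the profiler
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()

result, report = profile_call(sum, range(1000))
print(result)  # 499500
```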
As an additional stretch goal, I also cleaned up some of the NIRISS parsing logic and tests. Chief among these, I removed the patch we previously had to force SRCTYPE to EXTENDED, after conversations with @camipacifici concluded that we can now depend on this keyword being set by the pipeline.
Change log entry
Is it added to CHANGES.rst? If you want to avoid merge conflicts, list the proposed change log here for review and add to CHANGES.rst before merge. If no, maintainer should add a no-changelog-entry-needed label.
Checklist for package maintainer(s)
This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.
If this is a trivial change, maintainer should add a trivial label.