Rebuild a fits manifest from an HSC data directory. #115
Conversation
mtauraso commented Nov 8, 2024 (edited)
- Added a new verb rebuild_manifest.
- When run with the HSC dataset class this verb will:
  0. Scan the data directory and ingest HSC cutout files.
  1. Read in the original catalog file configured for download, for metadata.
  2. Write out rebuilt_manifest.fits in the data directory.
- Fixed up config resolution so that fibad_config.toml in the cwd works again for CLI invocations.
- File scanning has been parallelized with multiprocessing, reducing the time from ~30 days to ~6 hours on Hyak with 24M files.
- The number of processes to launch is chosen dynamically based on system limits, with the intent of achieving maximum I/O bandwidth (a rough sketch of such a heuristic follows below).
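A minimal sketch of what that kind of worker-count heuristic could look like. This is purely illustrative: the cap of 64, the descriptor headroom, and the function name are assumptions, not the code in this PR.

```python
import os
import resource


def choose_num_processes(cap: int = 64) -> int:
    """Pick a scan worker count bounded by CPU count and the open-file limit."""
    cpu_count = os.cpu_count() or 1
    # Soft limit on open file descriptors; each scanning worker holds files open.
    soft_nofile, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Reserve some descriptors for the parent process, logging, etc.
    fd_budget = max(1, (soft_nofile - 64) // 4)
    return max(1, min(cpu_count, fd_budget, cap))
```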
Codecov Report

@@            Coverage Diff             @@
##             main     #115      +/-   ##
==========================================
- Coverage   40.95%   40.93%    -0.03%
==========================================
  Files          19       21        +2
  Lines        1084     1698      +614
==========================================
+ Hits          444      695      +251
- Misses        640     1003      +363
I left a few comments here. I don't believe any of them are blocking, but it would be good to know the answers since they are efficiency related.
I'm guessing that rebuild_manifest is not a process that will happen often, so maybe the efficiency isn't as much of a concern, but faster is always better, right?
src/fibad/rebuild_manifest.py (outdated)

    data_set.rebuild_manifest(config)

    logger.info("Finished Prepare")
Suggested change:

    - logger.info("Finished Prepare")
    + logger.info("Finished rebuilding manifest")
    for object_id, filter, filename, dim in self._all_files_full():
        for static_col in static_column_names:
            columns[static_col].append(static_values[static_col])
At this point in the manifest-building process, do we know the number of files that will be yielded by self._all_files_full()? If so, can we build out the static_col loop in one go, outside of the outer for loop?

    for static_col in static_column_names:
        columns[static_col] = [static_values[static_col]] * num_files
Yeah, we could reverse the loops here, and that might be faster.
I'm mostly worried about memory footprint here rather than speed. I wrote it in this order in case I could integrate fitsio and stream individual objects into the FITS file, rather than using astropy's model of building the whole table in memory before writing.
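For context, a rough sketch of the two write models being weighed here. The column dict and the chunks_of_rows generator of numpy structured arrays are placeholders, and the fitsio calls reflect my reading of its documented API rather than code in this PR.

```python
import fitsio
from astropy.table import Table


def write_with_astropy(columns: dict, path: str = "rebuilt_manifest.fits") -> None:
    # Astropy model: every column is fully materialized in memory, then written once.
    Table(columns).write(path, overwrite=True)


def write_with_fitsio(chunks_of_rows, path: str = "rebuilt_manifest.fits") -> None:
    # fitsio model: stream batches of rows so the full table never sits in memory.
    with fitsio.FITS(path, "rw", clobber=True) as out:
        for i, chunk in enumerate(chunks_of_rows):  # each chunk: numpy structured array
            if i == 0:
                out.write(chunk)       # first chunk creates the binary table HDU
            else:
                out[-1].append(chunk)  # later chunks are appended row by row
```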
Testing on Hyak, the entire flow takes ~6 hours, with this particular loop taking ~3 minutes on an 8M-object / 24M-file dataset.
The critical section for rebuild_manifest is now the I/O in _fits_file_dims, which is where nearly all of the time is spent. This is also the critical section for dataset loading in general.
I'm going to pass on this optimization for now, and make it so that by default we skip the filesystem scan, which will save time in the typical case where we already have a manifest.
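For reference, a sketch of what that I/O-bound step could look like under my assumptions: a header-only read per cutout, fanned out over a multiprocessing pool. The HDU index and chunksize are guesses, not taken from _fits_file_dims.

```python
import multiprocessing as mp

from astropy.io import fits


def read_dims(path):
    # Header-only access; avoids loading or decompressing pixel data.
    with fits.open(path, memmap=True) as hdul:
        header = hdul[1].header  # assumption: the cutout image lives in HDU 1
        return path, (header["NAXIS1"], header["NAXIS2"])


def scan_dims(paths, processes):
    with mp.Pool(processes=processes) as pool:
        # A large chunksize keeps inter-process overhead small with millions of files.
        yield from pool.imap_unordered(read_dims, paths, chunksize=1024)
```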
    for dynamic_col in dynamic_column_names:
        if dynamic_col == "object_id":
            columns[dynamic_col].append(int(object_id))
Roughly the same question as above: if we know the total number of files beforehand, we could preallocate the arrays. I understand that Python is supposed to be efficient with .append, but the over-allocation of memory hits in clearly defined steps for each of these lists and inflates the memory footprint each time.
That might be just fine, but if we could do the same thing here by updating a specific index instead of .append()-ing, we might save ourselves a headache in the future.
Of course, if we don't know the number of files beforehand, then just ignore this :). Perhaps a prep loop that calls the generator just to increment a counter would be worthwhile?
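A sketch of the two-pass idea suggested above; all_files_full stands in for self._all_files_full(), and only the object_id column is filled in to keep it short.

```python
def build_dynamic_columns(all_files_full, dynamic_column_names):
    # First pass: run the generator once just to count the rows.
    num_files = sum(1 for _ in all_files_full())

    # Preallocate each column once instead of growing it with .append().
    columns = {col: [None] * num_files for col in dynamic_column_names}

    # Second pass: fill each column by index.
    for i, (object_id, filt, filename, dim) in enumerate(all_files_full()):
        if "object_id" in columns:
            columns["object_id"][i] = int(object_id)
        # ... fill the remaining dynamic columns the same way ...
    return columns
```

Note that this runs the generator twice, so it only pays off if iterating it is cheap relative to the memory saved.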
Yeah, there's probably a speedup here, and we do know the size of these objects at the start of the loop.
Subsequent commits:
- Attempting to gather data about why rebuild_manifest is slow; also removes an obvious and unnecessary list copy from HSCDataSet.
- Using Schwimmbad and multiprocessing to parallelize extracting the dimensions of files in HSCDataSet, to speed up 10M+ file datasets. Not currently tuned to Hyak; no speedup yet measured. Intending to run this on Hyak to tune parameters.