
Rebuild a fits manifest from an HSC data directory. #115

Merged: 11 commits into main on Nov 19, 2024

Conversation

@mtauraso (Collaborator) commented Nov 8, 2024

  • Added a new verb, rebuild_manifest
  • When run with the HSC dataset class, this verb will:
    0) Scan the data directory and ingest HSC cutout files
    1) Read in the original catalog file configured for download for metadata
    2) Write out rebuilt_manifest.fits in the data directory
  • Fixed up config resolution so that fibad_config.toml in the cwd works again for CLI invocations
  • File scanning has been parallelized with multiprocessing, reducing runtime from ~30 days to ~6 hours on Hyak with 24M files
  • The number of processes to launch is chosen dynamically based on system limits, with the intent of achieving maximum I/O bandwidth
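As a rough illustration of the parallel scan described above, here is a minimal stdlib-only sketch; `fits_dims` and `scan_parallel` are hypothetical names for illustration, not the actual fibad code:

```python
import multiprocessing
import os

def fits_dims(path):
    # Read NAXIS1/NAXIS2 from a FITS primary header using only the stdlib.
    # FITS headers come in 2880-byte blocks of 80-character ASCII "cards".
    dims = {}
    with open(path, "rb") as f:
        while True:
            block = f.read(2880)
            if len(block) < 2880:
                break
            for i in range(0, 2880, 80):
                card = block[i:i + 80].decode("ascii", errors="replace")
                key = card[:8].strip()
                if key in ("NAXIS1", "NAXIS2"):
                    # The value field occupies columns 11-30; strip any comment.
                    dims[key] = int(card[10:30].split("/")[0])
                elif key == "END":
                    return dims.get("NAXIS1"), dims.get("NAXIS2")
    return dims.get("NAXIS1"), dims.get("NAXIS2")

def scan_parallel(paths, processes=None):
    # One worker per CPU by default; the PR tunes the count further
    # against system limits to saturate I/O bandwidth.
    processes = processes or os.cpu_count()
    with multiprocessing.Pool(processes) as pool:
        return dict(zip(paths, pool.map(fits_dims, paths)))
```

The actual implementation also has to cope with malformed files and per-process file-descriptor limits, which this sketch omits.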

@mtauraso mtauraso self-assigned this Nov 8, 2024
@mtauraso mtauraso linked an issue Nov 8, 2024 that may be closed by this pull request

codecov bot commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 52.57143% with 83 lines in your changes missing coverage. Please review.

Project coverage is 40.93%. Comparing base (ae484c1) to head (f749a26).
Report is 1 commit behind head on main.

Files with missing lines             Patch %   Missing lines
src/fibad/data_sets/hsc_data_set.py  52.44%    68 ⚠️
src/fibad/rebuild_manifest.py         0.00%     9 ⚠️
src/fibad/fibad.py                   50.00%     3 ⚠️
src/fibad/config_utils.py            86.66%     2 ⚠️
src/fibad/download.py                50.00%     1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #115      +/-   ##
==========================================
- Coverage   40.95%   40.93%   -0.03%     
==========================================
  Files          19       21       +2     
  Lines        1084     1698     +614     
==========================================
+ Hits          444      695     +251     
- Misses        640     1003     +363     


github-actions bot commented Nov 8, 2024

Before [ae484c1]   After [1ddf089]   Ratio   Benchmark (Parameter)
3.46±1s            1.81±2s           ~0.52   benchmarks.time_computation
3.31k              2.38k              0.72   benchmarks.mem_list


@drewoldag (Collaborator) left a comment:

I left a few comments here; I don't believe any of them are blocking, but it would be good to know the answers since they are efficiency related.

I'm guessing that rebuild_manifest is not a process that will happen often, so maybe efficiency isn't as much of a concern, but faster is always better, right?


data_set.rebuild_manifest(config)

logger.info("Finished Prepare")
Collaborator comment:

Suggested change
logger.info("Finished Prepare")
logger.info("Finished rebuilding manifest")

src/fibad/data_sets/hsc_data_set.py (outdated; resolved)

for object_id, filter, filename, dim in self._all_files_full():
    for static_col in static_column_names:
        columns[static_col].append(static_values[static_col])
Collaborator comment:

At this point in the manifest building process, do we know the number of files that will be yielded by self._all_files_full()? If so, can we build out the static_col for loop in one go outside of the outer for loop?

for static_col in static_column_names:
    columns[static_col] = [static_values[static_col]] * num_files

@mtauraso (Author) replied:

Yeah, we could reverse the loops here, and that might be faster.

I'm mostly worried about memory footprint here rather than speed. I wrote it in this order in case I could integrate fitsio and stream individual objects into the FITS file, rather than use astropy's model.

@mtauraso (Author) replied:

Testing on Hyak, the entire flow takes ~6 hours, with this particular loop taking ~3 minutes on an 8M-object/24M-file dataset.

The critical section for rebuild_manifest is now the I/O in _fits_file_dims, which is where nearly all of the time is spent. This is also the critical section for dataset loading in general.

I'm going to pass on this optimization for now, and make it so that by default we skip the filesystem scan, which saves time in the typical case where we already have a manifest.
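The manifest-first behavior described here could look something like the following hypothetical sketch (the function and the exact config/file layout are invented for illustration; only the manifest filename comes from the PR description):

```python
from pathlib import Path

def resolve_data_source(data_dir, force_rescan=False):
    # Prefer an existing manifest; only fall back to the expensive
    # filesystem scan when no manifest is present (or a rescan is forced).
    manifest = Path(data_dir) / "rebuilt_manifest.fits"
    if manifest.is_file() and not force_rescan:
        return ("manifest", manifest)
    return ("scan", Path(data_dir))
```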


for dynamic_col in dynamic_column_names:
    if dynamic_col == "object_id":
        columns[dynamic_col].append(int(object_id))
Collaborator comment:

Roughly the same question as above: if we know beforehand the total number of files, we could preallocate the arrays. I understand that Python is supposed to be efficient with .append, but the over-allocation of memory will hit in clearly defined steps for each of these arrays and more than double the memory footprint each time.

That might be just fine, but if we could do the same thing here by updating a specific index instead of .append()ing, we might save ourselves a headache in the future.

Of course, if we don't know the number of files beforehand, then just ignore this :). Perhaps a prep for loop that calls the generator just to increment a counter would be worthwhile?
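The preallocation suggested here might be sketched as follows (a hypothetical standalone helper, not the actual fibad code; it assumes the records have already been materialized into a list so their count is known):

```python
def build_columns(records, dynamic_column_names):
    # Preallocate each column to its final length and write by index,
    # avoiding the stepwise regrowth that repeated .append() can cause.
    num_files = len(records)
    columns = {col: [None] * num_files for col in dynamic_column_names}
    for i, (object_id, filter_name, filename) in enumerate(records):
        for col in dynamic_column_names:
            if col == "object_id":
                columns[col][i] = int(object_id)
            elif col == "filter":
                columns[col][i] = filter_name
            elif col == "filename":
                columns[col][i] = filename
    return columns
```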

@mtauraso (Author) replied:

Yeah, there's probably a speedup here, and we do know the size of these objects at the start of the loop.

- Added a new verb rebuild_manifest
- When run with the HSC dataset class this verb will:
  0) Scan the data directory and ingest HSC cutout files
  1) Read in the original catalog file configured for download for metadata
  2) Write out rebuilt_manifest.fits in the data directory

- Fixed up config resolution so that fibad_config.toml in the cwd
  works again for CLI invocations.
- Attempting to gather data about why rebuild_manifest is slow
- Also remove an obvious and unnecessary list copy from HSCDataSet
Using Schwimmbad and multiprocessing to parallelize extracting the dimensions of files in HSCDataSet, to effect a speedup on 10M+ file datasets.

Not currently tuned to Hyak; no speedup measured yet. Intending to run this on Hyak to tune parameters.
@mtauraso mtauraso merged commit 4eb8301 into main Nov 19, 2024
8 checks passed
@mtauraso mtauraso deleted the rebuild-manifest branch November 19, 2024 23:03
Development

Successfully merging this pull request may close these issues.

Add ability to rebuild a manifest file
2 participants