Rebuild a fits manifest from an HSC data directory. #115
Conversation
mtauraso commented Nov 8, 2024 (edited)
- Added a new verb rebuild_manifest.
- When run with the HSC dataset class this verb will:
  0. Scan the data directory and ingest HSC cutout files.
  1. Read in the original catalog file configured for download, for metadata.
  2. Write out rebuilt_manifest.fits in the data directory.
- Fixed up config resolution so that fibad_config.toml in the cwd works again for CLI invocations.
- File scanning has been parallelized with multiprocessing, reducing the time from ~30 days to ~6 hours on Hyak with 24M files.
- The number of processes to launch is chosen dynamically based on system limits, with the intent of achieving maximum I/O bandwidth (a rough sketch of such a heuristic follows below).
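A minimal sketch of what that kind of worker-count heuristic could look like. This is purely illustrative: the cap of 64, the descriptor headroom, and the function name are assumptions, not the code in this PR.

```python
import os
import resource


def choose_num_processes(cap: int = 64) -> int:
    """Pick a scan worker count bounded by CPU count and the open-file limit."""
    cpu_count = os.cpu_count() or 1
    # Soft limit on open file descriptors; each scanning worker holds files open.
    soft_nofile, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Reserve some descriptors for the parent process, logging, etc.
    fd_budget = max(1, (soft_nofile - 64) // 4)
    return max(1, min(cpu_count, fd_budget, cap))
```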
Codecov Report

@@            Coverage Diff             @@
##             main     #115      +/-   ##
==========================================
- Coverage   40.95%   40.93%    -0.03%
==========================================
  Files          19       21        +2
  Lines        1084     1698      +614
==========================================
+ Hits          444      695      +251
- Misses        640     1003      +363
I left a few comments here. I don't believe any of them are blocking, but it would be good to know the answers since they are efficiency related.
I'm guessing that rebuild_manifest is not a process that will happen often, so maybe the efficiency isn't as much of a concern, but faster is always better, right?
src/fibad/rebuild_manifest.py (outdated)

    data_set.rebuild_manifest(config)

    logger.info("Finished Prepare")
Suggested change:

    - logger.info("Finished Prepare")
    + logger.info("Finished rebuilding manifest")
    for object_id, filter, filename, dim in self._all_files_full():
        for static_col in static_column_names:
            columns[static_col].append(static_values[static_col])
At this point in the manifest-building process, do we know the number of files that will be yielded by self._all_files_full()? If so, can we build out the static_col loop in one go, outside of the outer for loop?

    for static_col in static_column_names:
        columns[static_col] = [static_values[static_col]] * num_files
Yeah, we could reverse the loops here, and that might be faster.
I'm mostly worried about memory footprint here rather than speed. I wrote it in this order in case I could integrate fitsio and stream individual objects into the FITS file, rather than using astropy's model of building the whole table in memory before writing.
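For context, a rough sketch of the two write models being weighed here. The column dict and the chunks_of_rows generator of numpy structured arrays are placeholders, and the fitsio calls reflect my reading of its documented API rather than code in this PR.

```python
import fitsio
from astropy.table import Table


def write_with_astropy(columns: dict, path: str = "rebuilt_manifest.fits") -> None:
    # Astropy model: every column is fully materialized in memory, then written once.
    Table(columns).write(path, overwrite=True)


def write_with_fitsio(chunks_of_rows, path: str = "rebuilt_manifest.fits") -> None:
    # fitsio model: stream batches of rows so the full table never sits in memory.
    with fitsio.FITS(path, "rw", clobber=True) as out:
        for i, chunk in enumerate(chunks_of_rows):  # each chunk: numpy structured array
            if i == 0:
                out.write(chunk)       # first chunk creates the binary table HDU
            else:
                out[-1].append(chunk)  # later chunks are appended row by row
```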
Testing on Hyak, the entire flow takes ~6 hours, with this particular loop taking ~3 minutes on an 8M-object / 24M-file dataset.
The critical section for rebuild_manifest is now the I/O in _fits_file_dims, which is where nearly all of the time is spent. This is also the critical section for dataset loading in general.
I'm going to pass on this optimization for now, and make it so that by default we skip the filesystem scan, which will save time in the typical case where we already have a manifest.
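For reference, a sketch of what that I/O-bound step could look like under my assumptions: a header-only read per cutout, fanned out over a multiprocessing pool. The HDU index and chunksize are guesses, not taken from _fits_file_dims.

```python
import multiprocessing as mp

from astropy.io import fits


def read_dims(path):
    # Header-only access; avoids loading or decompressing pixel data.
    with fits.open(path, memmap=True) as hdul:
        header = hdul[1].header  # assumption: the cutout image lives in HDU 1
        return path, (header["NAXIS1"], header["NAXIS2"])


def scan_dims(paths, processes):
    with mp.Pool(processes=processes) as pool:
        # A large chunksize keeps inter-process overhead small with millions of files.
        yield from pool.imap_unordered(read_dims, paths, chunksize=1024)
```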
    for dynamic_col in dynamic_column_names:
        if dynamic_col == "object_id":
            columns[dynamic_col].append(int(object_id))
Roughly the same question as above: if we know the total number of files beforehand, we could preallocate the arrays. I understand that Python is supposed to be efficient with .append, but the over-allocation of memory hits in clearly defined steps for each of these lists and inflates the memory footprint each time.
That might be just fine, but if we could do the same thing here by updating a specific index instead of .append()-ing, we might save ourselves a headache in the future.
Of course, if we don't know the number of files beforehand, then just ignore this :). Perhaps a prep loop that calls the generator just to increment a counter would be worthwhile?
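A sketch of the two-pass idea suggested above; all_files_full stands in for self._all_files_full(), and only the object_id column is filled in to keep it short.

```python
def build_dynamic_columns(all_files_full, dynamic_column_names):
    # First pass: run the generator once just to count the rows.
    num_files = sum(1 for _ in all_files_full())

    # Preallocate each column once instead of growing it with .append().
    columns = {col: [None] * num_files for col in dynamic_column_names}

    # Second pass: fill each column by index.
    for i, (object_id, filt, filename, dim) in enumerate(all_files_full()):
        if "object_id" in columns:
            columns["object_id"][i] = int(object_id)
        # ... fill the remaining dynamic columns the same way ...
    return columns
```

Note that this runs the generator twice, so it only pays off if iterating it is cheap relative to the memory saved.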
Yeah, there's probably a speedup here, and we do know the size of these objects at the start of the loop.
Subsequent commits:
- Attempting to gather data about why rebuild_manifest is slow; also removes an obvious and unnecessary list copy from HSCDataSet.
- Using Schwimmbad and multiprocessing to parallelize extracting the dimensions of files in HSCDataSet, to speed up 10M+ file datasets. Not currently tuned to Hyak; no speedup yet measured. Intending to run this on Hyak to tune parameters.