Use H5Dchunk_iter to get chunk information from HDF5 files #331
This pull request makes chunk information retrieval from HDF5 files more efficient, especially when more than 16K chunks are involved. Translation times drop from tens of seconds to fractions of a second.
Fix #286
Here we replace dsid.get_chunk_info with dsid.chunk_iter, which is available with HDF5 1.14.0 and will soon be available with HDF5 1.12.3. The underlying HDF5 C call is H5Dchunk_iter. This is considerably more efficient when a dataset has a large number of chunks.
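The difference between the two access patterns can be sketched as follows. This is a minimal illustration, not the code in this pull request: the helper name collect_chunk_info and the returned dictionary layout are hypothetical, and the sketch assumes that dsid behaves like an h5py DatasetID whose chunk_iter method is simply absent when the underlying HDF5 build is too old.

```python
def collect_chunk_info(dsid):
    """Map each chunk's logical offset to its (byte_offset, size) in the file.

    `dsid` is assumed to behave like an h5py DatasetID (the `.id` attribute of
    an h5py Dataset). The function name and return layout are illustrative.
    """
    chunks = {}

    def record(info):
        # `info` is assumed to be a StoreInfo-like object exposing
        # chunk_offset, byte_offset and size. Returning None lets the
        # iteration continue.
        chunks[tuple(info.chunk_offset)] = (info.byte_offset, info.size)

    if hasattr(dsid, "chunk_iter"):
        # One call: HDF5 walks the chunk index once and invokes the
        # callback for every chunk.
        dsid.chunk_iter(record)
    else:
        # Fallback: one lookup per chunk index, which is the pattern
        # that scales poorly with the number of chunks.
        for index in range(dsid.get_num_chunks()):
            record(dsid.get_chunk_info(index))

    return chunks
```

With h5py, this would be called as collect_chunk_info(f["data"].id) on an open file f, where "data" is a placeholder dataset name.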
Before this pull request: N^2 scaling with the number of chunks
With 16,384 chunks, translating takes 13 seconds. With 32,768 chunks (twice as many), translating takes 74 seconds.
After this pull request: Linear scaling with the number of chunks
With 16,384 chunks, translating takes 0.131 seconds. With 32,768 chunks (twice as many), translating takes 0.214 seconds.
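As a rough check on the scaling claims, the timings quoted above can be turned into an empirical exponent k, assuming the translation time follows t ~ N^k. The helper name scaling_exponent is made up for this sketch, and two data points per case only give a coarse estimate.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ N**k from two (chunk count, seconds) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Timings quoted in this pull request description.
before = scaling_exponent(16_384, 13.0, 32_768, 74.0)    # get_chunk_info
after = scaling_exponent(16_384, 0.131, 32_768, 0.214)   # chunk_iter

print(f"before: t ~ N^{before:.2f}")
print(f"after:  t ~ N^{after:.2f}")
```

The estimate comes out at about 2.5 before the change, so at least quadratic, and about 0.7 after it, so no worse than linear over this range.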
Summary
Time for SingleHdf5ToZarr.translate():

| Chunks | get_chunk_info | chunk_iter |
| --- | --- | --- |
| 16,384 | 13 s | 0.131 s |
| 32,768 | 74 s | 0.214 s |
Edit (May 8): Changed the "After" times to reflect "Total Time" and added 262,144 chunks.