
Use H5Dchunk_iter to get chunk information from HDF5 files #331

Merged · 2 commits · May 3, 2023

Conversation

@mkitti (Contributor) commented May 3, 2023


This pull request makes chunk information retrieval from HDF5 files more efficient, especially when more than 16K chunks are involved. Translation times drop from tens of seconds to fractions of a second.

Fix #286

Here we replace dsid.get_chunk_info with dsid.chunk_iter, which is available in HDF5 1.14.0 and will soon be available in HDF5 1.12.3. The underlying HDF5 C call is H5Dchunk_iter.

This is particularly beneficial with a large number of chunks.
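
For illustration, here is a minimal sketch of the two approaches (not the exact kerchunk code). It assumes h5py >= 3.8, where DatasetID.chunk_iter passes each chunk's StoreInfo to a callback, and falls back to the older per-chunk lookup otherwise; the collect_chunks helper name is just for this example.

```python
import h5py

def collect_chunks(dset):
    """Map chunk offsets to (byte_offset, size) pairs for a chunked dataset."""
    refs = {}

    def callback(info):
        # `info` is an h5py.h5d.StoreInfo with chunk_offset, filter_mask,
        # byte_offset, and size fields.
        refs[info.chunk_offset] = (info.byte_offset, info.size)

    dsid = dset.id
    if hasattr(dsid, "chunk_iter"):
        # One pass over the chunk index (H5Dchunk_iter): O(N) in the number of chunks.
        dsid.chunk_iter(callback)
    else:
        # One index lookup per chunk (H5Dget_chunk_info): O(N^2) overall.
        for i in range(dsid.get_num_chunks()):
            callback(dsid.get_chunk_info(i))
    return refs
```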

Before this pull request: O(N²) scaling with the number of chunks

With 16,384 chunks, translating takes 13 seconds. With 32,768 chunks (twice as many), translating takes 74 seconds.

In [1]: import kerchunk.hdf, fsspec, h5py

In [2]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*2,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...:
16384

In [3]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...:
CPU times: user 13.3 s, sys: 6.91 ms, total: 13.3 s
Wall time: 13.3 s

In [4]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...:
32768

In [5]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...:
CPU times: user 1min 14s, sys: 34.1 ms, total: 1min 14s
Wall time: 1min 14s

In [6]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*4), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
65536

In [7]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 

CPU times: user 6min 33s, sys: 275 ms, total: 6min 33s
Wall time: 6min 33s

In [8]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*8,1024*8), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
262144

In [9]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 1h 44min 51s, sys: 1.92 s, total: 1h 44min 53s
Wall time: 1d 8h 43min 40s

After this pull request: O(N) (linear) scaling with the number of chunks

With 16,384 chunks, translating takes 0.131 seconds. With 32,768 chunks (twice as many), translating takes 0.214 seconds.

In [1]: import kerchunk.hdf, fsspec, h5py

In [2]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*2,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
16384

In [3]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 131 ms, sys: 23.8 ms, total: 155 ms
Wall time: 154 ms

In [4]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
32768

In [5]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 214 ms, sys: 4.39 ms, total: 218 ms
Wall time: 217 ms

In [6]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*4), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
65536

In [7]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 472 ms, sys: 32.4 ms, total: 504 ms
Wall time: 503 ms

In [8]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*8,1024*8), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
262144

In [9]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 1.86 s, sys: 63.4 ms, total: 1.92 s
Wall time: 1.92 s

Summary

Time for SingleHdf5ToZarr.translate():

| Number of chunks | Before this pull request, with get_chunk_info | After this pull request, with chunk_iter | Ratio |
|---:|---:|---:|---:|
| 16,384 | 13 seconds | 0.155 seconds | 84x |
| 32,768 | 74 seconds | 0.218 seconds | 339x |
| 65,536 | 393 seconds | 0.504 seconds | 780x |
| 262,144 | 6291 seconds | 1.92 seconds | 3277x |

Edit (May 8): changed the "After" times to reflect total time and added the 262,144-chunk row.

@martindurant (Member) commented:

Thanks for coming back to this - good to see it didn't take too many lines of code but made a big impact!
Do you happen to know whether fewer bytes are being transferred too? That would mean an even bigger impact for remote scanning of HDF5 files.

@mkitti (Contributor, Author) commented May 3, 2023

Do you happen to know whether fewer bytes are being transferred too?

My guess is yes, this does involve fewer bytes transferred. The current method is very inefficient. For each chunk, it starts a new search through the metadata.

The new method searches through the metadata exactly once.

What could mitigate the difference in transferred data is the metadata cache.

https://docs.hdfgroup.org/hdf5/develop/_t_n_m_d_c.html

Another factor could be how consolidated the metadata is. I usually like to have all the metadata in a single 4 MB block at the beginning of the file by setting the metadata block size.

https://docs.hdfgroup.org/hdf5/develop/group___f_a_p_l.html#ga8822e3dedc8e1414f20871a87d533cb1
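
For illustration, a hedged sketch of writing a file with a larger metadata block size. It assumes a recent h5py whose File constructor accepts a meta_block_size keyword (the underlying C routine is H5Pset_meta_block_size); the file name and sizes are just examples.

```python
import h5py

# Ask HDF5 to allocate metadata in 4 MB blocks so it stays consolidated near
# the front of the file (H5Pset_meta_block_size under the hood). The
# meta_block_size keyword is assumed here; older h5py releases only expose
# this through the low-level file-access property list.
with h5py.File("consolidated.h5", "w", meta_block_size=4 * 1024 * 1024) as f:
    dset = f.create_dataset("test", (2048, 2048), chunks=(16, 16), dtype="f4")
    dset[:] = 1
```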

@martindurant (Member) commented:

Another factor could be how consolidated the metadata is. I usually like to have all the metadata in a single 4 MB block at the beginning of the file by setting the metadata block size.

We typically open HDF5 files with the "first" caching strategy, meaning that the first block (usually 5 MB) is never flushed, so it sounds like we were already doing the right thing here. That would not, of course, help with metadata spread throughout the file.
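
For reference, a minimal sketch of that setup on the kerchunk side, along the lines of the kerchunk docs: the bucket URL and the anonymous-access flag are illustrative, and the s3fs storage options default_cache_type="first" / default_fill_cache=False give the first-block caching described above.

```python
import fsspec
import kerchunk.hdf

url = "s3://example-bucket/pytest.h5"  # hypothetical remote file

# Keep only the first ~5 MB block cached and avoid read-ahead, so metadata
# reads near the start of the file do not trigger repeated transfers.
so = dict(mode="rb", anon=True, default_fill_cache=False,
          default_cache_type="first", default_block_size=5 * 2**20)

with fsspec.open(url, **so) as inf:
    refs = kerchunk.hdf.SingleHdf5ToZarr(inf, url, inline_threshold=100).translate()
```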

@martindurant merged commit a25c529 into fsspec:main on May 3, 2023
@mkitti (Contributor, Author) commented May 3, 2023

If one just makes the metadata block larger, then the metadata would be more consolidated.

@martindurant (Member) commented:

Only if you control creation of the HDF5 file! I bet most people never touch, or even know about, such options.

@mkitti (Contributor, Author) commented May 3, 2023

Only if you control creation of the HDF5 file! I bet most people never touch, or even know about, such options.

This is a big issue since the defaults are VERY conservative.
