
Use H5Dchunk_iter to get chunk information from HDF5 files #331

Merged · 2 commits · May 3, 2023

Conversation

@mkitti (Contributor) commented May 3, 2023


This pull request makes chunk information retrieval from HDF5 files more efficient, especially when more than 16K chunks are involved. Translation times drop from tens of seconds to fractions of a second.

Fix #286

Here we replace dsid.get_chunk_info with dsid.chunk_iter, which is available in HDF5 1.14.0 and will soon be available in HDF5 1.12.3. The underlying HDF5 C call is H5Dchunk_iter.

This is particularly beneficial with a large number of chunks.
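
For illustration, here is a minimal sketch of the two approaches (not the exact kerchunk code). It assumes h5py >= 3.8, where DatasetID.chunk_iter passes each chunk's StoreInfo to a callback, and falls back to the older per-chunk lookup otherwise; the collect_chunks helper name is just for this example.

```python
import h5py

def collect_chunks(dset):
    """Map chunk offsets to (byte_offset, size) pairs for a chunked dataset."""
    refs = {}

    def callback(info):
        # `info` is an h5py.h5d.StoreInfo with chunk_offset, filter_mask,
        # byte_offset, and size fields.
        refs[info.chunk_offset] = (info.byte_offset, info.size)

    dsid = dset.id
    if hasattr(dsid, "chunk_iter"):
        # One pass over the chunk index (H5Dchunk_iter): O(N) in the number of chunks.
        dsid.chunk_iter(callback)
    else:
        # One index lookup per chunk (H5Dget_chunk_info): O(N^2) overall.
        for i in range(dsid.get_num_chunks()):
            callback(dsid.get_chunk_info(i))
    return refs
```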

Before this pull request: O(N²) scaling with the number of chunks

With 16,384 chunks, translating takes 13 seconds. With 32,768 chunks (twice as many), translating takes 74 seconds.

In [1]: import kerchunk.hdf, fsspec, h5py

In [2]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*2,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...:
16384

In [3]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...:
CPU times: user 13.3 s, sys: 6.91 ms, total: 13.3 s
Wall time: 13.3 s

In [4]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...:
32768

In [5]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...:
CPU times: user 1min 14s, sys: 34.1 ms, total: 1min 14s
Wall time: 1min 14s

In [6]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*4), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
65536

In [7]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 

CPU times: user 6min 33s, sys: 275 ms, total: 6min 33s
Wall time: 6min 33s

In [8]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*8,1024*8), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
262144

In [9]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 1h 44min 51s, sys: 1.92 s, total: 1h 44min 53s
Wall time: 1d 8h 43min 40s

After this pull request: O(N) (linear) scaling with the number of chunks

With 16,384 chunks, translating takes 0.131 seconds. With 32,768 chunks (twice as many), translating takes 0.214 seconds.

In [1]: import kerchunk.hdf, fsspec, h5py

In [2]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*2,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
16384

In [3]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 131 ms, sys: 23.8 ms, total: 155 ms
Wall time: 154 ms

In [4]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*2), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
32768

In [5]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 214 ms, sys: 4.39 ms, total: 218 ms
Wall time: 217 ms

In [6]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*4,1024*4), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
65536

In [7]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 472 ms, sys: 32.4 ms, total: 504 ms
Wall time: 503 ms

In [8]: with h5py.File("pytest.h5", "w") as f:
   ...:     dset = f.create_dataset("test", (1024*8,1024*8), chunks=(16,16))
   ...:     dset[:] = 1
   ...:     print(dset.id.get_num_chunks())
   ...: 
262144

In [9]: %%time
   ...: with fsspec.open("pytest.h5") as inf:
   ...:     h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, "pytest.h5", inline_threshold=100)
   ...:     h5chunks.translate()
   ...: 
CPU times: user 1.86 s, sys: 63.4 ms, total: 1.92 s
Wall time: 1.92 s

Summary

Time for SingleHdf5ToZarr.translate():

| Number of chunks | Before this pull request, with get_chunk_info | After this pull request, with chunk_iter | Ratio |
|---:|---:|---:|---:|
| 16,384 | 13 seconds | 0.155 seconds | 84x |
| 32,768 | 74 seconds | 0.218 seconds | 339x |
| 65,536 | 393 seconds | 0.504 seconds | 780x |
| 262,144 | 6291 seconds | 1.92 seconds | 3277x |

Edit (May 8): changed the "After" times to reflect total time and added the 262,144-chunk row.

@martindurant (Member) commented:

Thanks for coming back to this - good to see it didn't take too many lines of code but made a big impact!
Do you happen to know whether fewer bytes are being transferred too? That would mean an even bigger impact for remote scanning of HDF5 files.

@mkitti (Contributor, Author) commented May 3, 2023

Do you happen to know whether fewer bytes are being transferred too?

My guess is yes, this does involve fewer bytes transferred. The current method is very inefficient. For each chunk, it starts a new search through the metadata.

The new method searches through the metadata exactly once.

What could mitigate the difference in transferred data is the metadata cache.

https://docs.hdfgroup.org/hdf5/develop/_t_n_m_d_c.html

Another factor could be how consolidated the metadata is. I usually like to have all the metadata in a single 4 MB block at the beginning of the file by setting the metadata block size.

https://docs.hdfgroup.org/hdf5/develop/group___f_a_p_l.html#ga8822e3dedc8e1414f20871a87d533cb1
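
For illustration, a hedged sketch of writing a file with a larger metadata block size. It assumes a recent h5py whose File constructor accepts a meta_block_size keyword (the underlying C routine is H5Pset_meta_block_size); the file name and sizes are just examples.

```python
import h5py

# Ask HDF5 to allocate metadata in 4 MB blocks so it stays consolidated near
# the front of the file (H5Pset_meta_block_size under the hood). The
# meta_block_size keyword is assumed here; older h5py releases only expose
# this through the low-level file-access property list.
with h5py.File("consolidated.h5", "w", meta_block_size=4 * 1024 * 1024) as f:
    dset = f.create_dataset("test", (2048, 2048), chunks=(16, 16), dtype="f4")
    dset[:] = 1
```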

@martindurant (Member) commented:

Another factor could be how consolidated the metadata is. I usually like to have all the metadata in a single 4 MB block at the beginning of the file by setting the metadata block size.

We typically open HDF5 files with the "first" caching strategy, meaning that the first block (usually 5 MB) is never flushed, so it sounds like we were already doing the right thing here. That would not, of course, help with metadata spread throughout the file.
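
For reference, a minimal sketch of that setup on the kerchunk side, along the lines of the kerchunk docs: the bucket URL and the anonymous-access flag are illustrative, and the s3fs storage options default_cache_type="first" / default_fill_cache=False give the first-block caching described above.

```python
import fsspec
import kerchunk.hdf

url = "s3://example-bucket/pytest.h5"  # hypothetical remote file

# Keep only the first ~5 MB block cached and avoid read-ahead, so metadata
# reads near the start of the file do not trigger repeated transfers.
so = dict(mode="rb", anon=True, default_fill_cache=False,
          default_cache_type="first", default_block_size=5 * 2**20)

with fsspec.open(url, **so) as inf:
    refs = kerchunk.hdf.SingleHdf5ToZarr(inf, url, inline_threshold=100).translate()
```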

@martindurant merged commit a25c529 into fsspec:main on May 3, 2023
@mkitti (Contributor, Author) commented May 3, 2023

If one just makes the metadata block larger, then the metadata would be more consolidated.

@martindurant (Member) commented:

Only if you control creation of the HDF5 file! I bet most people never touch, or even know about, such options.

@mkitti (Contributor, Author) commented May 3, 2023

Only if you control creation of the HDF5 file! I bet most people never touch, or even know about, such options.

This is a big issue since the defaults are VERY conservative.
