h5netcdf: a new interface for writing netCDF4 files via h5py #390
Comments
Does it read data from Opendap data sets also?
@rsignell-usgs Nope, it can only do netCDF4/HDF5 files. But you could try pydap for that...
Very cool - if you combine this with pupynere you could have a pure python implementation of the netcdf c library (except for the DAP part)! Just a couple of things I noticed in 15 minutes of testing.
This is not related to h5netcdf per se, but has something to do with the way the hdf5 lib is initialized in h5py I think.
Just for clarification, this isn't a pure python implementation exactly because it still depends on the HDF C library via h5py, right?
Indeed, calling this pure Python is perhaps overly generous -- it does depend on the HDF5 C library. The implementation of netCDF4 on top of HDF5 is pure Python, though. h5py already supports both fill values and unlimited dimensions, so that should be pretty easy to hook up.
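For readers unfamiliar with the h5py features mentioned above, here is a minimal sketch of a dataset with a fill value and a resizable (unlimited) axis. The file name, dataset name, and values are made up for illustration, and this is plain h5py usage, not necessarily how h5netcdf hooks these features up internally.

```python
import h5py

# In-memory HDF5 file (core driver), so nothing is written to disk.
# "sketch.h5" and "temperature" are hypothetical names.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "temperature",
        shape=(2,),
        maxshape=(None,),   # None marks the axis as unlimited
        dtype="f8",
        fillvalue=-999.0,   # returned for elements never written
    )
    dset[:] = [1.5, 2.5]

    # Grow the dataset along the unlimited axis; the new elements
    # have not been written, so they read back as the fill value.
    dset.resize((4,))

    values = dset[:]
    max_shape = dset.maxshape
```

After the resize, the last two elements of `values` hold the fill value rather than arbitrary data.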
@jswhit as for your initialization issues, that does seem strange/unfortunate. I haven't encountered that personally on OS X (I'm using h5py and netCDF4 via conda).
does h5py support "orthogonal indexing" with booleans and integers?
@jswhit Yes and no. It doesn't do numpy-style broadcasting indexing so it's not inconsistent with orthogonal indexing, but it also does not support indexing with more than one array, e.g., netCDF4-python can't really support this sort of indexing efficiently, either, so perhaps this is not such a terrible thing. Also, I'll be able to support orthogonal indexing with xray/h5netcdf using dask as an intermediate layer.
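To make the distinction in this exchange concrete, the following NumPy-only sketch contrasts numpy-style broadcasting indexing with orthogonal (outer) indexing on an in-memory array. The array and index lists are made up for illustration; it says nothing about how h5py or netCDF4-python implement indexing.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
rows = [0, 2]
cols = [1, 3]

# NumPy-style broadcasting indexing pairs the index arrays
# element-wise: it selects a[0, 1] and a[2, 3].
pointwise = a[rows, cols]

# Orthogonal (outer) indexing selects every combination of the
# given rows and columns; np.ix_ builds the open mesh for that.
outer = a[np.ix_(rows, cols)]

print(pointwise)  # [ 1 11]
print(outer)      # [[ 1  3]
                  #  [ 9 11]]
```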
Good work @shoyer, however I'm intrigued when you say its speed is similar to netCDF4-python. From my experience h5py is at least 3 times faster than netCDF4-python. Other competitors (like Nio and Scientific) are also generally faster; on some occasions Scientific.IO.NetCDF was up to 10 times faster for me.
@gamaanderson Interesting. I'm sure it depends on lots of specifics about your configuration and workflow. For example, I found that If you'd like to give h5netcdf a try, I would be interested to see the performance numbers on your workflows. I recently added support for fill value, so at least for basic usage (I haven't quite figured out how unlimited dimensions are stored in netCDF4 yet) it should be directly interchangeable with netCDF4-python -- I actually have tests to verify that you could do something like
Couldn’t get it to work. Anyway, could you say which software versions you are using?
@gamaanderson I've been developing h5netcdf on h5py 2.4 and 2.5. You'll definitely need at least h5py 2.1, because that's the first release that included support for dimension scales (which are central to the netCDF4 data model). Looks like
Well, I did get it to run; it reduced the time from 19 s to 5 s. However, I had more problems than expected:
Unfortunately I will not be able to use it in my main projects; it's a cooperative work and some people insist on using the "official" lib. But it does corroborate my opinion that netCDF needs to improve its performance.
@gamaanderson This feedback is super helpful, thanks!
I'm pleased to hear that you found h5netcdf to be so much faster for your use case. If you can't share your benchmark script, could you at least roughly summarize what it's doing? I would like to be able to reproduce these benchmarks myself...
The project I used for testing is (https://github.com/ARM-DOE/pyart); in particular, I just changed the function About number 1, I personally prefer to use .attrs[] to separate what comes from the file from what is Python, but yes, some people are using direct attributes. About number 3, I also don't think it's important; "".join() works just as well.
I should also say that it reads a NetCDF file following the CfRadial convention entirely into memory, into a more practical structure. The CfRadial convention is for meteorological radar data in its original spherical coordinates. A typical file has the following header
This is not exactly a netCDF4-python issue (so feel free to close), but I thought users of this repo might be interested to test out my latest project, h5netcdf, an alternative interface for reading/writing netCDF4 as HDF5 files directly via h5py.
Feedback would be greatly appreciated!
My initial performance tests suggest that it generally has very similar performance to netCDF4-python, except for multi-threaded writes to a single file, for which it is about twice as fast (I tested against v1.1.6). I haven't tested compression yet.