Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss best practices for xr.Dataset.to_netcdf() #28

Open
aaronspring opened this issue Mar 7, 2022 · 3 comments
Open

Discuss best practices for xr.Dataset.to_netcdf() #28

aaronspring opened this issue Mar 7, 2022 · 3 comments

Comments

@aaronspring
Copy link

aaronspring commented Mar 7, 2022

I want to use BitInformation and BitRound on NetCDF output. Anyone with experience with xarray?

What I do: see also #25 (comment)

My routine:

  1. run plot_bitinformation.jl and save keepbits between 0 and 23 to a dict with json
  2. use code from https://github.com/zarr-developers/numcodecs/pull/299/files to apply BitRound for each variable
  3. ds_rounded.to_netcdf(path, encoding, engine='h5netcdf')

I use engine="h5netcdf" because I can use {"compression": True} in encoding. I do this because some compression is needed for BitRound to be effective on disk storage. (I have only very little knowledge of compression and netcdf4 basics.) But probably there is a better way. What I do not want is that the compression impacts the read performance too much, i.e. reading the compressed data should not take much more (subjective I know) time that compared to before (or is this the tradeoff anyways).

https://xarray.pydata.org/en/latest/generated/xarray.Dataset.to_netcdf.html

# https://github.com/zarr-developers/numcodecs/pull/299/files
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_ndarray, ndarray_copy

class BitRound(Codec):
    codec_id = 'bitround'

    def __init__(self, keepbits: int):
        if (keepbits < 0) or (keepbits > 23):
            raise ValueError("keepbits must be between 0 and 23")
        self.keepbits = keepbits

    def encode(self, buf):
        if self.keepbits == 23:
            return buf
        # TODO: figure out if we need to make a copy
        # Currently this appears to be overwriting the input buffer
        # Is that the right behavior?
        a = ensure_ndarray(buf).view()
        assert a.dtype == np.float32
        b = a.view(dtype=np.int32)
        maskbits = 23 - self.keepbits
        mask = (0xFFFFFFFF >> maskbits) << maskbits
        half_quantum1 = (1 << (maskbits - 1)) - 1
        b += ((b >> maskbits) & 1) + half_quantum1
        b &= mask
        return b

    def decode(self, buf, out=None):
        data = ensure_ndarray(buf).view(np.float32)
        out = ndarray_copy(data, out)
        return out

def bitround(data, keepbits):
    codec = BitRound(keepbits=keepbits)
    data = data.copy()  # otherwise overwrites the input
    encoded = codec.encode(data)
    return codec.decode(encoded)


# my code below
import xarray as xr
import sys
print('Require three args: python BitRound_file.py ifile.nc keepbits.json ofile')
assert len(sys.argv) == 3+1
_, ifile, keepbits_json, ofile = sys.argv

import json
print("keepbits_json",keepbits_json)
keepbits_loopup = json.load(open(keepbits_json))

ds = xr.open_dataset(ifile)
ds_round = ds.copy()
for v in ds.data_vars:
    if v in keepbits_loopup.keys():
        keepbits = keepbits_loopup[v]
        # here corrections could be applied
        # some hard fixes maybe
        # if keepbits < 2: 
        #     keepbits = 5
        ds_round[v].values = bitround(ds[v].values,keepbits)

ds_round=ds_round.assign_coords(ds.coords)

encoding={v:{'compression':True} for v in ds.data_vars if v in keepbits_loopup.keys()}
print('BitRound_file.py writes ', ofile)
ds_round.to_netcdf(ofile, mode='w', engine='h5netcdf', encoding=encoding)

EDIT:

Result:

>>> ds_rounded[v].encoding
{'zlib': True, 'shuffle': False, 'complevel': 4, 'fletcher32': False, 'contiguous': False, 'chunksizes': (1, 5, 28, 64), 'source': '/work/bm1124/m300524/experiments/asp_esmControl_PMassim_1850_ATMsfn_over_2006/outdata/hamocc/asp_esmControl_PMassim_1850_ATMsfn_over_2006_hamocc_data_3d_ym_18590101_18591231_comp99.nc', 'original_shape': (1, 40, 220, 256), 'dtype': dtype('float32'), '_FillValue': nan, 'coordinates': 'lon lat'}
@milankl
Copy link
Owner

milankl commented Mar 8, 2022

What I do not want is that the compression impacts the read performance too much, i.e. reading the compressed data should not take much more (subjective I know) time that compared to before (or is this the tradeoff anyways).

Many lossless compressors have a decompression speed that's approximately independent of the compression level. So read performance is rarely something to worry about with this method. E.g.
image
(from https://www.nature.com/articles/s43588-021-00156-2)

@aaronspring
Copy link
Author

compress and save with xarray in xbitinfo: https://xbitinfo.readthedocs.io/en/latest/api.html#save-compressed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants