
Problems with ATLAS datasets #317

Closed

sol1105 opened this issue Jan 10, 2024 · 8 comments

sol1105 (Contributor) commented Jan 10, 2024

  • clisops version: 0.12.2
  • Python version: -
  • Operating System: -

Description

The ATLAS datasets are aggregated CMIP5/6 or CORDEX datasets that have been remapped to a regular grid and contain data from multiple sources in a single data file (arranged along a new dimension, member): Link1 Link2. It is planned that clisops will support processing of these datasets in the future. First tests show the following problems:

  • Multiple fill values are defined for the data variable. This is already supported via "Fix issue 308 fillvalue" #309.
  • Processing of the data seems to work; however, writing the result to disk is not possible, because:
    • The DRS (folder structure and file name structure) deviates from the CMIP and CORDEX specifications, so the clisops filenamer has to be updated.
    • Even the "simple" filenamer cannot write the processed output to disk, due to a netCDF error caused by the deflate settings of the string/character variables in the ATLAS datasets.

What I Did

import xarray as xr

ds = xr.open_dataset("sst_CMIP6_ssp245_mon_201501-210012.nc")
ds.sst.encoding["_FillValue"] = None  # required, since multiple fill values are defined for the data variable
ds.to_netcdf("atlas.nc")

This fails with:

RuntimeError                              Traceback (most recent call last)
---->  ds.to_netcdf("atlas.nc")

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/core/dataset.py:1911, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1908     encoding = {}
   1909 from xarray.backends.api import to_netcdf
-> 1911 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   1912     self,
   1913     path,
   1914     mode=mode,
   1915     format=format,
   1916     group=group,
   1917     engine=engine,
   1918     encoding=encoding,
   1919     unlimited_dims=unlimited_dims,
   1920     compute=compute,
   1921     multifile=False,
   1922     invalid_netcdf=invalid_netcdf,
   1923 )

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/api.py:1217, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1212 # TODO: figure out how to refactor this logic (here and in save_mfdataset)
   1213 # to avoid this mess of conditionals
   1214 try:
   1215     # TODO: allow this work (setting up the file for writing array data)
   1216     # to be parallelized with dask
-> 1217     dump_to_store(
   1218         dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1219     )
   1220     if autoclose:
   1221         store.close()

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/api.py:1264, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1261 if encoder:
   1262     variables, attrs = encoder(variables, attrs)
-> 1264 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/common.py:271, in AbstractWritableDataStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    269 self.set_attributes(attributes)
    270 self.set_dimensions(variables, unlimited_dims=unlimited_dims)
--> 271 self.set_variables(
    272     variables, check_encoding_set, writer, unlimited_dims=unlimited_dims
    273 )

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/common.py:309, in AbstractWritableDataStore.set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    307 name = _encode_variable_name(vn)
    308 check = vn in check_encoding_set
--> 309 target, source = self.prepare_variable(
    310     name, v, check, unlimited_dims=unlimited_dims
    311 )
    313 writer.add(source, target)

File ~/anaconda3/envs/clisopsnew/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:488, in NetCDF4DataStore.prepare_variable(self, name, variable, check_encoding, unlimited_dims)
    486     nc4_var = self.ds.variables[name]
    487 else:
--> 488     nc4_var = self.ds.createVariable(
    489         varname=name,
    490         datatype=datatype,
    491         dimensions=variable.dims,
    492         zlib=encoding.get("zlib", False),
    493         complevel=encoding.get("complevel", 4),
    494         shuffle=encoding.get("shuffle", True),
    495         fletcher32=encoding.get("fletcher32", False),
    496         contiguous=encoding.get("contiguous", False),
    497         chunksizes=encoding.get("chunksizes"),
    498         endian="native",
    499         least_significant_digit=encoding.get("least_significant_digit"),
    500         fill_value=fill_value,
    501     )
    503 nc4_var.setncatts(attrs)
    505 target = NetCDF4ArrayWrapper(name, self)

File src/netCDF4/_netCDF4.pyx:2962, in netCDF4._netCDF4.Dataset.createVariable()

File src/netCDF4/_netCDF4.pyx:4202, in netCDF4._netCDF4.Variable.__init__()

File src/netCDF4/_netCDF4.pyx:2029, in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Filter error: bad id or parameters or duplicate filter

It works when overwriting the encoding settings of the character/string variables introduced in the ATLAS datasets:

import xarray as xr

ds = xr.open_dataset("/sst_CMIP6_ssp245_mon_201501-210012.nc")
ds.sst.encoding["_FillValue"] = None  # work around the multiple fill values
# Remove the deflate-related encoding settings from the string/character variables
for cvar in ["member_id", "gcm_variant", "gcm_model", "gcm_institution"]:
    for en in ["zlib", "shuffle", "complevel"]:
        del ds[cvar].encoding[en]
ds.to_netcdf("atlas.nc")
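
Presumably the same can be achieved without touching the stored encoding, by overriding it at write time via the encoding argument of to_netcdf (an untested sketch of the same workaround):

ds.to_netcdf(
    "atlas.nc",
    encoding={
        cvar: {"zlib": False, "shuffle": False}
        for cvar in ["member_id", "gcm_variant", "gcm_model", "gcm_institution"]
    },
)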
cehbrecht (Collaborator) commented

@sol1105 Thanks for looking at the ATLAS issue :) Does that mean we can support subsetting ATLAS once we add an "atlas-fix"?

sol1105 (Contributor, Author) commented Jan 11, 2024

@cehbrecht Yes, I think so. To be sure, I would add further ATLAS test datasets (ATLAS CORDEX, CMIP5) and tests for the subset, regrid (and, if applicable, average) operators when we implement this fix.

I suggest a general fix in clisops:
When reading in datasets, check for string/character variables and, if present, remove any deflation options as in my post above (see the sketch below), unless we gain further insight into what causes this issue (the netCDF handling in xarray, netcdf-c itself, or compressing character variables in general).
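
A minimal sketch of such a fix (hedged: the helper name and the dtype-based detection are illustrative, not existing clisops API):

import xarray as xr

def strip_string_var_deflate(ds: xr.Dataset) -> xr.Dataset:
    # Hypothetical helper: drop deflate-related encoding settings from all
    # string/character variables so the dataset can be written with netCDF4.
    for var in ds.variables.values():
        if var.dtype.kind in ("S", "U", "O"):  # bytes / unicode / object (string) dtypes
            for key in ("zlib", "shuffle", "complevel"):
                var.encoding.pop(key, None)
    return ds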

Should we raise this as an issue for netcdf (or possibly xarray) as well? In a v2 of ATLAS, the fill value and deflate problems should perhaps be addressed. Can you inform them about these issues?

Edit: cdo also cannot open these files without problems, since the member dimension is the first dimension, which cdo does not support (it expects time as the first dimension).
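
A possible workaround for the cdo limitation (a hedged sketch, not verified against cdo) would be to reorder the dimensions with xarray before writing, so that time comes first:

import xarray as xr

ds = xr.open_dataset("sst_CMIP6_ssp245_mon_201501-210012.nc")
ds = ds.transpose("time", ...)  # move time to the front, keep the relative order of the other dims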

sol1105 (Contributor, Author) commented Jan 12, 2024

I found some more information on this problem in the netcdf4-python repo:
Unidata/netcdf4-python#1205
The problem has apparently been fixed in netcdf-c (main branch) and will be part of the upcoming 4.9.3 release:
Unidata/netcdf-c#2716

cofinoa commented Jan 16, 2024

Hi, you are faced with two viable alternatives:

  1. Use netcdf-c version <4.9 to write the new files containing the subsetted data. Note that although no error is raised, the filters are silently not applied.
  2. Use netcdf-c version >=4.9, with the caveat that all filters have to be removed from String-datatype variables to circumvent the errors.

The ATLAS v1 dataset was created with netcdflib version 4.4.1.1 and hdf5lib version 1.10.1, a deliberate decision aimed at keeping the format readable with as many other library versions as possible.

sol1105 (Contributor, Author) commented Jan 17, 2024

@cofinoa Thanks for your reply. The netcdf-c PR I referenced above suggests that the string variables in the ATLAS datasets will not be readable with future netcdf-c releases:

The problem has apparently been fixed in netcdf-c (main branch) and will be part of an upcoming 4.9.3 release:
Unidata/netcdf-c#2716

So our planned xarray fix becomes useless with future netcdf-c releases. The file metadata themselves would have to be altered so that the files remain fully readable, independent of the netcdf-c version.

cofinoa commented Jan 17, 2024

@sol1105 The PR Unidata/netcdf-c#2716 only makes VL-datatype datasets/variables "unreadable" if their filters are non-optional.

Therefore, the ATLAS v1 dataset will remain readable with the next release: the filters applied to its String variables are optional, so the next netcdf-c will still be able to read them.

The netCDF library has a strong principle of keeping any data generated by previous library versions readable, for curation purposes.

What was problematic was Unidata/netcdf-c#2231 in netcdf-c version >=4.9: code that writes VL datatypes with filters used to have those filters silently ignored and not applied, but since that PR an error is raised instead, which breaks such code.

With PR Unidata/netcdf-c#2716, the next release will raise an error only if a non-optional filter is set when VL-datatype data is written; an optional filter will be ignored, with just a warning to the user that the filter is not applied.

That said, I will test ATLAS v1 with the next netcdf-c release.

For the xarray fix, a third option would be to use the next netcdf-c release to write the subsetted data.

cofinoa commented Jan 31, 2024

Update:
Just to confirm that there is no issue with the latest development version of netcdf-c 4.9.3:

netcdf library version 4.9.3-development of Jan 30 2024 17:53:03 $

We need to wait for the 4.9.3 release, but my conclusion is to avoid netcdf-c versions >=4.9 and <4.9.3, because those versions break existing code that worked with earlier versions (<4.9).
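
For clisops, a guard along these lines could implement that recommendation (a hedged sketch; netCDF4.__netcdf4libversion__ is the netcdf4-python attribute reporting the linked netcdf-c version, while the surrounding logic is illustrative):

import netCDF4
from packaging.version import Version

# Version of the netcdf-c library that netcdf4-python was built against,
# e.g. "4.9.2" or "4.9.3-development".
libver = Version(netCDF4.__netcdf4libversion__.split("-")[0])

# netcdf-c >=4.9 and <4.9.3 raises "NetCDF: Filter error" when writing
# filtered string/VL variables, so strip their deflate settings beforehand.
needs_workaround = Version("4.9.0") <= libver < Version("4.9.3")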

sol1105 (Contributor, Author) commented Feb 14, 2024

I think this can be closed with #319 and roocs/roocs-utils#111 / roocs/roocs-utils#113.
More info on the introduced changes/fixes can be found there.

In general, however, the issues discussed above (multiple fill values defined for the data variable, deflate settings on the string/character variables, and member being the first dimension) should be addressed for future versions of the ATLAS datasets, since they may also affect compatibility with other tools.

sol1105 closed this as completed Feb 14, 2024