
Develop prototype code for applying "fixes" to decadal data #98

Closed
agstephens opened this issue Apr 28, 2021 · 11 comments

@agstephens
Contributor

agstephens commented Apr 28, 2021

We have an example file to start developing the required "fixes" for the decadal data to be added to the CMIP6 holdings for C3S.

The original NetCDF (pre-fixes) file is here:

https://github.com/cp4cds/c3s34g_master/blob/master/Decadal/ESGF-original/tas_Amon_EC-Earth3_dcppA-hindcast_s1986-r1i1p1f1_gr_198611-198710.nc?raw=true

The modified NetCDF (post-fixes) file is here:

https://github.com/cp4cds/c3s34g_master/blob/master/Decadal/tas_Amon_EC-Earth3_dcppA-hindcast_s198611-r1i1p1f1_gr_198611-198710.nc?raw=true

The initial task is to write some xarray code that will apply the changes (using the first file), and then write an equivalent to the second file. Once we are happy with the approach, we can embed it into dachar and daops.

Rules to create compliant format for files


  • change time:long_name value to "valid_time"
  • add reftime variable:
    • from global attribute startdate (e.g.: :startdate = "s198611" ;)
    • assume ref day and hour: (day=1, hour=0) - but later we will be given defaults to use per model
  • add leadtime variable
  • add global attributes - we will be given default values for these:
    • :forcing_description = "Free text describing the forcings" ; <-- needs value from PAB
    • :physics_description = "Free text describing the physics method" ; <-- needs value from PAB
    • :initialization_description = "Free text describing the initialization method" ; <-- needs value from PAB
  • check that these global attrs already exist in the input data:
    • startdate
    • sub_experiment_id
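A minimal xarray sketch of the rules above (the function name, the "TBD" placeholders, and the attribute text are assumptions; the real implementation will live in daops/dachar):

```python
import numpy as np
import pandas as pd
import xarray as xr

def apply_decadal_fixes(ds, defaults=None):
    """Sketch of the decadal 'fixes' listed above. Mutates/extends ds."""
    # Check that the required global attrs already exist in the input data
    for attr in ("startdate", "sub_experiment_id"):
        if attr not in ds.attrs:
            raise KeyError(f"missing required global attribute: {attr}")

    # 1. change time:long_name value to "valid_time"
    ds["time"].attrs["long_name"] = "valid_time"

    # 2. add reftime from the startdate attribute (sYYYYMM),
    #    assuming day=1, hour=0 until per-model defaults are supplied
    sd = ds.attrs["startdate"]
    reftime = pd.Timestamp(year=int(sd[1:5]), month=int(sd[5:7]), day=1, hour=0)
    ds = ds.assign_coords(reftime=reftime.to_datetime64())
    ds["reftime"].attrs["long_name"] = "Start date of the forecast"

    # 3. add leadtime = time - reftime, expressed in days
    lead = (ds["time"] - ds["reftime"]) / np.timedelta64(1, "D")
    ds = ds.assign_coords(leadtime=("time", lead.data.astype("float64")))
    ds["leadtime"].attrs["units"] = "days"

    # 4. add global attributes (placeholder text until values arrive from PAB)
    for attr in ("forcing_description", "physics_description",
                 "initialization_description"):
        ds.attrs.setdefault(attr, (defaults or {}).get(attr, "TBD"))
    return ds
```

This only demonstrates the shape of the transformation; dtype and encoding details when writing back to NetCDF are handled separately.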
@agstephens agstephens changed the title Begin discussions about decadal data and roocs reqs Develop prototype code for applying "fixes" to decadal data May 7, 2021
@ellesmith88
Contributor

ellesmith88 commented May 12, 2021

Here's what I've written for this so far: https://gist.github.com/ellesmith88/532e6395afe1b53567dd564b15696d0e

It's written as it would be in daops and dachar - with the generic functions and then dictionaries containing the info for each fix - so it's quite long at the moment.

A few notes:

  • I've also added a fix to add the realization variable - this was in the modified file, but not the original. I'm not sure if this is needed as a fix.
  • The sub_experiment_id already exists in the original dataset but is different to the modified one. Currently my code doesn't change it because it already exists.
  • The leadtime values are calculated by adding 30 days to the first calculated leadtime. This differs slightly from calculating each leadtime from the reference time, which I think would make more sense. For now, I've left it as 30-day increments.
  • startdate is different to reftime - startdate is s198611 in modified file and reftime is 1986-10-31.
  • getting startdate to use as reftime (function get_start_date) expects startdate to be in the format sYYYYMM if it exists and relies on a default, from which it parses day, hour etc. Is this ok?
  • RESOLVED: The types for reftime and leadtime seem to be the same when I look at both files as xarray datasets, but when I convert my fixed dataset to netcdf, the types aren't the same as in the modified netcdf file:
    • reftime should be int but in my file it is int 64
    • leadtime should be double but in my file it is int 64
  • In these 2 cases: converting the modified netcdf file from above to xarray dataset and then back to netcdf and converting my fixed xarray dataset to a netcdf file, the time_bnds units disappear.

I'll have another look at the last 2 points.
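A hypothetical re-implementation of the get_start_date behaviour described above - parse sYYYYMM when present, otherwise fall back to a default, with day and hour filled in from the assumed defaults (day=1, hour=0):

```python
import re

def get_start_date(global_attrs, default="s196011"):
    """Return (year, month, day, hour) parsed from a sYYYYMM startdate.

    Falls back to `default` when the attribute is absent; the default
    value here is illustrative, not the real per-model default.
    """
    startdate = global_attrs.get("startdate", default)
    match = re.fullmatch(r"s(\d{4})(\d{2})", startdate)
    if match is None:
        raise ValueError(f"unexpected startdate format: {startdate!r}")
    year, month = int(match.group(1)), int(match.group(2))
    return year, month, 1, 0  # day=1, hour=0 assumed until per-model defaults exist
```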

additional point:

  • Outputting from xarray adds a coordinates attribute to time_bnds, lon_bnds, lat_bnds and realization; this doesn't exist in the modified file provided. e.g.
    Before:
double lat_bnds(lat, bnds) ;

After:

double lat_bnds(lat, bnds) ;
                lat_bnds:_FillValue = NaN ;
                lat_bnds:coordinates = "height reftime" ;

I've tried overwriting this in ds.lat_bnds.attrs["coordinates"] and ds.lat_bnds.encoding["coordinates"] as we have done with fill values, but I can't seem to get rid of it.

@ellesmith88
Contributor

Updated here: https://gist.github.com/ellesmith88/532e6395afe1b53567dd564b15696d0e
Using encoding when outputting to netCDF can change the data types to match the modified file provided, so it would have to be passed through to clisops. I can't see how else to change this yet.

Looks like the dropping of the time_bnds units is something that always happens when using xarray, but the units still exist in the xarray dataset under ds.time_bnds.encoding["units"], so I'm not sure why they aren't passed through to netCDF, or if it's important.
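A sketch of forcing the on-disk dtypes via the encoding argument to to_netcdf (variable names and target dtypes follow the discussion above; the output filename is illustrative):

```python
# Hypothetical encoding mapping: xarray's to_netcdf applies these
# per-variable settings at write time, overriding the in-memory dtypes.
encoding = {
    "reftime": {"dtype": "int32"},     # "int" in the modified file
    "leadtime": {"dtype": "float64"},  # "double" in the modified file
}
# ds.to_netcdf("tas_fixed.nc", encoding=encoding)
```

As noted, this mapping would need to be passed all the way through to the clisops write path.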

@agstephens
Contributor Author

agstephens commented May 28, 2021

Questions to BSC:

  • we added a fix to add the realization variable - this was in the modified file, but not the original. Is this required as a fix? YES
  • The sub_experiment_id already exists in the original dataset but is different to the modified one. What is the required behaviour? Yes, the new sub_experiment_id has the month in addition to the year.
  • We propose that the leadtime values should be calculated from the reftime + the valid time values. But in the example file, they are calculated by adding 30 day intervals to the previous leadtime. What is the correct way for us to calculate the leadtime? Yes. You can compute leadtime as reftime + time.
  • In the example file, the startdate global attribute is different to the reftime value. (startdate="s198611" and reftime = "1986-10-31"). How should we determine the correct value for the startdate? The reftime is start year (taken from the file name) - 11-01:00
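The confirmed calculation (leadtime = time - reftime) and the original file's 30-day increments diverge once months longer than 30 days are crossed; a small illustration using made-up mid-month valid times:

```python
import pandas as pd

# Illustrative mid-month valid times for a monthly file starting 1986-11
times = pd.to_datetime(["1986-11-16", "1986-12-16", "1987-01-16", "1987-02-15"])
reftime = pd.Timestamp("1986-11-01")

# Confirmed approach: leadtime = time - reftime, in days
leadtime = ((times - reftime) / pd.Timedelta(days=1)).tolist()
# -> [15.0, 45.0, 76.0, 106.0]

# Original file's approach: fixed 30-day increments from the first leadtime
incremental = [15.0 + 30 * i for i in range(4)]
# -> [15.0, 45.0, 75.0, 105.0] - drifts from the reftime-based values
```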

@agstephens
Contributor Author

Note to @agstephens and @ellesmith88: liaise with Piotr from the Met Office when we have implemented the fixes.

@pabretonniere

Hi @agstephens

we added a fix to add the realization variable - this was in the modified file, but not the original. Is this required as a fix?

Yes. Exactly.

The sub_experiment_id already exists in the original dataset but is different to the modified one. What is the required behaviour?

Yes, the new sub_experiment_id has the month in addition to the year.

We propose that the leadtime values should be calculated from the reftime + the valid time values. But in the example file, they are calculated by adding 30 day intervals to the previous leadtime. What is the correct way for us to calculate the leadtime?

Yes. You can compute leadtime as reftime + time.

In the example file, the startdate global attribute is different to the reftime value. (startdate="s198611" and reftime = "1986-10-31"). How should we determine the correct value for the startdate?

The reftime is start year (taken from the file name) - 11-01:00

@ellesmith88
Contributor

ellesmith88 commented Jun 24, 2021

@agstephens
Contributor Author

agstephens commented Aug 9, 2021

Release plan:

  • Check that xarray fix for coordinates has been merged into a release (which we are using). - looks good, test: v0.19.0 (which is the latest).
  • Check that we have a production elasticsearch index describing the required fixes (on a single dataset)
    • Make sure the index is backed up properly
    • DKRZ can have a local elasticsearch index - copied from public CEDA elasticsearch service. DKRZ service does not need to be public (could even just use a JSON store of any type).
    • Create elasticsearch indexes for "c3s", as opposed to a general "roocs" set? Update production systems to use them (ansible playbooks). See: https://github.com/roocs/roocs-utils/blob/master/roocs_utils/etc/roocs.ini
    • AS: test if I can create indexes without any assistance from RS.
  • Test the fix being applied properly in test and production environments.
  • Gather a full list of all datasets that are affected.
  • On production service:
    • Set apply_fixes=True as default
    • Ensure that the only fixes in the FixStore are the decadal fixes
  • Work out how to publish fixes for all datasets - I think Elle has created a script for this.
  • IMPORTANT: we need to tell ECMWF about all the possible requests and responses. We need a clear decision/rule on whether the user can get original files if a fix is listed. ANSWER IS: "NO!" - you must apply the fix. See director code: https://github.com/roocs/rook/blob/master/rook/director/director.py#L57

@agstephens
Contributor Author

agstephens commented Sep 3, 2021

Working notes...

Update all versions of packages

cd roocs
. ./setup-env.sh
# Update each repo on master (the original loop never entered the repo dirs)
for repo in roocs-utils dachar clisops daops rook rooki; do
    (cd $repo && git checkout master && git pull)
done

Writing to elasticsearch

Internal link to Kibana: https://kibana-ror-master.130.246.131.9.nip.io/app/dev_tools#/console

Get the API key and write to: tests/_common.py as:

[dachar:settings]
elastic_api_token = .....

Then run these tests...

$ python -m pytest tests/test_utils/test_json_store_elasticsearch.py

It is possible to see the changes happening in Kibana by polling the following (which shows the index being created, populated and then deleted):

GET roocs-char-test/_search 
{
  "query": {
    "match_all": {}
  }
}

There is no need to create an index, it happens automatically.

Read access from outside the firewall

I have updated the settings in Kibana so that indexes will be available outside the firewall if they are named "c3s*".

Test the fix

Here is how to test the new decadal fixes...

cd ../dachar/
git checkout decadal_fixes
cd ../daops/
git checkout decadal_fixes

Temporary patch (to my version of Intake) - for decoding issues using Dask DataFrame - note that JV had this issue on the CDS and version fixes resolved it. My patch just adds the middle line:

        if self.pattern is None:
            if urlpath.endswith(".gz"): self._csv_kwargs["compression"] = "gzip"
            self._dataframe = dask.dataframe.read_csv(

We can resolve it later.

Then the tests run:

cd ../daops/
git checkout decadal_fixes
python -m pytest tests
============================================== 103 passed, 3 skipped, 1 warning in 59.20s ==============================================

and

cd ../dachar/
git checkout decadal_fixes
python -m pytest tests
============================================== 103 passed, 3 skipped, 1 warning in 59.20s ==============================================

We'll need to run the dachar script:

$ dachar propose-fixes --help
usage: dachar propose-fixes [-h] [-f FILES] [-d DATASET_LIST] [-t TEMPLATE]

optional arguments:
  -h, --help            show this help message and exit
  -f FILES, --files FILES
                        List of comma-separated json files containing
                        information to generate fix proposals. This option
                        must be used on its own
  -d DATASET_LIST, --dataset-list DATASET_LIST
                        Text file containing dataset ids for which to propose
                        the fix provided in the template. If using this option
                        you must provide a template using --template (-t)
                        option.
  -t TEMPLATE, --template TEMPLATE
                        Template for fix proposal. If using this option you
                        must provide a list of dataset ids using the
                        --dataset-list (-d) option.

@agstephens
Contributor Author

agstephens commented Oct 13, 2021

After testing the fixes in daops, here are some issues to address:

  • leadtime should be float64 and not timedelta64 (and keep units=days)
  • global attr forcing_description - needs valid content
  • global attr physics_description - needs valid content
  • global attr initialization_description - needs valid content

@agstephens
Contributor Author

model_specific_global_attrs = {
    "CMCC-CM2-SR5": {
        "forcing_description": "f1, CMIP6 historical forcings",
        "physics_description": "physics from the standard model configuration, with no additional tuning or different parametrization",
        "initialization_description": "hindcast initialized based on observations and using historical forcing"
    },
    "EC-Earth3": {
        "forcing_description": "f1, CMIP6 historical forcings",
        "physics_description": "physics from the standard model configuration, with no additional tuning or different parametrization",
        "initialization_description": "Atmosphere initialization based on full-fields from ERA-Interim (s1979-s2018) or ERA-40 (s1960-s1978); ocean/sea-ice initialization based on full-fields from NEMO/LIM assimilation run nudged towards ORA-S4 (s1960-s2018)"
    },
    "HadGEM3-GC31-MM": {
        "forcing_description": "f2, CMIP6 v6.2.0 forcings; no ozone remapping",
        "physics_description": "physics from the standard model configuration, with no additional tuning or different parametrization",
        "initialization_description": "hindcast initialized based on observations and using historical forcing"
    },
    "MPI-ESM1-2-HR": {
        "forcing_description": "f1, CMIP6 historical forcings",
        "physics_description": "physics from the standard model configuration, with no additional tuning or different parametrization",
        "initialization_description": "hindcast initialized based on observations and using historical forcing"
    }
}
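A hypothetical helper for merging the table above into a dataset's global attributes; it assumes the model name is held in the CMIP6 source_id attribute and operates on a plain attrs dict:

```python
def add_model_global_attrs(attrs, table):
    """Merge per-model default global attributes into attrs (in place).

    `table` is a mapping like model_specific_global_attrs above,
    keyed by model name (assumed to match the source_id attribute).
    """
    model = attrs.get("source_id")
    if model not in table:
        raise KeyError(f"no default global attributes for model: {model}")
    for key, value in table[model].items():
        attrs.setdefault(key, value)  # don't overwrite values already present
    return attrs
```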

@ellesmith88
Contributor

Closed via roocs/daops#87 and roocs/dachar#93.
