Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Tgav Stub Component #61

Open
wants to merge 18 commits into
base: master
Choose a base branch
from
Open

Add Tgav Stub Component #61

wants to merge 18 commits into from

Conversation

crvernon
Copy link
Member

@crvernon crvernon commented Jun 12, 2020

Desire was to create a Tgav stub component to facilitate extracting data for a specific configuration (e.g., scenario, etc.) from ESM runs that were used to train the fldgen emulator.

Per @abigailsnyder:

"If ESM runs are all ISIMIP data, the quantity with the variable name tgav is actually land average temperature and not global average temperature as the name suggests."
"A compiled Rmarkdown notebook doing this extraction and labeling for a trained emulator from /pic/projects/GCAM/GE/drought-expt/fldgen-emulators
extracting-tgav-from-trained-emu.zip"

The component named TgavStubComponent was added to Cassandra and provides the following capabilities:

  • Tgav: A data frame containing global mean temperature with fields ['year', 'scenario', 'variable', 'value', 'units']; where value is Tgav and type is a Pandas DataFrame

  • tgav_metadata: A dictionary of metadata for the Tgav capability including: [rds_file, scenario, climate_var_name, source_climate_data, units, count, mean, median, min, max, std, na_count, null_count, all_finite] designations; type dict

The configuration block for the TgavStubComponent is the following:

[TgavStubComponent]
rds_file = <full path with filename and extension to the target RDS file>
climate_var_name = tasAdjust
scenario = rcp26
units = Kelvin
start_year = 1861
through_year = 2099

This component conducts validation at run time as well as being equipped with a robust test suite. All tests for this component are now passing locally. The metadata output is a summary of both the Tgav configuration as well as the data itself. The following is logged from the tgav_metadata capability:

INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: rds_file==/Users/d3y010/projects/cassandra/models/fldgen/fldgen-IPSL-CM5A-LR.rds
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: scenario==rcp26
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: climate_var_name==tasAdjust
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: source_climate_data==./training-data/tasAdjust_annual_IPSL-CM5A-LR_rcp26_18610101-20991231.nc
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: units==Kelvin
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: count==239
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: mean==286.8046116940164
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: median==286.30762280534697
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: min==284.78008340211915
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: max==288.6382439866686
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: std==1.182328423261446
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: na_count==0
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: null_count==0
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: all_finite==True

There are currently no constraints on these reported factors though they can be added easily if so desired. @kdorheim provided a summary of over 50 CMIP5 runs that we could use as min and max constraints from Taylor (2012) and stored here /pic/projects/GCAM/Dorheim/hectorcal/data-raw/CMIP5_annual_global_average.csv:

variable unit experiment min max mean
tas K historical 284.3061 289.1264 286.7314

Please also consider constraints for finite, no data, and NaN if they were to be present in the data.

The following full run for this configuration was completed successfully:

[Global]
#  Location of the jar file for the ModelInterface code, used to query GCAM outputs.
ModelInterface = /Users/d3y010/projects/cassandra/data/ModelInterface.jar

# Location of the DBXML libraries used by older versions of the ModelInterface code.
DBXMLlib = /Users/d3y010/projects/cassandra/data/lib

# Directory containing general input files.  (OPTIONAL - default is './input-data').
# Relative paths will be interpreted relative to the working directory
# (even if they don't begin with './')
inputdir = ./input-data
rgnconfig = rgn32

[TgavStubComponent]
rds_file = /Users/d3y010/projects/cassandra/models/fldgen/fldgen-IPSL-CM5A-LR.rds
climate_var_name = tasAdjust
scenario = rcp26
units = Kelvin
start_year = 1861
through_year = 2099

[XanthosComponent]
config_file = /Users/d3y010/projects/cassandra/models/xanthos/trn_abcd_IPSL-CM5A-LR.ini
OutputNameStr = trn_abcd_IPSL-CM5A-LR_rcp26_0_alt
ProjectName = trn_abcd_IPSL-CM5A-LR_rcp26_0_alt
mp.weight = 8.0

[FldgenComponent]
loadpkgs = False
pkgdir = .
emulator = /Users/d3y010/projects/cassandra/models/fldgen/fldgen-IPSL-CM5A-LR.rds
ngrids = 2
scenario = rcp26
a2mfrac = alpha_ipsl_cm5a_lr
startyr = 1861
nyear = 239
RNGseed = -197043577
mp.weight = 10.0

With a logfile of:

DEBUG:root:General parameters as input:
DEBUG:root:{}
INFO:root:running <class 'cassandra.components.GlobalParamsComponent'>
DEBUG:root:starting <class 'cassandra.components.GlobalParamsComponent'>
INFO:root:[None]: default path= None  filename= /Users/d3y010/projects/cassandra/data/ModelInterface.jar
INFO:root:[None]: default path= None  filename= /Users/d3y010/projects/cassandra/data/lib
INFO:root:running <class 'cassandra.components.TgavStubComponent'>
INFO:root:[None]: default path= /Users/d3y010/repos/github/cassandra/cassandra  filename= ./input-data
DEBUG:root:starting <class 'cassandra.components.TgavStubComponent'>
INFO:root:[None]: default path= /Users/d3y010/repos/github/cassandra/cassandra/input-data  filename= rgn32
INFO:root:running <class 'cassandra.components.XanthosComponent'>
DEBUG:root:<class 'cassandra.components.GlobalParamsComponent'>: finished successfully.

DEBUG:root:starting <class 'cassandra.components.XanthosComponent'>
INFO:root:running <class 'cassandra.components.FldgenComponent'>
DEBUG:root:completed <class 'cassandra.components.GlobalParamsComponent'>
DEBUG:root:starting <class 'cassandra.components.FldgenComponent'>
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: rds_file==/Users/d3y010/projects/cassandra/models/fldgen/fldgen-IPSL-CM5A-LR.rds
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: scenario==rcp26
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: climate_var_name==tasAdjust
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: source_climate_data==./training-data/tasAdjust_annual_IPSL-CM5A-LR_rcp26_18610101-20991231.nc
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: units==Kelvin
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: count==239
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: mean==286.8046116940164
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: median==286.30762280534697
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: min==284.78008340211915
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: max==288.6382439866686
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: std==1.182328423261446
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: na_count==0
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: null_count==0
INFO:root:<class 'cassandra.components.TgavStubComponent'> 'Tgav' data summary: all_finite==True
DEBUG:root:<class 'cassandra.components.TgavStubComponent'>: finished successfully.

DEBUG:root:completed <class 'cassandra.components.TgavStubComponent'>
DEBUG:root:Result for tas: len = 2. Shape = (67420, 2868)
DEBUG:root:Result for pr: len = 2. Shape = (67420, 2868)
DEBUG:root:<class 'cassandra.components.FldgenComponent'>: finished successfully.

DEBUG:root:completed <class 'cassandra.components.FldgenComponent'>
INFO:root:ProjectName : trn_abcd_IPSL-CM5A_gexp
INFO:root:InputFolder : /Users/d3y010/repos/github/xanthos/example/input
INFO:root:OutputFolder: /Users/d3y010/repos/github/xanthos/example/output/trn_abcd_IPSL-CM5A_gexp
INFO:root:StartYear - End Year: 1861-2099
INFO:root:Number of Months    : 2868
INFO:root:Running: Future Mode
INFO:root:TempMinFile variable not found for the ABCD runoff module; Snowmelt will not be accounted for.
INFO:root:---Simulation in progress...
INFO:root:	Processing PET...
INFO:root:	PET processed in 17.734493017196655 seconds---
INFO:root:	Processing Runoff...
INFO:root:		Processing spin-up and simulation for basins 1...235
INFO:root:	Runoff processed in 45.51595687866211 seconds---
INFO:root:---Simulation has finished successfully: 63.26426696777344 seconds ---
INFO:root:---Output simulation results:
DEBUG:root:Outputting data annually
DEBUG:root:Unit is km3peryear
DEBUG:root:q output dimension is (67420, 239)
INFO:root:Aggregating by Basin
INFO:root:Aggregated unit is km3peryear
INFO:root:---Output finished: 8.618645906448364 seconds ---
INFO:root:End of trn_abcd_IPSL-CM5A_gexp

I also checked the Xanthos outputs for global runoff per year resulting in the following:

test_runoff

Resolves #58

@crvernon
Copy link
Member Author

CI will fail right now due to the previous version of the .travis.yml not loading an R environment.

@claudiatebaldi
Copy link

Looks good to me. Only one comment, is there a way to keep track of the fact that Tgav sometime is defined as land-only and sometime (most of the times) refers to land+ocean average?

@crvernon
Copy link
Member Author

@claudiatebaldi Thanks! I am not sure and will have to think about this. I think it would be assumed from the parent model in use. Perhaps we just make that as a parameter for the config file where the user can specify. Although, aren't the land-only values taking the land+ocean interactions into account anyway even if they are not reporting ocean grid values?

@claudiatebaldi
Copy link

claudiatebaldi commented Jun 12, 2020

Not sure what you mean with the "although" question, Chris. My point being only that if one is using a Tgav series it would be good to know if it is a global average or just land-only.But then again I'm not sure how this would work operationally, so it may be a redundant piece of information, already obvious to the user

@crvernon crvernon requested a review from rplzzz June 12, 2020 21:16
Copy link
Contributor

@rplzzz rplzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crvernon My biggest concern with this implementation is that we are doing a lot of work to recreate within the component a bunch of data that was already calculated within fldgen. Some of that information is already stored within the fldgen object, and any that isn't easily could be. I think that exposing that data via an exported function in the fldgen package would be simpler, easier to maintain, and more efficient.

Another thing to consider is that the current implementation seems to be relying on details of the fldgen emulator structure that are not part of fldgen's public interface. This creates another maintenance headache because changes to fldgen internals now have the potential to break this component, and there is no easy way for fldgen developers to catch that in testing. I really don't think that's a good idea.

On the subject of metadata, I think that's a really good idea, but I don't think the way it's been implemented here is a good way to go. Metadata needs to be supported in the framework, and ideally components should know as little about it as possible. I think that attaching it as an attribute to the data it describes is a good start, and a lot of useful metadata could be extracted from the configuration file automatically by the framework. There are some other notes along these lines in the line comments.

What I propose is that we split the metadata off as a separate feature from this new component. This component will work without metadata, and metadata needs to be supported in all of the other components too, so it doesn't make a whole lot of sense to bundle the new component and the metadata into a single feature update. Developing the metadata functionality in a separate feature would allow us to design it in a way that addresses metadata needs across the framework without unduly holding up this feature.

Finally, I've noted inline a variety of small issues in various places. There are several general utility functions that have been bundled with this component; those should probably be put into a utility module instead. There were also a couple of places where the documentation was a little confusing, or in some cases inconsistent with the way that Cassandra components are meant to be used.

Thanks for asking me to weigh in. Give me a call if you want to discuss any of these matters in more detail.

class TgavStubComponent(ComponentBase):
"""Feed in an RDS file of from ESM runs used to train the fldgen emulator.

This component is used instead of the HectorStubComponent to provide `tgav` to fldgen.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fundamental design principle of Cassandra is that components don't know too much about each other, so the docstring should be specifying what the component does, not how we expect it to be used. There could well be situations in which flegen uses the HectorStubComponent or some other component (such as a full Hector component) for its Tgav input. Conversely, there might be other components that use the data from this component from time to time.

Either way, it's for the user to decide what component they use for Tgav, or for any other capability, so that information doesn't belong in the component documentation.

This component is used instead of the HectorStubComponent to provide `tgav` to fldgen.

The component provides two capabilities:
:param Tgav: A data frame containing global mean temperature with fields
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that we finally have a concrete example of having alternate implementations of the same capability. With the addition of this component, any time global temperature is required, users have a choice of whether they use the Hector stub or this one. Swapping out the source of the Tgav data is as simple as changing the configuration file.

One thing that I did just notice here is that the HectorStub component provides two other capabilities, atm-co2 and Ftot. This component provides only Tgav, which creates a potential problem if you want to sub out the source of Tgav but keep the Hector atm-co2 and Ftot outputs (perhaps for use in some other component). I think what we need is a configuration option to direct a component not to export one or more of its capabilities. I'll create an issue for that.


This component is used instead of the HectorStubComponent to provide `tgav` to fldgen.

The component provides two capabilities:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this section of the documentation a little confusing. This sentence says there are two capabilities, but I had to read it a couple of times to figure out what they were. It looks like each component does this documentation a slightly different way, which is unfortunate, but I think the Xanthos component comes the closest to what we want. Here is the corresponding section from its documentation:

    params:
          config_file    - Path to Xanthos config file
          OutputNameStr  - Name for the directory to create for Xanthos outputs

    Capability dependencies (all optional):
           gridded_pr  - List of gridded monthly precipitation by grid cell
          gridded_tas  - List of gridded monthly temperature by grid cell
     gridded_pr_coord  - Matrix of lat/lon coordinates for the precip grid cells
    gridded_tas_coord  - Matrix of lat/lon coordinates for the tas grid cellss

    results:
       gridded_runoff  - Capability 'gridded_runoff', a list of runoff matrices,
                         (gridcells x timestep) with the units and aggregation
                         level specified in the Xanthos config file

The only changes I would recommend to this are:

  1. Change "results" to "capabilities", since that's what they are.
  2. Reverse the order, so that capabilities come first, dependencies second, and parameters third.

That way, the first thing users see is a description of what the component does, which is the main thing they need to decide whether they want to use the component. The next is what other capabilities have to be in the system, which they need in order to plan their configuration, and the last thing is the parameters, which they need to actually write the configuration.

['year', 'scenario', 'variable', 'value', 'units']; where value is 'Tgav'
:type Tgav: Pandas DataFrame

:param tgav_metadata: A dictionary of metadata for the `Tgav` capability including:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it the tgav_metadata is the second capability offered by the component? If so, then I really don't like the idea of making it a separate capability from the capability that the metadata describes. Doing it that way puts too much onus on the user code to separately request the required data, and probably to produce the metadata as well.

Could we add the metadata as an attribute to the object that holds the data? That way, a component requesting data always gets the metadata along with it, even if the component is unaware that the metadata even exists. We could also have the ComponentBase.addResults method compile the metadata from the component's configuration input, so that components that are unaware of metadata. We could even establish a convention that configuration parameters beginning with X- are ignored by the rest of the argument parsing and simply copied into metadata. That way, a user writing a configuration file could arrange for any information they thought relevant to be recorded in the metadata without having to modify any of the component code.

As I think about it more, having the metadata handled by the framework instead of in the component code has another benefit, viz., that we can be certain of getting standardized metadata, not just whatever the component author thought to put in.

self.validate_params()

# convert the ListVector object from reading a RDS file to a Python dictionary
rds_dict = self.rds_to_dict()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a cleaner design to just write a function in the fldgen package to extract the information we want here (i.e., global temperature time series) and return it as a vector. Then this component would have only to call that function and store the return value (If I remember right, rpy2 automatically converts R vectors to numpy 1-D arrays).

I see two benefits to doing it that way. First, the added functionality in fldgen would probably be useful to fldgen users. Second, the code to handle the fldgen structures would be a lot simpler in R, where you can handle the relevant data structure natively, so the overall code complexity would be a lot lower.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crvernon, I was just discussing this with @FeralFlows, and it looks like we may be running into memory exhaustion. If so, then those problems are surely exacerbated by storing the entire rds file contents in a member variable, since that ensures that the data can't be gc'd until the component goes out of scope, which is basically the whole run. As far as I can tell, it doesn't look like we've fixed the memory bloat in the fldgen save files yet, so this adds up to a big chunk of memory.

I still think that having fldgen provide a function to extract the data we need is the right way to go, but at the very least, we need to manage the scope of these large objects very carefully.

Thinking on it a little more, I wonder if making this TgavStub a separate component is the right move at all. The FldgenComponent already reads in this data anyhow. What if we provided a configuration option to have FldgenComponent provide a tgav capability? Then, if that capability is turned on, fldgen would extract the needed information from the files it is already reading in, use those values of tgav itself, and export them to the rest of the framework. If not, then it skips that step and requests tgav from elsewhere in the framework.

A slight variant on that would be to implement the feature in issue #62. Then, configuring fldgen to compute its own tgav would always trigger the export, but a user could squelch that if they wanted to have the rest of the system get tgav from somewhere else. That would provide the most flexibility.

else:
return valid_yr

def validate_int(self, value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above; this function belongs in a utility module.

msg = f"{self.__class__} Value '{value}' not able to be converted to an integer as expected."
self.log_raise_exception(ValueError, msg)

def validate_file_exist(self, file_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another function that belongs in a utility module.

msg = f"Input file '{file_path}' does not exist."
self.log_raise_exception(FileNotFoundError, msg)

def rds_to_dict(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get away without this function, if we create an exported function in fldgen to extract the information we need and return it as a simple vector.

def build_file_dict(self, rds_dict):
"""Get a list of target files that match the climate variable specified in the configuration.

:param rds_dict: A Python dictionary of {variable name: value arrays} derived from the emulator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what is going on here. Are you extracting the names of netCDF files from the RDS file? I don't think it's necessary to go back to the original data to extract this information. I'm pretty sure we store the global mean temperature series in the fldgen save file. If not, then it would be easy to do so, since we have to calculate it as part of the training process. We definitely don't want to be going back to the original data. First, it's brittle (What if the original data files move? Or what if you want to take the calculation to a new computing system?), and second it's unnecessarily slow.


return df[self.TGAV_FIELD_ORDER]

def tgav_metadata(self, df, target_file_dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments above about a better way to handle metadata. I really think that insofar as possible, we need to generate this information down in the framework's classes. We could potentially create a hook for a component to add extra metadata beyond what the framework can glean from the input configuration, but it should be optional.

@kdorheim
Copy link

Not sure what you mean with the "although" question, Chris. My point being only that if one is using a Tgav series it would be good to know if it is a global average or just land-only.But then again I'm not sure how this would work operationally, so it may be a redundant piece of information, already obvious to the user

I agree with @claudiatebaldi this is useful information to have and while this might be obvious to the original user I think it would be extremely useful to specify this sort of information for other users in order to understand what is actually being read into the pipeline.

Copy link

@abigailsnyder abigailsnyder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @claudiatebaldi and @kdorheim re improved clarity of documentation in terms of what tgav means depending on data. I'm happy to brainstorm to figure out the best place for that, if I can be useful.
Also: Is it possible/not too much work to add the curve of global runoff driven by the actual ISIMIP data to the plot with the outputs from the 3 green fldgen-generated realizations? The entire reason for this change over from the Hector Tgavis because we originally saw systematic deviations in results between the fldgen-driven runs vs the actual ISIMIP data-driven runs.

@bpbond
Copy link
Member

bpbond commented Jun 17, 2020

FYI, note that SULI Skylar Gering's project this summer is to split Hector Tgav into separate land and ocean variables, following e.g. MAGICC, with land Tgav warming faster in general.

@abigailsnyder
Copy link

@crvernon note this PR in the fldgen repo (implementing updates discussed in group call about 3.5 or 4 weeks ago) that may be useful JGCRI/fldgen#49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create data stub component for tgav
6 participants