-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Tgav Stub Component #61
base: master
Are you sure you want to change the base?
Conversation
CI will fail right now due to the previous version of the |
Looks good to me. Only one comment, is there a way to keep track of the fact that Tgav sometime is defined as land-only and sometime (most of the times) refers to land+ocean average? |
@claudiatebaldi Thanks! I am not sure and will have to think about this. I think it would be assumed from the parent model in use. Perhaps we just make that as a parameter for the config file where the user can specify. Although, aren't the land-only values taking the land+ocean interactions into account anyway even if they are not reporting ocean grid values? |
Not sure what you mean with the "although" question, Chris. My point being only that if one is using a Tgav series it would be good to know if it is a global average or just land-only.But then again I'm not sure how this would work operationally, so it may be a redundant piece of information, already obvious to the user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crvernon My biggest concern with this implementation is that we are doing a lot of work to recreate within the component a bunch of data that was already calculated within fldgen. Some of that information is already stored within the fldgen object, and any that isn't easily could be. I think that exposing that data via an exported function in the fldgen package would be simpler, easier to maintain, and more efficient.
Another thing to consider is that the current implementation seems to be relying on details of the fldgen emulator structure that are not part of fldgen's public interface. This creates another maintenance headache because changes to fldgen internals now have the potential to break this component, and there is no easy way for fldgen developers to catch that in testing. I really don't think that's a good idea.
On the subject of metadata, I think that's a really good idea, but I don't think the way it's been implemented here is a good way to go. Metadata needs to be supported in the framework, and ideally components should know as little about it as possible. I think that attaching it as an attribute to the data it describes is a good start, and a lot of useful metadata could be extracted from the configuration file automatically by the framework. There are some other notes along these lines in the line comments.
What I propose is that we split the metadata off as a separate feature from this new component. This component will work without metadata, and metadata needs to be supported in all of the other components too, so it doesn't make a whole lot of sense to bundle the new component and the metadata into a single feature update. Developing the metadata functionality in a separate feature would allow us to design it in a way that addresses metadata needs across the framework without unduly holding up this feature.
Finally, I've noted inline a variety of small issues in various places. There are several general utility functions that have been bundled with this component; those should probably be put into a utility module instead. There were also a couple of places where the documentation was a little confusing, or in some cases inconsistent with the way that Cassandra components are meant to be used.
Thanks for asking me to weigh in. Give me a call if you want to discuss any of these matters in more detail.
class TgavStubComponent(ComponentBase): | ||
"""Feed in an RDS file of from ESM runs used to train the fldgen emulator. | ||
|
||
This component is used instead of the HectorStubComponent to provide `tgav` to fldgen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fundamental design principle of Cassandra is that components don't know too much about each other, so the docstring should be specifying what the component does, not how we expect it to be used. There could well be situations in which flegen uses the HectorStubComponent or some other component (such as a full Hector component) for its Tgav input. Conversely, there might be other components that use the data from this component from time to time.
Either way, it's for the user to decide what component they use for Tgav, or for any other capability, so that information doesn't belong in the component documentation.
This component is used instead of the HectorStubComponent to provide `tgav` to fldgen. | ||
|
||
The component provides two capabilities: | ||
:param Tgav: A data frame containing global mean temperature with fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that we finally have a concrete example of having alternate implementations of the same capability. With the addition of this component, any time global temperature is required, users have a choice of whether they use the Hector stub or this one. Swapping out the source of the Tgav data is as simple as changing the configuration file.
One thing that I did just notice here is that the HectorStub component provides two other capabilities, atm-co2 and Ftot. This component provides only Tgav, which creates a potential problem if you want to sub out the source of Tgav but keep the Hector atm-co2 and Ftot outputs (perhaps for use in some other component). I think what we need is a configuration option to direct a component not to export one or more of its capabilities. I'll create an issue for that.
|
||
This component is used instead of the HectorStubComponent to provide `tgav` to fldgen. | ||
|
||
The component provides two capabilities: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this section of the documentation a little confusing. This sentence says there are two capabilities, but I had to read it a couple of times to figure out what they were. It looks like each component does this documentation a slightly different way, which is unfortunate, but I think the Xanthos component comes the closest to what we want. Here is the corresponding section from its documentation:
params:
config_file - Path to Xanthos config file
OutputNameStr - Name for the directory to create for Xanthos outputs
Capability dependencies (all optional):
gridded_pr - List of gridded monthly precipitation by grid cell
gridded_tas - List of gridded monthly temperature by grid cell
gridded_pr_coord - Matrix of lat/lon coordinates for the precip grid cells
gridded_tas_coord - Matrix of lat/lon coordinates for the tas grid cellss
results:
gridded_runoff - Capability 'gridded_runoff', a list of runoff matrices,
(gridcells x timestep) with the units and aggregation
level specified in the Xanthos config file
The only changes I would recommend to this are:
- Change "results" to "capabilities", since that's what they are.
- Reverse the order, so that capabilities come first, dependencies second, and parameters third.
That way, the first thing users see is a description of what the component does, which is the main thing they need to decide whether they want to use the component. The next is what other capabilities have to be in the system, which they need in order to plan their configuration, and the last thing is the parameters, which they need to actually write the configuration.
['year', 'scenario', 'variable', 'value', 'units']; where value is 'Tgav' | ||
:type Tgav: Pandas DataFrame | ||
|
||
:param tgav_metadata: A dictionary of metadata for the `Tgav` capability including: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take it the tgav_metadata is the second capability offered by the component? If so, then I really don't like the idea of making it a separate capability from the capability that the metadata describes. Doing it that way puts too much onus on the user code to separately request the required data, and probably to produce the metadata as well.
Could we add the metadata as an attribute to the object that holds the data? That way, a component requesting data always gets the metadata along with it, even if the component is unaware that the metadata even exists. We could also have the ComponentBase.addResults
method compile the metadata from the component's configuration input, so that components that are unaware of metadata. We could even establish a convention that configuration parameters beginning with X-
are ignored by the rest of the argument parsing and simply copied into metadata. That way, a user writing a configuration file could arrange for any information they thought relevant to be recorded in the metadata without having to modify any of the component code.
As I think about it more, having the metadata handled by the framework instead of in the component code has another benefit, viz., that we can be certain of getting standardized metadata, not just whatever the component author thought to put in.
self.validate_params() | ||
|
||
# convert the ListVector object from reading a RDS file to a Python dictionary | ||
rds_dict = self.rds_to_dict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a cleaner design to just write a function in the fldgen package to extract the information we want here (i.e., global temperature time series) and return it as a vector. Then this component would have only to call that function and store the return value (If I remember right, rpy2 automatically converts R vectors to numpy 1-D arrays).
I see two benefits to doing it that way. First, the added functionality in fldgen would probably be useful to fldgen users. Second, the code to handle the fldgen structures would be a lot simpler in R, where you can handle the relevant data structure natively, so the overall code complexity would be a lot lower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crvernon, I was just discussing this with @FeralFlows, and it looks like we may be running into memory exhaustion. If so, then those problems are surely exacerbated by storing the entire rds file contents in a member variable, since that ensures that the data can't be gc'd until the component goes out of scope, which is basically the whole run. As far as I can tell, it doesn't look like we've fixed the memory bloat in the fldgen save files yet, so this adds up to a big chunk of memory.
I still think that having fldgen provide a function to extract the data we need is the right way to go, but at the very least, we need to manage the scope of these large objects very carefully.
Thinking on it a little more, I wonder if making this TgavStub
a separate component is the right move at all. The FldgenComponent
already reads in this data anyhow. What if we provided a configuration option to have FldgenComponent
provide a tgav
capability? Then, if that capability is turned on, fldgen would extract the needed information from the files it is already reading in, use those values of tgav
itself, and export them to the rest of the framework. If not, then it skips that step and requests tgav
from elsewhere in the framework.
A slight variant on that would be to implement the feature in issue #62. Then, configuring fldgen to compute its own tgav would always trigger the export, but a user could squelch that if they wanted to have the rest of the system get tgav from somewhere else. That would provide the most flexibility.
else: | ||
return valid_yr | ||
|
||
def validate_int(self, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above; this function belongs in a utility module.
msg = f"{self.__class__} Value '{value}' not able to be converted to an integer as expected." | ||
self.log_raise_exception(ValueError, msg) | ||
|
||
def validate_file_exist(self, file_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another function that belongs in a utility module.
msg = f"Input file '{file_path}' does not exist." | ||
self.log_raise_exception(FileNotFoundError, msg) | ||
|
||
def rds_to_dict(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can get away without this function, if we create an exported function in fldgen to extract the information we need and return it as a simple vector.
def build_file_dict(self, rds_dict): | ||
"""Get a list of target files that match the climate variable specified in the configuration. | ||
|
||
:param rds_dict: A Python dictionary of {variable name: value arrays} derived from the emulator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what is going on here. Are you extracting the names of netCDF files from the RDS file? I don't think it's necessary to go back to the original data to extract this information. I'm pretty sure we store the global mean temperature series in the fldgen save file. If not, then it would be easy to do so, since we have to calculate it as part of the training process. We definitely don't want to be going back to the original data. First, it's brittle (What if the original data files move? Or what if you want to take the calculation to a new computing system?), and second it's unnecessarily slow.
|
||
return df[self.TGAV_FIELD_ORDER] | ||
|
||
def tgav_metadata(self, df, target_file_dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comments above about a better way to handle metadata. I really think that insofar as possible, we need to generate this information down in the framework's classes. We could potentially create a hook for a component to add extra metadata beyond what the framework can glean from the input configuration, but it should be optional.
I agree with @claudiatebaldi this is useful information to have and while this might be obvious to the original user I think it would be extremely useful to specify this sort of information for other users in order to understand what is actually being read into the pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @claudiatebaldi and @kdorheim re improved clarity of documentation in terms of what tgav means depending on data. I'm happy to brainstorm to figure out the best place for that, if I can be useful.
Also: Is it possible/not too much work to add the curve of global runoff driven by the actual ISIMIP data to the plot with the outputs from the 3 green fldgen-generated realizations? The entire reason for this change over from the Hector Tgavis because we originally saw systematic deviations in results between the fldgen-driven runs vs the actual ISIMIP data-driven runs.
FYI, note that SULI Skylar Gering's project this summer is to split Hector |
@crvernon note this PR in the fldgen repo (implementing updates discussed in group call about 3.5 or 4 weeks ago) that may be useful JGCRI/fldgen#49 |
Desire was to create a
Tgav
stub component to facilitate extracting data for a specific configuration (e.g., scenario, etc.) from ESM runs that were used to train thefldgen
emulator.Per @abigailsnyder:
The component named
TgavStubComponent
was added to Cassandra and provides the following capabilities:Tgav
: A data frame containing global mean temperature with fields ['year', 'scenario', 'variable', 'value', 'units']; where value isTgav
and type is a Pandas DataFrametgav_metadata
: A dictionary of metadata for theTgav
capability including: [rds_file, scenario, climate_var_name, source_climate_data, units, count, mean, median, min, max, std, na_count, null_count, all_finite] designations; typedict
The configuration block for the
TgavStubComponent
is the following:This component conducts validation at run time as well as being equipped with a robust test suite. All tests for this component are now passing locally. The metadata output is a summary of both the Tgav configuration as well as the data itself. The following is logged from the
tgav_metadata
capability:There are currently no constraints on these reported factors though they can be added easily if so desired. @kdorheim provided a summary of over 50 CMIP5 runs that we could use as
min
andmax
constraints from Taylor (2012) and stored here/pic/projects/GCAM/Dorheim/hectorcal/data-raw/CMIP5_annual_global_average.csv
:Please also consider constraints for finite, no data, and NaN if they were to be present in the data.
The following full run for this configuration was completed successfully:
With a logfile of:
I also checked the Xanthos outputs for global runoff per year resulting in the following:
Resolves #58