ADD: variable_info() to help users discover variable names and meanings (#36)
esgf2-us · Mar 22, 2024 · 6dc8a92
1 parent 854f050
commit 6dc8a92
Showing 4 changed files with 166 additions and 15 deletions.
diff --git a/doc/beginner.md b/doc/beginner.md
@@ -1,24 +1,87 @@
+---
+jupytext:
+  text_representation:
+    format_name: myst
+kernelspec:
+  display_name: Python 3
+  name: python3
+---
+
 # Beginner's Guide to ESGF
 
-This guide is targetted at users who are new to obtaining CMIP data from ESGF. While many people work hard to provide the community access in an intuitive fashion, ESGF remains a data source for researchers who have some prior understanding about the data they wish to find and how they are organized. This tutorial is meant to gently expose the uninitiated to key concepts and step you through your first searches using `intake-esgf`.
+This guide is targeted at users who are new to obtaining [CMIP](https://www.wcrp-climate.org/wgcm-cmip) data from ESGF. While many people work hard to provide the community access in an intuitive fashion, ESGF remains a data source for researchers who have some prior understanding about the data they wish to find and how they are organized. This tutorial is meant to gently expose the uninitiated to key concepts and step you through your first searches using `intake-esgf`.
+
+## Which Variable Do We Need?
+
+At the highest level, ESGF stores data in *projects* such as `CMIP5` and `CMIP6`. While there are some similarities between projects, the *control vocabulary*, that is the metadata used to identify unique datasets, varies. In this tutorial we will explain some of the CMIP6 vocabulary, which is the default project for `intake-esgf`.
+
+Perhaps the most important search criterion to determine is the name of the variable you wish to use. `intake-esgf` has some functionality to assist.
+First, import and instantiate the catalog.
+
+```{code-cell}
+from intake_esgf import ESGFCatalog
+cat = ESGFCatalog()
+```
+
+Then you can use the catalog to perform a free text search for any word that may be related to the variable for which you are searching. In this case, we will search for `air temperature surface`.
+
+```{code-cell}
+cat.variable_info("air temperature surface")
+```
+
+This function returns a pandas dataframe which lists the names of several variables along with their units and standard names. From a perusal of this list, it appears that `tas` is the variable we want for this search. The dataframe index also shows us that the name of the control vocabulary is `variable_id`.
+
+## Control Vocabulary
+
+While we could now perform a search for `variable_id=tas`, this search will take quite some time. `intake-esgf` currently works better if we give it a better idea of what we wish to find. Simply put, we recommend constraining the search.
+
+One of the more useful search facets is the `experiment_id`, a unique identifier corresponding to the experiment. As part of the planning phase of the CMIP process, groups of researchers write papers detailing the specific method by which a model is to be run to be included in an experiment. This allows modeling centers to follow the protocol if they wish to be part of the experiment. You can [browse](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html) the experiments to see the identifiers and some basic information.
+
+One commonly used experiment is `historical`, where models are run using reconstructions of the historical earth state from 1850 until 2015. We will use this in our example search.
+
+```{code-cell}
+cat.search(variable_id="tas", experiment_id="historical")
+```
+
+This will populate an underlying pandas dataframe with the search results. The columns of that dataframe and their unique values are presented. This exposes more of the control vocabulary for CMIP6. We have already explored `variable_id` and `experiment_id`. Now we explain more of the control vocabulary, emphasizing what we find to be the more useful facets.
+
+- `source_id` - The identifier of the model. We use the term *source* instead of *model* in an attempt to make the control vocabulary more general and in the future unify vocabularies among projects. Each model or model version will have a unique string identifying which model and/or configuration was run, which can be [browsed](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_source_id.html).
+- `member_id` - The label for the variant of the model run (also known as `variant_label`). The precise meaning of these labels is specific to each model group. For CMIP6 these take the form `r...i...p...f...` where integers after each character reflect a separate run. Usually (but not with all models) the main result will be `r1i1p1f1`.
+  - `r` stands for the *realization*. Models can be run with small perturbations of the initial conditions to produce an ensemble. Model runs with the same `r` number started with the same initial conditions.
+  - `i` stands for the *initialization*. Models use different methods to spin up their states into quasi-equilibrium. This integer reflects the method that was used by the model.
+  - `p` stands for the *physics*. Modern models have many configuration options and while most submit results in a single configuration, this designation provides a method to distinguish among them if desired.
+  - `f` stands for the *forcing*. When multiple methods for forcing an experiment are possible, this label distinguishes among them.
+- `table_id` - Variables are organized into what CMIP refers to as tables. This tends to be a juxtaposition of a problem realm (`A` for atmosphere, `O` for ocean) along with time frequency (`mon` for month, `day` for day). Note that a variable can exist in several tables. In our search we see that there is `day` temperature data as well as monthly `Amon`.
+
+## Downloading Data
+
+We will refine our search to select a single model `CanESM5`, variant `r1i1p1f1`, and table `Amon`.
 
-## Why is it so hard?
+```{code-cell}
+cat.search(
+    variable_id="tas",
+    experiment_id="historical",
+    source_id="CanESM5",
+    member_id="r1i1p1f1",
+    table_id="Amon"
+)
+```
 
-Maybe you have some experience with searching for data in ESGF and found it complicated and difficult.
+Once your search has been sufficiently narrowed, you may download into a dictionary of [xarray](https://docs.xarray.dev/en/stable/) datasets.
 
-* The control vocabulary we use to describe datasets is technical, carefully chosen, and sometimes not intuitive to the beginner. We will disentangle some of this in this tutorial, but largely the control vocabulary is something the user must learn.
-* As the community conducts more phases of the CMIP process, our ideas about what this control vocabulary should be change. This means that while there will be similarities between the vocabulary of CMIP5 to CMIP6, they are not identical and must be learned.
-* The data that model centers produce is usually incomplete on some level. Modeling centers budget compute time and personnel to participate in different experiments, but their resources are finite. Not all models will successfully submit all variables for all the variants in all the experiments they run. The user is left to sort through what is there and make the most of it.
-* There is no single index that contains all the information about CMIP holdings worldwide. If you want to be certain that you have found everything, you have to search all of the index nodes.
-* While the web [interface](https://aims2.llnl.gov/search) will search in a distributed fashion over all index nodes, it will not report when an index has failed to return a response. In our experience, this happens often and can leave you with an impression that you have found everything there is, but in fact have not.
-* In order to reduce download times for users around the globe, some datasets are replicated to many different locations. When you use a web [interface](https://aims2.llnl.gov/search) to search, you may find many instances of the same dataset just stored in a different location. This leads to many search results to sort through and can cause some ambiguity of what should be selected.
+```{code-cell}
+dsd = cat.to_dataset_dict()
+```
 
-We have designed `intake-esgf` to hide as much of this complexity as we can to make the experience better for the user.
+Note that you do not need to explicitly search for cell measures such as `areacella`. These will be included [automatically](measures). The files are downloaded locally to a cache directory which mirrors the directory structure of the remote storage. So while the above code is how you download data, it is also how you load it into memory for your analysis scripts. There is no need to handle files in your working directory or write complicated code to load them into memory.
 
-## CMIP6 Control Vocabulary
+## Plotting
 
-At the highest level, ESGF stores data in *projects* such as `CMIP5` and `CMIP6`. While there are some similarities, the *control vocabulary*, that is the metadata used to identify unique datasets, varies. In the following we will explain some of this vocabulary by starting with the more relevant terms.
+In this example, we will just take a temporal mean and plot the result using matplotlib.
 
-* `experiment_id` - The identifier of the experiment. As part of the planning phase of the CMIP process, groups of researchers can write a paper detailing a specific method that a model is to be run. This allows modeling centers to read the paper and follow the protocol if they wish to be part of the experiment. You can browse the experiments [here](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html) to see the indentifiers and some basic information.
-* `source_id` - The identifier of the model. We use the term *source* instead of *model* in an attempt to make the control vocabular more general and in the future unify vocabularies among projects. Each model or model version will have a unique string identifying which model and/or configuration was run. [here](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_source_id.html)
-* `variable_id` - The identifier of the variable.
+```{code-cell}
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(figsize=(6, 4), tight_layout=True)
+ds = dsd["tas"]["tas"].mean(dim="time") - 273.15  # to [C]
+ds.plot(ax=ax, cmap="bwr", vmin=-40, vmax=40, cbar_kwargs={"label": "tas [C]"})
+```
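
The `member_id` description in the guide above gives the form `r...i...p...f...`. As an illustration only, and not part of this commit or of `intake-esgf`, the following sketch splits such a label into its four integer indices. The function name `parse_member_id` and the key names are hypothetical, and the sketch assumes a plain variant label; a `member_id` that carries an additional prefix would be rejected.

```python
import re

# Pattern for plain CMIP6 variant labels such as "r1i1p1f1" (assumed form).
VARIANT_PATTERN = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")


def parse_member_id(member_id: str) -> dict[str, int]:
    """Split a label like 'r1i1p1f1' into its four integer indices."""
    match = VARIANT_PATTERN.match(member_id)
    if match is None:
        raise ValueError(f"not a plain variant label: {member_id!r}")
    # Hypothetical key names, following the terms used in the guide.
    keys = ("realization", "initialization", "physics", "forcing")
    return dict(zip(keys, map(int, match.groups())))


print(parse_member_id("r1i1p1f1"))
print(parse_member_id("r10i1p2f1"))
```

A helper like this could be used to group or sort search results by realization number when comparing ensemble members.
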
diff --git a/intake_esgf/catalog.py b/intake_esgf/catalog.py
@@ -24,6 +24,7 @@
     parallel_download,
 )
 from intake_esgf.core import GlobusESGFIndex, SolrESGFIndex
+from intake_esgf.core.globus import variable_info
 from intake_esgf.database import create_download_database, get_download_rate_dataframe
 from intake_esgf.exceptions import NoSearchResults
 from intake_esgf.logging import setup_logging
@@ -658,3 +659,25 @@ def download_summary(
             )
         )
         return df
+
+    def variable_info(self, query: str, project: str = "CMIP6") -> pd.DataFrame:
+        """Return a dataframe with variable information from a query.
+
+        If you are new to searching for data in ESGF, you may not know how to figure out
+        what variables you need for your purpose.
+
+        Parameters
+        ----------
+        query
+            A search string whose contents we will use to search all record fields.
+        project
+            The project whose records we will search, defaults to `CMIP6`.
+
+        Returns
+        -------
+        df
+            A dataframe with the possibly relevant variables, their units, and various
+            name and description fields.
+
+        """
+        return variable_info(query, project)
diff --git a/intake_esgf/core/globus.py b/intake_esgf/core/globus.py
@@ -149,3 +149,49 @@ def from_tracking_ids(self, tracking_ids: list[str]) -> pd.DataFrame:
             df.append(record)
         df = pd.DataFrame(df)
         return df
+
+
+def variable_info(query: str, project: str = "CMIP6") -> pd.DataFrame:
+    """Return a dataframe with variable information from a query."""
+    # first we populate a list of related variables
+    q = (
+        SearchQuery(query)
+        .add_filter("type", ["Dataset"])
+        .add_filter("project", [project])
+        .add_facet("variable_id", "variable_id")
+        .add_facet("variable", "variable")
+        .set_limit(0)
+    )
+    response = SearchClient().post_search("ea4595f4-7b71-4da7-a1f0-e3f5d8f7f062", q)
+    variables = list(
+        set(
+            [
+                bucket["value"]
+                for fr in response.data["facet_results"]
+                for bucket in fr["buckets"]
+            ]
+        )
+    )
+    # which facet do we use for variables?
+    var_facet = [fr["name"] for fr in response.data["facet_results"] if fr["buckets"]]
+    assert var_facet
+    var_facet = var_facet[0]
+    # then we loop through them and extract information for the user
+    df = []
+    for v in variables:
+        q = (
+            SearchQuery("")
+            .add_filter("type", ["Dataset"])
+            .add_filter("project", [project])
+            .add_filter(var_facet, [v])  # need to abstract this
+            .set_limit(1)
+        )
+        response = SearchClient().post_search("ea4595f4-7b71-4da7-a1f0-e3f5d8f7f062", q)
+        for doc in response.get("gmeta"):
+            content = doc["entries"][0]["content"]
+            columns = [var_facet]
+            columns += [key for key in content if "variable_" in key]
+            columns += [key for key in content if "name" in key]
+            df.append({key: content[key][0] for key in set(columns)})
+    df = pd.DataFrame(df).sort_values(var_facet).set_index(var_facet)
+    return df
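
The first half of `variable_info()` above flattens the facet buckets into a unique list of variable names and then picks the facet that actually returned buckets. The sketch below reproduces just that step offline, as one possible reading of the logic. The `facet_results` literal is made-up mock data shaped like the `response.data["facet_results"]` entries the diff accesses, and raising `ValueError` instead of using a bare `assert` is a suggestion, not what the commit does.

```python
# Mock of the part of the search response used above (hypothetical values).
facet_results = [
    {
        "name": "variable_id",
        "buckets": [
            {"value": "tas", "count": 120},
            {"value": "ts", "count": 80},
            {"value": "tas", "count": 5},
        ],
    },
    {"name": "variable", "buckets": []},
]

# Unique variable names across all facets, sorted for a stable order.
variables = sorted(
    {bucket["value"] for fr in facet_results for bucket in fr["buckets"]}
)

# The first facet with non-empty buckets is taken as the project's variable facet.
populated = [fr["name"] for fr in facet_results if fr["buckets"]]
if not populated:
    raise ValueError("no variable facet returned any buckets for this query")
var_facet = populated[0]

print(variables, var_facet)
```

An explicit exception keeps the check active when Python runs with `-O`, which strips `assert` statements.
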
diff --git a/intake_esgf/tests/test_basic.py b/intake_esgf/tests/test_basic.py
@@ -98,3 +98,22 @@ def test_remove_ensemble():
 def test_download_dbase():
     cat = ESGFCatalog()
     assert len(cat.download_summary().columns)
+
+
+def test_variable_info():
+    cat = ESGFCatalog()
+    df = cat.variable_info("temperature")
+    assert df.index.isin(
+        [
+            "sitemptop",
+            "ta",
+            "ta850",
+            "tas",
+            "tasmax",
+            "tasmin",
+            "thetao",
+            "tos",
+            "ts",
+            "tsl",
+        ]
+    ).all()
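
The assertion in `test_variable_info` checks that every returned index value is in the listed names; it does not require that every listed name is returned. A minimal sketch of the same subset semantics with plain Python sets follows. The helper name `all_returned_allowed` is hypothetical and the inputs are made up for illustration.

```python
# Names accepted by the test above.
allowed = {
    "sitemptop", "ta", "ta850", "tas", "tasmax",
    "tasmin", "thetao", "tos", "ts", "tsl",
}


def all_returned_allowed(returned: list[str]) -> bool:
    """True when every returned name is in the allowed set (a subset check)."""
    return set(returned) <= allowed


print(all_returned_allowed(["tas", "ts"]))   # subset of allowed names
print(all_returned_allowed(["tas", "pr"]))   # "pr" is not in the allowed set
```

If the intent were instead to guarantee that a specific variable such as `tas` is always found, a membership check on the returned index would be the stricter complement.
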