changes to text and metadata (#18)
nocollier authored Jan 23, 2024
1 parent e8d53d8 commit dc33712
Showing 4 changed files with 36 additions and 31 deletions.
42 changes: 19 additions & 23 deletions README.md
@@ -1,4 +1,4 @@
[<img width=250px src=https://nvcl.energy.gov/content/images/project/earth-system-grid-federation-2-93.jpg>](https://climatemodeling.science.energy.gov/presentations/esgf2-building-next-generation-earth-system-grid-federation)
[<img width=250px src=./doc/_static/logo.png>](https://climatemodeling.science.energy.gov/presentations/esgf2-building-next-generation-earth-system-grid-federation)

# intake-esgf

@@ -7,34 +7,30 @@
[![Continuous Integration][ci-badge]][ci-link]
[![Documentation Status][rtd-badge]][rtd-link]

## Motivation
A small intake and intake-esm *inspired* package under development in ESGF2.
This package queries a sample index of the replicas hosted at Argonne National
Laboratory and returns the response as a pandas dataframe, mimicking the
interface developed by [intake-esm](https://github.com/intake/intake-esm). As a
user accesses ESGF data, this package will maintain a local cache of files
stored in `${HOME}/.esgf` as well as a log of searches and downloads in `${HOME}/.esgf/esgf.log`.
## Overview

## Design Principles
`intake-esgf` is an [intake-esm](https://github.com/intake/intake-esm) *inspired* package under development in ESGF2. The main difference is that, instead of querying a static index that is loaded completely at runtime, `intake-esgf` catalogs initialize empty and are populated by searching ESGF index nodes.

* The user wants their data as fast as possible without needing to understand where it is coming from or how ESGF is organized.
* The search should be concise enough that it becomes part of the analysis script and is also how data is loaded into memory.
## Installation

## Overview
You may install `intake-esgf` using [pip](https://pypi.org/project/pip/):

```bash
python -m pip install intake-esgf
```

## Features

* While implemented to inform new developments in ESGF2, this package can also point to ESGF1 indices (`ESGFCatalog(esgf1_indices=True)` for all nodes or `ESGFCatalog(esgf1_indices=["esgf-node.llnl.gov"])` to pick a subset).
* When performing a search, we query all indices in parallel and merge the results in a pandas dataframe. The notion of which node the data lives on is transparent to the user.
* As in `intake-esm`, once the search describes the datasets that you want to use in your analysis, call `cat.to_dataset_dict()`. The package will then get file information from the indices and then either load the data from local holdings (previously downloaded or directly available) or download it in parallel. The keys of the returned dictionary of xarray datasets use the dataset id and the minimal set of facets needed to uniquely describe each dataset being returned.
* If the script is run on resources where direct data access is available, you can set the path with `cat.set_esgf_data_root(...)` and then the package will prefer this location for loading data. This makes your script portable and easily used in server-side computing.
* The package harvests `cell_measure` information from the dataset attributes and then automatically finds, downloads, and associates the appropriate measures with each dataset. Because the measures are often not present for every experiment/variant, we relax the search criteria until the appropriate measure matching the `source_id`/`grid_label` is found.
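
The cell-measure bullet describes relaxing the search until a match is found. The following is only a minimal sketch of that idea, not the package's implementation: `find_measure`, the order in which facets are dropped, and the sample records are all hypothetical, and plain dictionaries stand in for index responses.

```python
def find_measure(records, criteria, relax_order):
    """Return (matches, facets_used), dropping facets one at a time.

    records: list of facet dictionaries (a stand-in for index results).
    criteria: facets of the dataset that needs a cell measure.
    relax_order: facets to drop, in order, whenever nothing matches.
    """
    active = dict(criteria)
    to_drop = list(relax_order)
    while True:
        matches = [
            r for r in records
            if all(r.get(k) == v for k, v in active.items())
        ]
        if matches or not to_drop:
            return matches, active
        # Nothing found: relax the search by dropping the next facet.
        active.pop(to_drop.pop(0), None)


# Hypothetical index contents: the measure exists only for another experiment.
records = [
    {"variable_id": "areacella", "source_id": "CESM2", "grid_label": "gn",
     "experiment_id": "piControl", "variant_label": "r1i1p1f1"},
]
criteria = {"variable_id": "areacella", "source_id": "CESM2", "grid_label": "gn",
            "experiment_id": "historical", "variant_label": "r2i1p1f1"}
matches, used = find_measure(records, criteria, ["variant_label", "experiment_id"])
print(len(matches), sorted(used))
```

In this sketch `source_id` and `grid_label` are never relaxed, so a measure from a different model or grid is not attached by mistake.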
For a full listing of features with code examples, please consult the [documentation](https://intake-esgf.readthedocs.io/en/latest/?badge=latest). In brief, `intake-esgf` aims to hide some of the complexity of obtaining ESGF data and get the user the data as fast as we can.

## Future
* Indices are queried in parallel and report when they fail to return a response. The results are aggregated and presented to the user as a [pandas](https://pandas.pydata.org/) DataFrame.
* The locations of the data are hidden from the user. Internally we track which locations provide the user the fastest transfers and automatically favor them for you.
* Files are downloaded in parallel into a local cache which mirrors the remote storage directory structure. They are returned to the user as a dictionary of [xarray](https://xarray.dev/) Datasets. Your search script then becomes the way you download data as well as how you load it into memory for your analysis.
* Prior to downloading data, we first check that it is not already available locally. This could be because you had previously downloaded it, but also because you are working on a server that has direct access.
* Cell measure information is harvested from your search results and automatically included in the returned datasets.
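
As an illustration of the first bullet, the sketch below queries several indices in parallel, merges the rows, and records which indices failed. It is an assumption-laden stand-in rather than the package's code: `query_all` and the three `index_*` functions are hypothetical, and lists of dictionaries are used in place of a pandas DataFrame to keep the example dependency-free.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(indices, **facets):
    """Query every index in parallel; return (merged_rows, failures)."""
    def run(item):
        name, search = item
        try:
            return name, search(**facets), None
        except Exception as exc:  # one failing index must not sink the search
            return name, [], exc

    with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
        outcomes = list(pool.map(run, indices.items()))

    merged, failures = {}, {}
    for name, rows, exc in outcomes:
        if exc is not None:
            failures[name] = str(exc)
        for row in rows:
            # Merge on the dataset id so a dataset listed by several
            # indices appears only once in the result.
            merged.setdefault(row["id"], row)
    return list(merged.values()), failures


# Hypothetical stand-ins for index nodes.
def index_a(**facets):
    return [{"id": "ds1", "variable_id": "tas"}, {"id": "ds2", "variable_id": "tas"}]


def index_b(**facets):
    return [{"id": "ds2", "variable_id": "tas"}]


def index_down(**facets):
    raise TimeoutError("no response")


rows, failures = query_all(
    {"a": index_a, "b": index_b, "down": index_down}, variable_id="tas"
)
print([r["id"] for r in rows], failures)
```

The failing index is reported in `failures` while the results from the other two are still returned.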

* Currently the package will attempt to download files using the first https link that it finds. If a link fails, we continue on to the next link in the list. However, this list should be prioritized by what is fastest for the user. This is possibly something we can measure and adapt as the user uses the tool.
* We currently use the https links to download the data. However, we plan to add a `stream=True` option to `to_dataset_dict` which would not download but rather pass OPeNDAP/THREDDS links to the xarray constructor.
* A growing number of file entries now also contain Globus links. We will add authentication and then the option to select an endpoint to download the current catalog to.
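
The link-fallback behavior and the proposed prioritization by speed can be sketched as follows. This is a hypothetical illustration only: `order_links`, `download_with_fallback`, the `rates` table, and the `example.org` hosts are invented for the example, and a fake `fetch` function replaces real network access.

```python
from urllib.parse import urlparse


def order_links(links, rates):
    """Sort links so hosts with the highest recorded transfer rate come first."""
    return sorted(links, key=lambda link: -rates.get(urlparse(link).hostname, 0.0))


def download_with_fallback(links, fetch):
    """Try each link in order; return (link, payload) for the first success."""
    errors = []
    for link in links:
        try:
            return link, fetch(link)
        except OSError as exc:
            # Remember the failure and continue on to the next link.
            errors.append((link, exc))
    raise OSError(f"all {len(links)} links failed: {errors}")


def fake_fetch(link):
    """Stand-in for an https download; one host is unreachable."""
    if "down.example.org" in link:
        raise ConnectionError("unreachable")
    return b"data"


links = [
    "https://slow.example.org/f.nc",
    "https://fast.example.org/f.nc",
    "https://down.example.org/f.nc",
]
# Hypothetical transfer rates measured during earlier downloads.
rates = {"slow.example.org": 5.0, "fast.example.org": 50.0, "down.example.org": 80.0}

used, payload = download_with_fallback(order_links(links, rates), fake_fetch)
print(used)
```

Here the historically fastest host is tried first, fails, and the download falls through to the next-fastest link.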

[ci-badge]: https://github.com/esgf2-us/intake-esgf/actions/workflows/ci.yml/badge.svg?branch=main
[ci-link]: https://github.com/esgf2-us/intake-esgf/actions/workflows/ci.yml
[rtd-badge]: https://readthedocs.org/projects/intake-esm/badge/?version=latest
[rtd-link]: https://intake-esm.readthedocs.io/en/latest/?badge=latest
[rtd-badge]: https://readthedocs.org/projects/intake-esgf/badge/?version=latest
[rtd-link]: https://intake-esgf.readthedocs.io/en/latest/?badge=latest
1 change: 0 additions & 1 deletion doc/beginner.md
@@ -21,5 +21,4 @@ At the highest level, ESGF stores data in *projects* such as `CMIP5` and `CMIP6`

* `experiment_id` - The identifier of the experiment. As part of the planning phase of the CMIP process, groups of researchers can write a paper detailing a specific way in which a model is to be run. This allows modeling centers to read the paper and follow the protocol if they wish to be part of the experiment. You can browse the experiments [here](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html) to see the identifiers and some basic information.
* `source_id` - The identifier of the model. We use the term *source* instead of *model* in an attempt to make the controlled vocabulary more general and, in the future, unify vocabularies among projects. Each model or model version will have a unique string identifying which model and/or configuration was run. You can browse the sources [here](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_source_id.html).

* `variable_id` - The identifier of the variable.
18 changes: 13 additions & 5 deletions doc/quickstart.md
@@ -9,6 +9,14 @@ kernelspec:

# Quickstart

To get started, you will need to install `intake-esgf` using [pip](https://pypi.org/project/pip/):

```bash
python -m pip install intake-esgf
```

Next you will need to import the `ESGFCatalog` and `matplotlib` for plotting later in the document.

```{code-cell}
from intake_esgf import ESGFCatalog
import matplotlib.pyplot as plt
@@ -19,18 +27,18 @@ import matplotlib.pyplot as plt
A catalog in `intake-esgf` initializes empty. This is because while intake-esm
loads a large file-based database into memory, we are going to populate a
catalog by searching one or many index nodes. The ESGFCatalog is configured by
default to query a Globus (ElasticSearch) based index which has information
about holdings at the Argonne Leadership Computing Facility (ALCF) only. We will
demonstrate how this may be expanded to include other nodes later.
default to query a Globus-based index which has information about holdings at
the Argonne Leadership Computing Facility (ALCF) only. We will demonstrate how
this may be expanded to include other nodes [later](configure).

```{code-cell}
cat = ESGFCatalog()
print(cat) # <-- nothing to see here yet
```

To populate the catalog, perform a search using the traditional facets. If you
are not familiar with these, we recommend starting with our
[beginner](beginner) tutorial.
are not familiar with these, we recommend starting with
our [beginner](beginner) tutorial.

```{code-cell}
cat.search(
6 changes: 4 additions & 2 deletions setup.cfg
@@ -3,16 +3,18 @@ name = intake-esgf
author = Nathan Collier
author_email = [email protected]
license = BSD-3-Clause
description = An intake and intake-esm inspired catalog for ESGF
description = An intake-esm inspired catalog for ESGF
long_description = file: README.md
long_description_content_type=text/markdown
classifiers =
Development Status :: 1 - Planning
Development Status :: 4 - Beta
License :: OSI Approved :: BSD License
Operating System :: OS Independent
Programming Language :: Python :: 3
Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10
Programming Language :: Python :: 3.11
Programming Language :: Python :: 3.12
Intended Audience :: Science/Research
Topic :: Scientific/Engineering
