changes to text and metadata (#18)

esgf2-us · Jan 23, 2024 · dc33712
1 parent e8d53d8
commit dc33712
Showing 4 changed files with 36 additions and 31 deletions.
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-[<img width=250px src=https://nvcl.energy.gov/content/images/project/earth-system-grid-federation-2-93.jpg>](https://climatemodeling.science.energy.gov/presentations/esgf2-building-next-generation-earth-system-grid-federation)
+[<img width=250px src=./doc/_static/logo.png>](https://climatemodeling.science.energy.gov/presentations/esgf2-building-next-generation-earth-system-grid-federation)
 
 # intake-esgf
 
@@ -7,34 +7,30 @@
 [![Continuous Integration][ci-badge]][rtd-link]
 [![Documentation Status][rtd-badge]][rtd-link]
 
-## Motivation
-A small intake and intake-esm *inspired* package under development in ESGF2.
-This package queries a sample index of the replicas hosted at Argonne National
-Laboratory and returns the response as a pandas dataframe, mimicing the
-interface developed by [intake-esm](https://github.com/intake/intake-esm). As a
-user accesses ESGF data, this package will maintain a local cache of files
-stored in `${HOME}/.esgf` as well as a log of searches and downloads in `${HOME}/.esgf/esgf.log`.
+## Overview
 
-## Design Principles
+`intake-esgf` is an [intake-esm](https://github.com/intake/intake-esm) *inspired* package under development in ESGF2. The main difference is that in place of querying a static index which is completely loaded at runtime, `intake-esgf` catalogs initialize empty and are populated by searching, querying ESGF index nodes.
 
-* The user wants their data as fast as possible without needing to understand where it is coming from or how ESGF is organized.
-* The search should concise enough that it becomes part of the analysis script and is also how data is loaded into memory.
+## Installation
 
-## Overview
+You may install `intake-esgf` using [pip](https://pypi.org/project/pip/):
+
+```bash
+python -m pip install intake-esgf
+```
+
+## Features
 
-* While implemented to inform new developments in ESGF2, this package can also point to ESGF1 indices (`ESGFCatalog(esgf1_indices=True)` for all nodes or `ESGFCatalog(esgf1_indices=["esgf-node.llnl.gov"])` to pick a subset).
-* When performing a search, we query all indices in parallel and merge the results in a pandas dataframe. The notion of which node the data lives on is transparent to the user.
-* As in `intake-esm`, once the search describes the datasets that you want to use in your analysis, call `cat.to_dataset_dict()`. The package will then get file information from the indices and then either load the data from local holdings (previously downloaded or directly available) or download it in parallel. They keys of the returned dictionary of xarray datasets use the dataset id and the minimal set of faceets to uniquely describe each dataset being returned.
-* If the script is run on resources where direct data access is available, you can set the path with `cat.set_esgf_data_root(...)` and then the package will prefer this location for loading data.  This makes your script portable and easily used in server-side computing.
-* The package harvests `cell_measure` information from the dataset attributes and then automatically finds, downloads, and associates the appropriate measures with each dataset. As many times the measures are not present for each experiment/variant, we relax search criteria until the appropriate measure matching the `source_id`/`grid_label` is found.
+For a full listing of features with code examples, please consult the [documentation](https://intake-esgf.readthedocs.io/en/latest/?badge=latest). In brief, `intake-esgf` aims to hide some of the complexity of obtaining ESGF data and get the user the data as fast as we can.
 
-## Future
+* Indices are queried in parallel and report when they fail to return a response. The results are aggregated and presented to the user as a [pandas](https://pandas.pydata.org/) DataFrame.
+* The locations of the data are hidden from the user. Internally we track which locations provide the user the fastest transfers and automatically favor them for you.
+* Files are downloaded in parallel into a local cache which mirrors the remote storage directory structure. They are returned to the user as a dictionary of [xarray](https://xarray.dev/) Datasets. Your search script then becomes the way you download data as well as how you load it into memory for your analysis.
+* Prior to downloading data, we first check that it is not already available locally. This could be because you had previously downloaded it, but also because you are working on a server that has direct access.
+* Cell measure information is harvested from your search results and automatically included in the returned datasets.
 
-* Currently the package will attempt to download files using the first https link that it finds. If a link fails, we continue on to the next link in the list. However, this list should be prioritized by what is fastest for the user. This is possibly something we can measure and adapt as the user uses the tool.
-* We currently use the https links to download the data. However, we plan to add a `stream=True` option to `to_dataset_dict` which would not download but rather pass OPeNDAP/THREDDS links to the xarray constructor.
-* A growing number of file entries now also contain Globus links. We will add authentication and then the option to select and endpoint to download the current catalog to.
 
 [ci-badge]: https://github.com/esgf2-us/intake-esgf/actions/workflows/ci.yml/badge.svg?branch=main
 [ci-link]: https://github.com/esgf2-us/intake-esgf/actions/workflows/ci.yml
-[rtd-badge]: https://readthedocs.org/projects/intake-esm/badge/?version=latest
-[rtd-link]: https://intake-esm.readthedocs.io/en/latest/?badge=latest
+[rtd-badge]: https://readthedocs.org/projects/intake-esgf/badge/?version=latest
+[rtd-link]: https://intake-esgf.readthedocs.io/en/latest/?badge=latest
diff --git a/doc/beginner.md b/doc/beginner.md
@@ -21,5 +21,4 @@ At the highest level, ESGF stores data in *projects* such as `CMIP5` and `CMIP6`
 
 * `experiment_id` - The identifier of the experiment. As part of the planning phase of the CMIP process, groups of researchers can write a paper detailing a specific method that a model is to be run. This allows modeling centers to read the paper and follow the protocol if they wish to be part of the experiment. You can browse the experiments [here](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html) to see the indentifiers and some basic information.
 * `source_id` - The identifier of the model. We use the term *source* instead of *model* in an attempt to make the control vocabular more general and in the future unify vocabularies among projects. Each model or model version will have a unique string identifying which model and/or configuration was run. [here](https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_source_id.html)
-
 * `variable_id` - The identifier of the variable.
diff --git a/doc/quickstart.md b/doc/quickstart.md
@@ -9,6 +9,14 @@ kernelspec:
 
 # Quickstart
 
+To get started, you will need to install `intake-esgf` using [pip](https://pypi.org/project/pip/):
+
+```bash
+python -m pip install intake-esgf
+```
+
+Next you will need to import the `ESGFCatalog` and `matplotlib` for plotting later in the document.
+
 ```{code-cell}
 from intake_esgf import ESGFCatalog
 import matplotlib.pyplot as plt
@@ -19,18 +27,18 @@ import matplotlib.pyplot as plt
 A catalog in `intake-esgf` initializes empty. This is because while intake-esm
 loads a large file-based database into memory, we are going to populate a
 catalog by searching one or many index nodes. The ESGFCatalog is configured by
-default to query a Globus (ElasticSearch) based index which has information
-about holdings at the Argonne Leadership Computing Facility (ALCF) only. We will
-demonstrate how this may be expanded to include other nodes later.
+default to query a Globus-based index which has information about holdings at
+the Argonne Leadership Computing Facility (ALCF) only. We will demonstrate how
+this may be expanded to include other nodes [later](configure).
 
 ```{code-cell}
 cat = ESGFCatalog()
 print(cat)  # <-- nothing to see here yet
 ```
 
 To populate the catalog, perform a search using the traditional facets. If you
-are not familiar with these, we recommend you starting with our
-[beginner](beginner) tutorial.
+are not familiar with these, we recommend you starting with
+our[beginner](beginner) tutorial.
 
 ```{code-cell}
 cat.search(

diff --git a/setup.cfg b/setup.cfg
@@ -3,16 +3,18 @@ name = intake-esgf
 author = Nathan Collier
 author_email = [email protected]
 license = BSD-3-Clause
-description = An intake and intake-esm inspired catalog for ESGF
+description = An intake-esm inspired catalog for ESGF
+long_description = file: README.md
 long_description_content_type=text/markdown
 classifiers =
-    Development Status :: 1 - Planning
+    Development Status :: 4 - Beta
     License :: OSI Approved :: BSD License
     Operating System :: OS Independent
     Programming Language :: Python :: 3
     Programming Language :: Python :: 3.9
     Programming Language :: Python :: 3.10
     Programming Language :: Python :: 3.11
+    Programming Language :: Python :: 3.12
     Intended Audience :: Science/Research
     Topic :: Scientific/Engineering
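
The README text added in this commit says that indices are queried in parallel, report when they fail to return a response, and have their results aggregated. A minimal sketch of that pattern is shown here using only the Python standard library. It is not the `intake-esgf` implementation: `query_alcf`, `query_broken`, and `search_all` are hypothetical names invented for illustration, and the records are made-up placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Hypothetical stand-ins for index nodes (not part of intake-esgf).
# Each callable returns a list of records or raises on failure.
def query_alcf(**facets):
    return [{"dataset_id": "ds-a", "index": "alcf", **facets}]


def query_broken(**facets):
    raise TimeoutError("no response")


def search_all(indices, **facets):
    """Query every index in parallel and return (records, failures)."""
    records, failures = [], {}
    with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
        # Map each submitted future back to the name of its index.
        futures = {
            pool.submit(fn, **facets): name for name, fn in indices.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                records.extend(future.result())
            except Exception as exc:
                # Record the failure instead of aborting the whole search.
                failures[name] = repr(exc)
    return records, failures


records, failures = search_all(
    {"alcf": query_alcf, "broken": query_broken}, variable_id="tas"
)
```

Under these assumptions, `records` holds the merged results from the indices that responded and `failures` names the ones that did not. A list of dictionaries like `records` could then be passed to `pandas.DataFrame(records)` to get the tabular view the README describes.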