ESGF_Search_REST_API
The ESGF search service exposes a RESTful URL that can be used by clients (browsers and desktop clients) to query the contents of the underlying search index, and return results matching the given constraints. Because of the distributed capabilities of the ESGF search, the URL at any Index Node can be used to query that Node only, or all Nodes in the ESGF system.
The general syntax of the ESGF search service URL is:
http://<base_search_URL>/search?[keyword parameters as (name, value) pairs][facet parameters as (name,value) pairs]
where <base_search_URL> is the base URL of the search service at a given Index Node.
All parameters (keyword and facet) are optional. Also, the value of all parameters must be URL-encoded, so that the complete search URL is well formed.
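As a minimal sketch of the syntax above, the following Python snippet assembles a search URL with the standard library only. The helper name `build_search_url` is an illustrative assumption, not part of ESGF, and the base URL is the JPL Index Node that appears in the examples on this page. URL-encoding is delegated to `urlencode`.

```python
from urllib.parse import quote, urlencode

# Base URL of the search service at one Index Node (assumed: the JPL node from this page).
BASE_SEARCH_URL = "http://esg-datanode.jpl.nasa.gov/esg-search"


def build_search_url(params, base=BASE_SEARCH_URL):
    """Return a search URL for a list of (name, value) pairs (hypothetical helper)."""
    if not params:
        # All parameters are optional, so the bare URL is a valid request.
        return f"{base}/search"
    # quote_via=quote encodes spaces as %20; safe="*" keeps wildcards readable.
    return f"{base}/search?{urlencode(params, quote_via=quote, safe='*')}"


url = build_search_url([("query", '"specific humidity"'), ("limit", 5)])
print(url)
```

Passing the parameters as a list of pairs, rather than a dictionary, allows the same name to appear more than once in a request.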
Keyword parameters are query parameters that have reserved names, and are interpreted by the search service to control the fundamental nature of a search request: where to issue the request to, how many results to return, etc.
The following keywords are currently used by the system - see below for usage examples:
-
facets= to return facet values and counts
-
shards= to specify an explicit list of shards to be queried
-
offset= , limit= to paginate through the available results (default: offset=0, limit=10)
-
fields= to return only specific metadata fields for each matching result (default: fields=*)
-
format= to specify the response document output format
Facet parameters are "search categories" that can be used to apply constraints to the search, and thus reduce the number of results returned. Internally, facets are metadata fields (single valued or multi-valued) that are stored for each search record. The search service will select records for which the metadata field values match the corresponding facet constraints.
The following facets are core system facets, and their names are reserved in the system. These facets can be used as valid query parameters at _all_ sites in the federation.
-
query= for free text searches (default: query=*)
-
distrib=true to execute a distributed query, distrib=false to execute a local query (default: distrib=true)
-
id , master_id , instance_id : core record identifiers carrying different semantics - see later for detailed explanation.
-
title : record (short) title
-
description : record (longer) description
-
type : denotes the intrinsic type of the record. Currently supported values: Dataset, File, Aggregation (default: Dataset)
-
replica : indicates whether the record is the "master" copy, or a replica. Use replica=false to return only originals, replica=true to return only replicas (default: no replica flag specified, i.e. return both replicas and originals)
-
latest : indicates whether the record is the latest available version, or a previous version. Use latest=true to return only the latest version of all records, latest=false to return previous versions (default: no latest flag specified, i.e. return all versions)
-
data_node : indicates the Data Node where the data is stored
-
index_node : the Index Node where the data is published
-
version : the record version (a string)
-
timestamp : the date and time when the record was last modified
-
url : specific URL(s) to access the record
-
access : high level access capability available for a record
-
xlink : reference to external record documentation, such as technical notes
-
size : record size (for Datasets or Files)
-
checksum , checksum_type : file checksum value and type
-
number_of_files : number of files contained in a dataset
-
number_of_aggregations : number of aggregations in a dataset
-
dataset_id : the "id" value of the enclosing dataset (Files and Aggregations only)
-
tracking_id : the UUID assigned to a File by some special publication software, if available
-
drs_id : a templated string assigned to a Dataset by some special publication software, if available. Note: this field is deprecated .
-
start= , end= to execute a temporal range query
-
bbox=[west,south,east,north] to execute a spatial coverage query
-
from= , to= to execute a query based on the record last update date and time
Additionally, each ESGF Index Node can harvest and make available additional custom facets that are relevant to its projects and users. For example, most Index Nodes support the set of CMIP5 facets , plus others. These custom facets are configured by the Node administrator in the file /esgf/config/facets.properties and can be discovered by the user through the following query:
http://<base_search_URL>/search?facets=*&distrib=false&limit=0
Example:
- Determine all the allowed facet names and values at a specific site: http://esg-datanode.jpl.nasa.gov/esg-search/search?facets=*&limit=0&distrib=false
The following set of facets is supported by most ESGF Index Nodes in the federation, and can be used to discover/query/retrieve CMIP5 data. (the fa
-
CF Standard Name: cf_standard_name
-
Ensemble: ensemble
-
Experiment: experiment
-
Experiment Family: experiment_family
-
Institute: institute
-
MIP Table: cmor_table
-
Model: model
-
Project: project
-
Product: product
-
Realm: realm
-
Time Frequency: time_frequency
-
Variable: variable
-
Variable Long Name: variable_long_name
-
Instrument: source_id
Example:
- Determine all the possible values of the "model", "experiment" and "project" facets throughout the federation: http://esg-datanode.jpl.nasa.gov/esg-search/search?facets=model,experiment,project&limit=0
If no parameters at all are specified, the search service will execute a query using all the default values, specifically:
- query=* (query all records)
- distrib=true (execute a distributed search)
- type=Dataset (return results of type "Dataset")
Example:
- Query with all default values: http://esg-datanode.jpl.nasa.gov/esg-search/search
The keyword parameter query= can be specified to execute a query that matches the given text _anywhere_ in the records' metadata fields. The parameter value can be any expression following the Apache Lucene query syntax (because it is passed "as-is" to the back-end Solr query), and must be URL-encoded.
Examples:
-
Search for any text, anywhere: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=* (the default value of the query parameter)
-
Search for _humidity_ in all metadata fields: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=humidity
-
Search for the exact sentence _specific humidity_ in all metadata fields: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=%22specific%20humidity%22
-
Search for the words _specific_ AND _humidity_, but not necessarily in an exact sequence: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=specific%20humidity
-
Search for the word _observations_ ONLY in the metadata field _product_: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=product:observations
-
Using logical AND: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=airs%20AND%20humidity (must use upper case "AND")
-
Using logical OR: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=airs%20OR%20humidity (must use upper case "OR"). This is the same as using simply a blank space: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=airs%20humidity
-
Search for all datasets that match an id pattern: http://esg-datanode.jpl.nasa.gov/esg-search/search?query=id:obs4MIPs.NASA-JPL.AIRS.*
A request to the search service can be constrained to return only those records that match specific values for one or more facets. Specifically, a facet constraint is expressed through the general form: <facet_name>=<facet_value>, where <facet_name> is chosen from the controlled vocabulary of facet names configured at each site, and <facet_value> must match _exactly_ one of the possible values for that particular facet.
When specifying more than one facet constraint in the request, multiple values for the same facet are combined with a logical OR, while multiple values for different facets are combined with a logical AND. For example, _experiment=decadal2000&variable=hus_ will return all records that match _experiment=decadal2000_ AND _variable=hus_, while _variable=hus&variable=ta_ will return all records that match _variable=hus_ OR _variable=ta_.
A facet constraint can be negated by using the != operator. For example, _model!=CCSM_ searches for all items that do NOT match the CCSM model. Note that all negative facets are combined in logical AND; for example, _model!=CCSM&model!=HadCAM_ searches for all items that match neither _CCSM_ nor _HadCAM_.
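The combination rules above can be sketched in Python as follows. The helper `facet_query` is hypothetical: it only assembles the query string, repeating a facet name for OR, using different names for AND, and appending `!` to the name for a negated constraint. Keeping `!` unencoded is a choice made here for readability.

```python
from urllib.parse import quote, urlencode


def facet_query(include=(), exclude=()):
    """Build a query string from (facet, value) pairs (hypothetical helper).

    Pairs in `include` become facet=value; pairs in `exclude` become facet!=value.
    """
    pairs = list(include) + [(f"{name}!", value) for name, value in exclude]
    # safe="!" keeps the "!=" operator literal instead of encoding it as %21.
    return urlencode(pairs, quote_via=quote, safe="!")


# (variable=hus OR variable=ta) AND project=obs4MIPs AND NOT model=Obs-AIRS
qs = facet_query(
    include=[("project", "obs4MIPs"), ("variable", "hus"), ("variable", "ta")],
    exclude=[("model", "Obs-AIRS")],
)
print(qs)
```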
By default, no facet counts are returned in the output document. Facet counts must be explicitly requested by specifying the facet names individually (for example: _facets=experiment,model_) or via the special notation _facets=*_. The facets list must be comma-separated, and white spaces are ignored. Note also that at this time, the special notation _facets=*_ will only count those facets that are explicitly configured in the file _application-context.xml_.
If facet counts are requested, facet values are sorted alphabetically (facet.sort=lex), and all facet values are returned (facet.limit=-1), provided they match one or more records (facet.mincount=1).
Every request to the ESGF search services is constrained by the type facet, so that the appropriate records can be examined and returned. If type is not specified explicitly, the default value type=Dataset is used.
Examples:
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?cf_standard_name=air_temperature
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?cf_standard_name=air_temperature&project=obs4MIPs
-
Combining two values of the same facet with a logical _OR_: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=obs4MIPs&variable=hus&variable=ta (search for all observational datasets that have variable _ta_ or _hus_)
-
Using a negative facet:
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?project=obs4MIPs&variable=hus&variable=ta&model!=Obs-AIRS (search for all observational datasets that have variable _ ta _ or _ hus _ , excluding those produced by _ AIRS _ )
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?project=obs4MIPs&variable!=ta&variable!=huss (search for all observational datasets that contain neither variable _ta_ nor variable _huss_)
-
Search by tracking id: http://esg-datanode.jpl.nasa.gov/esg-search/search?type=File&tracking_id=2209a0d0-9b77-4ecb-b2ab-b7ae412e7a3f
-
Search by checksum: http://esg-datanode.jpl.nasa.gov/esg-search/search?type=File&checksum=cbff465c9cd8c9833fd7b85235be2d47
-
Issue a query for all supported facets and their values at one site, while returning no results (note that only facets with one or more values are returned): http://esg-datanode.jpl.nasa.gov/esg-search/search?facets=*&limit=0&distrib=false
The keyword parameters start= and/or end= can be used to query for data with temporal coverage that _ overlaps _ the specified range. The parameter values can either be date-times in the format "YYYY-MM-DDTHH:MM:SSZ" (UTC ISO 8601 format), or special values supported by the Solr DateMath syntax.
Examples:
-
Search for data in the past year: http://esg-datanode.jpl.nasa.gov/esg-search/search?start=NOW-1YEAR (translates into the constraint datetime_stop > NOW-1YEAR)
-
Search for data before the year 2000: http://esg-datanode.jpl.nasa.gov/esg-search/search?end=2000-01-01T00:00:00Z (translates into the constraint datetime_start < 2000-01-01)
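A small Python sketch of how a client might produce the date-time values for start= and end=. The helper names are illustrative assumptions; only the "YYYY-MM-DDTHH:MM:SSZ" format comes from the text above.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode


def to_solr_datetime(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SSZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def temporal_query(start=None, end=None):
    """Build the start=/end= part of a query string (hypothetical helper)."""
    params = {}
    if start is not None:
        params["start"] = to_solr_datetime(start)
    if end is not None:
        params["end"] = to_solr_datetime(end)
    # safe=":" leaves the colons in the timestamps unencoded for readability.
    return urlencode(params, quote_via=quote, safe=":")


qs = temporal_query(end=datetime(2000, 1, 1, tzinfo=timezone.utc))
print(qs)  # end=2000-01-01T00:00:00Z
```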
The keyword parameter bbox=[west, south, east, north] can be used to query for data with spatial coverage that _ overlaps _ the given bounding box.
Examples:
- http://esg-datanode.jpl.nasa.gov/esg-search/search?bbox=%5B-10,-10,+10,+10%5D (translates to: east_degrees:[-10 TO *] AND north_degrees:[-10 TO *] AND west_degrees:[* TO 10] AND south_degrees:[* TO 10])
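The square brackets in a bbox value must be percent-encoded. A minimal sketch, assuming a hypothetical helper `bbox_query` and the [west,south,east,north] ordering given above:

```python
from urllib.parse import quote, urlencode


def bbox_query(west, south, east, north):
    """Build the bbox= part of a query string (hypothetical helper)."""
    if south > north:
        raise ValueError("expected south <= north")
    value = f"[{west},{south},{east},{north}]"
    # Brackets are percent-encoded (%5B, %5D); safe="," leaves the commas as they are.
    return urlencode({"bbox": value}, quote_via=quote, safe=",")


print(bbox_query(-10, -10, 10, 10))  # bbox=%5B-10,-10,10,10%5D
```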
The keyword parameters from= and/or to= can be used to query for data that was last updated in a given time range. These queries are executed against the "timestamp" field of the Solr records, which represents the date and time when the record was last modified. Note that if the timestamp cannot be set from the source metadata for that record, it is left unassigned so as not to bias the query for records that have a valid timestamp.
When parsing THREDDS catalogs, the timestamp is assigned from the value of the properties creation_time (for datasets) and mod_time (for files), which are interpreted in the local time zone (local to the harvesting agent), and converted to UTC for input into the index. For example, the input value of creation_time="2012-03-15 12:59:09" (in the PDT time zone) becomes timestamp="2012-03-15T19:59:09Z".
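The conversion in this example can be reproduced with a short Python sketch. It uses a fixed UTC-7 offset to stand in for PDT, which is an assumption made here to keep the snippet self-contained. The function name is illustrative and is not the harvester's actual code.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-7 offset standing in for PDT (assumption: the harvesting agent ran in that zone).
PDT = timezone(timedelta(hours=-7), "PDT")


def to_index_timestamp(local_text, local_zone):
    """Interpret a 'YYYY-MM-DD HH:MM:SS' string in local_zone and return it in UTC."""
    local = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=local_zone)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_index_timestamp("2012-03-15 12:59:09", PDT))  # 2012-03-15T19:59:09Z
```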
The constraint values can either be date-times in the format "YYYY-MM-DDTHH:MM:SSZ" (UTC ISO 8601 format), or special values supported by the Solr DateMath syntax.
Examples:
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?from=2010-10-19T22:00:00Z&to=NOW
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?from=2010-10-19T22:00:00Z
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?to=2011-10-23T22:00:00Z
-
http://esg-datanode.jpl.nasa.gov/esg-search/search?from=NOW-1DAY
The keyword parameter distrib= can be used to control whether the query is executed against the local Index Node only, or distributed to all other Nodes in the federation. If not specified, the default value distrib=true is assumed.
Examples:
-
Search for all datasets in the federation: http://esg-datanode.jpl.nasa.gov/esg-search/search?distrib=true
-
Search for all datasets at one Node only: http://esg-datanode.jpl.nasa.gov/esg-search/search?distrib=false
By default, a distributed query (_distrib=true_) targets all ESGF Nodes in the current peer group, i.e. all nodes that are listed in the local configuration file /esg/config/esgf_shards.xml, which is continuously updated by the local node manager to reflect the latest state of the federation. To target only one or more specific nodes, list them in the _shards_ parameter: _shards=hostname1:port1/solr,hostname2:port2/solr,..._ Note that an explicit shards value is ignored if _distrib=false_ (distrib=true is the default if not otherwise specified).
Examples:
-
Query for CMIP5 data at the PCMDI and BADC sites only: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&shards=pcmdi9.llnl.gov:8983/solr,esgf-index1.ceda.ac.uk:8983/solr
-
Query for all files belonging to a given dataset at one site only: http://esg-datanode.jpl.nasa.gov/esg-search/search?type=File&shards=esg-datanode.jpl.nasa.gov:8983/solr&dataset_id=obs4MIPs.CNES.AVISO.mon.v1%7Cesg-datanode.jpl.nasa.gov
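As a sketch of how the shards value is assembled, the hypothetical helper below joins host names into the comma-separated form shown above. The port 8983 and the /solr suffix are taken from the examples and may differ at other sites.

```python
from urllib.parse import quote, urlencode


def shards_value(hosts, port=8983):
    """Join host names into a shards= value (hypothetical helper).

    The default port and the /solr suffix follow the examples above.
    """
    return ",".join(f"{host}:{port}/solr" for host in hosts)


params = {
    "project": "CMIP5",
    "shards": shards_value(["pcmdi9.llnl.gov", "esgf-index1.ceda.ac.uk"]),
}
# safe=":/," leaves the host:port/solr list unencoded, as in the examples above.
qs = urlencode(params, quote_via=quote, safe=":/,")
print(qs)
```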
Replicas (Datasets and Files) are distinguished from the original record (a.k.a. the _ master _ ) in the Solr index by the value of two special keywords:
-
_ replica _ : a flag that is set to false for master records, true for replica records.
-
_ master_id _ : a string that is identical for the master and all replicas of a given logical record (Dataset or File).
By default, a query returns all records (masters and replicas) matching the search criteria, i.e. no _replica_ constraint is used. To return only master records, use _replica=false_; to return only replicas, use _replica=true_. To search for all identical Datasets or Files (i.e. for the master AND replicas of a Dataset or File), use _master_id=..._ .
Examples:
-
Search for all datasets in the system (masters and replicas): http://esg-datanode.jpl.nasa.gov/esg-search/search
-
Search for just master datasets, no replicas: http://esg-datanode.jpl.nasa.gov/esg-search/search?replica=false
-
Search for just replica datasets, no masters: http://esg-datanode.jpl.nasa.gov/esg-search/search?replica=true
-
Search for the master AND replicas of a given dataset: http://esg-datanode.jpl.nasa.gov/esg-search/search?master_id=cmip5.output1.BCC.bcc-csm1-1.1pctCO2.day.atmos.day.r1i1p1
-
Search for the master and replicas of a given file: http://esg-datanode.jpl.nasa.gov/esg-search/search?type=File&master_id=cmip5.output1.BCC.bcc-csm1-1.1pctCO2.day.atmos.day.r1i1p1.huss_day_bcc-csm1-1_1pctCO2_r1i1p1_01600101-02991231.nc
By default, a query to the ESGF search services will return all versions of the matching records (Datasets or Files). To return only the latest, up-to-date version, include _latest=true_. To return a specific version, use _version=_. Using _latest=false_ will return only datasets that were _superseded_ by newer versions.
Examples:
-
Search for all latest CMIP5 datasets: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&latest=true
-
Search for all versions of a given dataset: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&master_id=cmip5.output1.NSF-DOE-NCAR.CESM1-CAM5-1-FV2.historical.mon.atmos.Amon.r1i1p1&facets=version
-
Search for a specific version of a given dataset: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&master_id=cmip5.output1.NSF-DOE-NCAR.CESM1-CAM5-1-FV2.historical.mon.atmos.Amon.r1i1p1&version=20120712
By default, a query to the search service will return the first 10 records matching the given constraints. The offset into the returned results and the number of results returned can be changed through the keyword parameters offset= and limit=, respectively. The system imposes a maximum value of 10,000 for limit.
Examples:
-
Query for 100 CMIP5 datasets in the system: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&limit=100
-
Query for the next 100 CMIP5 datasets in the system: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&limit=100&offset=100
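Paging through a large result set amounts to repeating the same request with an increasing offset. The sketch below generates one query string per page. The helper name is an assumption, and the total number of matches is assumed to be known already, for example from an earlier response.

```python
from urllib.parse import urlencode

MAX_LIMIT = 10000  # maximum page size stated above


def page_queries(constraints, total, limit=100):
    """Yield one query string per page needed to cover `total` records (hypothetical helper)."""
    if not 0 < limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    for offset in range(0, total, limit):
        yield urlencode({**constraints, "limit": limit, "offset": offset})


pages = list(page_queries({"project": "CMIP5"}, total=250, limit=100))
for qs in pages:
    print(qs)
```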
By default, the results returned by a search are unsorted. The query parameter sort=true can be used to sort the returned results in inverse order of last modification time, i.e. to return the most up to date records first.
Example:
- Return the most recent datasets with variable "hus" published to the ESGF system: http://esg-datanode.jpl.nasa.gov/esg-search/search?variable=hus&sort=true&fields=timestamp,variable
The keyword parameter format= can be used to request results in a specific output format. Currently the only available options are Solr/XML (the default) and Solr/JSON.
Examples:
-
Request results in Solr XML format: http://esg-datanode.jpl.nasa.gov/esg-search/search?format=application%2Fsolr%2Bxml
-
Request results in Solr JSON format: http://esg-datanode.jpl.nasa.gov/esg-search/search?format=application%2Fsolr%2Bjson
By default, all available metadata fields are returned for each result. The keyword parameter fields= can be used to limit the number of fields returned in the response document, for each matching result. The list must be comma-separated, and white spaces are ignored. Use _fields=*_ to return all fields (same as not specifying it, since it is the default). Note that the pseudo field _score_ is always appended to any fields list.
Examples:
-
Return all available metadata fields for CMIP5 datasets: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&fields=*
-
Return only the _ model _ and _ experiment _ fields for CMIP5 datasets: http://esg-datanode.jpl.nasa.gov/esg-search/search?project=CMIP5&fields=model,experiment
Each search record in the system is assigned the following identifiers (all of type string):
- id : universally unique for each record across the federation, i.e. specific to each dataset or file, version and replica (and the data node storing the data). It is intended to be "opaque", i.e. it should not be parsed by clients to extract any information.
* Example: id=obs4MIPs.CNES.AVISO.mon.v1|esg-datanode.jpl.nasa.gov
- master_id : same for all replicas and versions across the federation. When parsing THREDDS catalogs, it is extracted from the properties "dataset_id" or "file_id".
* Example: obs4MIPs.CNES.AVISO.mon
- instance_id : same for all replicas across federation, but specific to each version. When parsing THREDDS catalogs, it is extracted from ID attribute of tag in THREDDS (for both Datasets and Files).
* Example: obs4MIPs.CNES.AVISO.mon.v1
Note also that the record version is the same for all replicas of that record, but different across versions. Examples:
- version=20120201
- version=1
In the returned Solr XML output document, URLs that are access points for Datasets and Files are encoded as a 3-tuple of the form url|mime type|service name, where the fields are separated by the _|_ character, and the _mime type_ and _service name_ are chosen from the ESGF controlled vocabulary.
Examples of Dataset URLs:
Examples of File URLs:
-
gsiftp://esg.anl.gov:2811//Hiram/atmos/av/annual_1year/atmos.1980.ann.nc|application/gridftp|GridFTP
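A client can recover the three fields by splitting on the "|" separator. The following Python sketch uses illustrative names (`AccessPoint`, `parse_access_point`) that are not part of ESGF, and it assumes that none of the three fields contains a "|" character itself.

```python
from typing import NamedTuple


class AccessPoint(NamedTuple):
    url: str
    mime_type: str
    service_name: str


def parse_access_point(encoded):
    """Split a 'url|mime type|service name' string into its three fields."""
    parts = encoded.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields separated by '|', got {len(parts)}")
    return AccessPoint(*parts)


point = parse_access_point(
    "gsiftp://esg.anl.gov:2811//Hiram/atmos/av/annual_1year/atmos.1980.ann.nc"
    "|application/gridftp|GridFTP"
)
print(point.service_name)  # GridFTP
```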
The same RESTful API that is used to query the ESGF search services can also be used, with minor modifications, to generate a Wget script to download all files matching the given constraints. Specifically, each ESGF Index Node exposes the following URL for generating Wget scripts:
http://<base_search_URL>/wget?[keyword parameters as (name, value) pairs][facet parameters as (name,value) pairs]
where again <base_search_URL> is the base URL of the search service at a given Index Node. The only syntax differences with respect to the search URL are:
-
The keyword parameter _ type= _ is not allowed, as the wget URL always assumes type=File .
-
The keyword parameter _ format= _ is not allowed, as the wget URL always returns a shell script as response document.
-
The keyword parameter _ limit= _ is assigned a default value of limit=1000 (and must still be limit < 10,000).
-
The keyword parameter _ download_structure= _ is used for defining a relative directory structure for the download by using the facets value (i.e. of Files and not Datasets). For example, if you want to create a CMIP5 directory structure on your local computer and to copy your download files into this structure, run the Wget script created by http://esgf-data.dkrz.de/esg-search/wget?download_structure=project,product,institute,model,experiment,time_frequency,realm,cmor_table,ensemble,version,variable&project=CMIP5&experiment=historical&cmor_table=Amon&variable=tas&variable=pr
-
The keyword parameter _download_emptypath=_ is used to define what to do if download_structure is set and a facet returns no value (e.g. mixing files from CMIP5 and obs4MIPs and selecting _instrument_ as a facet value will result in all CMIP5 files returning an empty value).
A typical workflow consists of first identifying all datasets or files matching some scientific criteria, then changing the request URL from "/search?" to "/wget?" to generate the corresponding shell script for bulk download of files.
Example:
- Download all observational files with variable _ hus _ : http://esg-datanode.jpl.nasa.gov/esg-search/wget/?variable=hus&project=obs4MIPs&distrib=false
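The switch from a search URL to a Wget URL can be sketched as below. The helper `search_to_wget` is hypothetical: it swaps the final path segment and drops the two keyword parameters that the wget URL does not allow, leaving all other constraints untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Keyword parameters that the wget URL does not accept, per the list above.
NOT_ALLOWED = {"type", "format"}


def search_to_wget(search_url):
    """Turn a /search URL into the matching /wget URL (hypothetical helper)."""
    parts = urlsplit(search_url)
    if not parts.path.endswith("/search"):
        raise ValueError("expected a URL whose path ends in /search")
    path = parts.path[: -len("search")] + "wget"
    pairs = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in NOT_ALLOWED
    ]
    # safe="!*" keeps negated facets (name!=value) and wildcards unencoded.
    query = urlencode(pairs, safe="!*")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


wget_url = search_to_wget(
    "http://esg-datanode.jpl.nasa.gov/esg-search/search"
    "?type=File&variable=hus&project=obs4MIPs&distrib=false"
)
print(wget_url)
```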
For more information on the Wget script, see ESGF_wget.