This project leverages xDeepDive to understand the intersections among existing paleo-data community resources such as the Neotoma Paleoecology Database, the Global Paleofire Database, and WorldClim. We use a list of terms compiled from researcher interviews that indicate likely sources of data used by researchers working in Holocene/Quaternary studies across a range of disciplines. This set of terms also includes various tools associated with these data resources.
The full list of terms and links used in the xDeepDive snippets search is in the `data` folder. We have identified 51 different resources from the interviews and 148 unique terms associated with these resources. Terms include URLs (e.g., `neotomadb.org` for the Neotoma Paleoecology Database), programming libraries (e.g., `rgbif` for the Global Biodiversity Information Facility) and alternate names, including initialisms (e.g., APD for the African Pollen Database).
Using the xDeepDive snippets API, we search for these terms to build a large table of DOIs, text snippets, and database terms. Initial testing shows this table to be quite large (>100k rows), in part because some resource names have low specificity.
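For example, a single-term search might hit a URL of the form below (the endpoint and parameter names here are assumptions based on the public xDD API, not necessarily what the scripts use):

```
https://xdd.wisc.edu/api/snippets?term=neotomadb.org&full_results=true
```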
This is an open project, and contributions are welcome from any individual. All contributors are bound by a code of conduct; please review and follow it as part of your contribution.
Issues and bug reports are always welcome. Code clean-up and feature additions can be made through pull requests to project forks or project branches.
All products of the Neotoma Paleoecology Database are licensed under an MIT License unless otherwise noted.
The repository is coded in Python and managed using the uv package manager. To run a script, first clone the repository and then, at the command line, enter:

```bash
uv sync
```

to install all necessary dependencies.
To run either of the two main scripts, enter:

```bash
uv run src/interop_dd.py
```

or

```bash
uv run src/networkgraph.py
```
The script to obtain text snippets is within the `src/interoperability_deepdive` folder. It is built around three functions (sketched below):

- Build the API URL -- `gdd_snippets`
- Page through the results list -- `gddURLcall`
- Process the JSON response from the API -- `process_hits`
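A minimal sketch of how these three functions might fit together, assuming the public xDD snippets endpoint and its `success`/`next_page` response fields; the exact URL, parameters, and response keys are assumptions, not the repository's actual code:

```python
import requests

# Assumed public xDD snippets endpoint; the repository may configure this differently.
BASE_URL = "https://xdd.wisc.edu/api/snippets"

def gdd_snippets(term):
    """Build the initial snippets API URL for a single search term (sketch)."""
    return requests.Request("GET", BASE_URL,
                            params={"term": term, "full_results": "true"}).prepare().url

def gddURLcall(url):
    """Page through the results list, following 'next_page' links (assumed key)."""
    hits = []
    while url:
        payload = requests.get(url, timeout=30).json().get("success", {})
        hits.extend(payload.get("data", []))
        url = payload.get("next_page") or None  # an empty value ends the loop
    return hits

def process_hits(hits, resource):
    """Flatten the JSON response into one dict per DOI/snippet pair (sketch)."""
    rows = []
    for hit in hits:
        for snippet in hit.get("highlight", []):
            rows.append({"doi": hit.get("doi"),
                         "highlight": snippet,
                         "title": hit.get("title"),
                         "resource": resource})
    return rows
```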
The resulting object is a `list` of `dict` items, structured to be written to a CSV file with the following structure:
DOI | highlight | title | resource |
---|---|---|---|
10.1016/j.epsl.2018.10.016 | "identified using several publications (see supplementary information), the African Pollen Database, and" | The roles of climate and human land-use in the late Holocene rainforest crisis of Central Africa | African Pollen Database |
10.1016/j.crte.2008.12.009 | "cm2 pe r year. The determination of 116 pollen taxa was made using the African Pollen Database reference" | Climate and environmental change at the end of the Holocene Humid Period: A pollen record off Pakistan | African Pollen Database |
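A minimal sketch of writing these rows out, assuming the list-of-`dict` structure described above (the output path is hypothetical):

```python
import csv

def write_rows(rows, path="data/xdd_results/snippets.csv"):
    """Write the list of dicts produced by process_hits to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["doi", "highlight", "title", "resource"])
        writer.writeheader()
        writer.writerows(rows)
```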
From this table we can manually examine records to assess match quality.
To build the network model for these resources, we look for co-occurrences of resources within publications. For example:
DOI | highlight | title | resource |
---|---|---|---|
10.1016/j.tree.2010.10.007 | databases, most notably those of North American Pollen Database (NAPD), European Pollen Database (EPD) | Exploring vegetation in the fourth dimension | European Pollen Database |
10.1016/j.tree.2010.10.007 | reviewed data from 36 beetle assemblages from Britain that are held in the BugsCEP database (http://www.bugscep.com) and exploited the specific | Exploring vegetation in the fourth dimension | BugsCEP |
These two records indicate co-citation of the European Pollen Database and BugsCEP. A significant challenge in this analysis is knowing whether co-citation actually reflects co-analysis of data (which may involve cross-walking and data translation). This is difficult both because of the structure of the xDeepDive API (which returns only individual "snippets", or sentences) and because of citation and data-use patterns in publications.
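A sketch of how co-citation edges could be derived for the network graph, grouping snippets by DOI and pairing the distinct resources found in each paper (function and variable names here are illustrative, not the repository's):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(rows):
    """Count how often each pair of resources is cited in the same paper."""
    by_doi = defaultdict(set)
    for row in rows:
        by_doi[row["doi"]].add(row["resource"])
    edges = defaultdict(int)
    for resources in by_doi.values():
        for pair in combinations(sorted(resources), 2):
            edges[pair] += 1  # one co-citation per DOI, regardless of snippet count
    return edges
```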
Ultimately, the scripts first search for the text strings and then process the returned results (in the `data/xdd_results` folder) into a single CSV file containing only DOIs that report more than one data resource.
The output data -- `data/doi_centric.json` -- is a JSON array of objects with the following schema:
```json
[
  {
    "doi": {
      "type": "string",
      "pattern": "^10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+$"
    },
    "resources": {
      "type": "array",
      "minItems": 2,
      "items": {
        "type": "string"
      }
    }
  }
]
```
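A minimal sketch of producing this file from the combined rows, keeping only DOIs associated with more than one distinct resource (file handling and names are assumptions):

```python
import json
from collections import defaultdict

def write_doi_centric(rows, path="data/doi_centric.json"):
    """Group resources by DOI and keep DOIs citing two or more resources."""
    by_doi = defaultdict(set)
    for row in rows:
        by_doi[row["doi"]].add(row["resource"])
    records = [{"doi": doi, "resources": sorted(resources)}
               for doi, resources in by_doi.items() if len(resources) >= 2]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```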
We care about several key measures: