Skip to content

A repository for managing the FAIROS Quaternary Interoperability project.

License

Notifications You must be signed in to change notification settings

NeotomaDB/Interoperability_DeepDive

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Examining Interoperability Among Databases

This project leverage xDeepDive to understand the intersection between existing paleo-data community resources such as the Neotoma Paleoecology Database, Global Paleofire Database or WorldClim. We use a list of terms compiled from researcher interviews that indicate likely sources of data used by researchers working in Holocene/Quaternary studies across a range of disciplines. Included in this set of terms are various tools associated with these different data resources.

The full list of terms and links used in the xDeepDive snippets search is in the data folder. We have 51 different resources identified from the interviews and 148 unique terms associated with these resources. Terms include URLs (e.g., neotomadb.org for the Neotoma Paleoecology Database), programming libraries (e.g., rgbif for the Global Biodiversity Information Facility) and alternate names, including initializations (e.g., APD for the African Pollen Database).

Using the xDeepDive snippets API we search for these terms to build a large table of DOIs, text snippets and database terms. Initial testing shows this table to be quite large (>100k rows), in part because some resources have low specificity in their naming.

Contributors

This project is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.

  • ORCID Simon Goring

Tips for Contributing

Issues and bug reports are always welcome. Code clean-up, and feature additions can be done either through pull requests to project forks or project branches.

All products of the Neotoma Paleoecology Database are licensed under an MIT License unless otherwise noted.

Using the Repository

The repository is coded in Python and managed using the uv Package Manager. To run a script, first clone the repository and then, at the command line enter:

uv install

to install all neccessary dependencies.

To run one of the two main scripts run either:

uv run src/interop_dd.py

or

uv run src/networkgraph.py

interop_dd: Harvesting Snippets

The script to obtain text snippets is within the src/interoperability_deepdive folder. These scripts:

  1. Build the API URL -- gdd_snippets
  2. Page through the results list -- gddURLcall
  3. Process the JSON response from the API -- process_hits

The resulting object is a list of dict items, structured to be submitted to a CSV file with the following structure:

DOI highlight title resource
10.1016/j.epsl.2018.10.016 "identified using several publications (see supplementary information), the African Pollen Database, and" The roles of climate and human land-use in the late Holocene rainforest crisis of Central Africa African Pollen Database
10.1016/j.crte.2008.12.009 cm2 pe r year. The determination of 116 pollen taxa was made using the African Pollen Database reference Climate and environmental change at the end of the Holocene Humid Period: A pollen record off Pakistan African Pollen Database

From this table we can manually examine records to assess match quality.

Processing Results

To effectively build the network model for these resources we look to co-occurrence of resources in publications. For example:

DOI highlight title resource
10.1016/j.tree.2010.10.007 databases, most notably those of North American Pollen Database (NAPD), European Pollen Database (EPD) Exploring vegetation in the fourth dimension European Pollen Database
10.1016/j.tree.2010.10.007 reviewed data from 36 beetle assemblages from Britain that are held in the BugsCEP database (http://www.bugscep.com) and exploited the specific Exploring vegetation in the fourth dimension BugsCEP

Would indicate co-citation of the European Pollen Database and BugsCEP. A significant challenge in this analysis is knowing whether or not co-citation explicitly includes co-analysis of data (which may involve cross-walking and data translation). This work is challenging because of the structure of the xDeepDive API (which only returns individual "snippets", or sentences) and because of citation and data use patterns in publication.

Ultimately, the scripts first search for text strings, and then process the returned results (in the data/xdd_results folder) into a single csv file containing only DOIs that report more than one data resource.

The output data -- data/doi_centric.json -- is a JSON object with the following structure:

[
  {
    "doi": {
        "type": "string",
        "pattern": "/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i"
    },
    "resources": {
        "type": "array",
        "minLength": 2,
        "items": {
            "type": "string"
        }
    }
  }
]

Statistical Analysis

We care about several key measures: