This project leverages xDeepDive to understand the intersections among existing paleo-data community resources such as the Neotoma Paleoecology Database, the Global Paleofire Database, and WorldClim. We use a list of terms compiled from researcher interviews that indicate likely sources of data used by researchers working in Holocene/Quaternary studies across a range of disciplines. This set of terms also includes various tools associated with these data resources.
The full list of terms and links used in the xDeepDive snippets search is in the `data` folder. We have identified 51 different resources from the interviews and 148 unique terms associated with these resources. Terms include URLs (e.g., `neotomadb.org` for the Neotoma Paleoecology Database), programming libraries (e.g., `rgbif` for the Global Biodiversity Information Facility) and alternate names, including initialisms (e.g., APD for the African Pollen Database).
Using the xDeepDive snippets API, we search for these terms to build a large table of DOIs, text snippets, and database terms. Initial testing shows this table to be quite large (>100k rows), in part because some resource names have low specificity.
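For example, a single-term search might hit a URL of the form below (the endpoint and parameter names here are assumptions based on the public xDD API, not necessarily what the scripts use):

```
https://xdd.wisc.edu/api/snippets?term=neotomadb.org&full_results=true
```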
This is an open project, and contributions are welcome from any individual. All contributors are bound by a code of conduct; please review and follow it as part of your contribution.
Issues and bug reports are always welcome. Code clean-up and feature additions can be made through pull requests to project forks or project branches.
All products of the Neotoma Paleoecology Database are licensed under an MIT License unless otherwise noted.
The repository is coded in Python and managed using the uv package manager. To run a script, first clone the repository and then, at the command line, enter:

```bash
uv sync
```

to install all necessary dependencies.
To run either of the two main scripts, enter:

```bash
uv run src/interop_dd.py
```

or

```bash
uv run src/networkgraph.py
```
The script to obtain text snippets is within the `src/interoperability_deepdive` folder. It is built around three functions (sketched below):

- Build the API URL -- `gdd_snippets`
- Page through the results list -- `gddURLcall`
- Process the JSON response from the API -- `process_hits`
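A minimal sketch of how these three functions might fit together, assuming the public xDD snippets endpoint and its `success`/`next_page` response fields; the exact URL, parameters, and response keys are assumptions, not the repository's actual code:

```python
import requests

# Assumed public xDD snippets endpoint; the repository may configure this differently.
BASE_URL = "https://xdd.wisc.edu/api/snippets"

def gdd_snippets(term):
    """Build the initial snippets API URL for a single search term (sketch)."""
    return requests.Request("GET", BASE_URL,
                            params={"term": term, "full_results": "true"}).prepare().url

def gddURLcall(url):
    """Page through the results list, following 'next_page' links (assumed key)."""
    hits = []
    while url:
        payload = requests.get(url, timeout=30).json().get("success", {})
        hits.extend(payload.get("data", []))
        url = payload.get("next_page") or None  # an empty value ends the loop
    return hits

def process_hits(hits, resource):
    """Flatten the JSON response into one dict per DOI/snippet pair (sketch)."""
    rows = []
    for hit in hits:
        for snippet in hit.get("highlight", []):
            rows.append({"doi": hit.get("doi"),
                         "highlight": snippet,
                         "title": hit.get("title"),
                         "resource": resource})
    return rows
```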
The resulting object is a `list` of `dict` items, structured to be written to a CSV file with the following structure:
DOI | highlight | title | resource |
---|---|---|---|
10.1016/j.epsl.2018.10.016 | "identified using several publications (see supplementary information), the African Pollen Database, and" | The roles of climate and human land-use in the late Holocene rainforest crisis of Central Africa | African Pollen Database |
10.1016/j.crte.2008.12.009 | "cm2 pe r year. The determination of 116 pollen taxa was made using the African Pollen Database reference" | Climate and environmental change at the end of the Holocene Humid Period: A pollen record off Pakistan | African Pollen Database |
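A minimal sketch of writing these rows out, assuming the list-of-`dict` structure described above (the output path is hypothetical):

```python
import csv

def write_rows(rows, path="data/xdd_results/snippets.csv"):
    """Write the list of dicts produced by process_hits to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["doi", "highlight", "title", "resource"])
        writer.writeheader()
        writer.writerows(rows)
```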
From this table we can manually examine records to assess match quality.
To build the network model for these resources, we look for co-occurrences of resources within publications. For example:
DOI | highlight | title | resource |
---|---|---|---|
10.1016/j.tree.2010.10.007 | databases, most notably those of North American Pollen Database (NAPD), European Pollen Database (EPD) | Exploring vegetation in the fourth dimension | European Pollen Database |
10.1016/j.tree.2010.10.007 | reviewed data from 36 beetle assemblages from Britain that are held in the BugsCEP database (http://www.bugscep.com) and exploited the specific | Exploring vegetation in the fourth dimension | BugsCEP |
These two records indicate co-citation of the European Pollen Database and BugsCEP. A significant challenge in this analysis is knowing whether co-citation actually reflects co-analysis of data (which may involve cross-walking and data translation). This is difficult both because of the structure of the xDeepDive API (which returns only individual "snippets", or sentences) and because of citation and data-use patterns in publications.
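A sketch of how co-citation edges could be derived for the network graph, grouping snippets by DOI and pairing the distinct resources found in each paper (function and variable names here are illustrative, not the repository's):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(rows):
    """Count how often each pair of resources is cited in the same paper."""
    by_doi = defaultdict(set)
    for row in rows:
        by_doi[row["doi"]].add(row["resource"])
    edges = defaultdict(int)
    for resources in by_doi.values():
        for pair in combinations(sorted(resources), 2):
            edges[pair] += 1  # one co-citation per DOI, regardless of snippet count
    return edges
```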
Ultimately, the scripts first search for the text strings and then process the returned results (in the `data/xdd_results` folder) into a single CSV file containing only DOIs that report more than one data resource.
The output data -- `data/doi_centric.json` -- is a JSON array of objects with the following schema:
```json
[
  {
    "doi": {
      "type": "string",
      "pattern": "^10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+$"
    },
    "resources": {
      "type": "array",
      "minItems": 2,
      "items": {
        "type": "string"
      }
    }
  }
]
```
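A minimal sketch of producing this file from the combined rows, keeping only DOIs associated with more than one distinct resource (file handling and names are assumptions):

```python
import json
from collections import defaultdict

def write_doi_centric(rows, path="data/doi_centric.json"):
    """Group resources by DOI and keep DOIs citing two or more resources."""
    by_doi = defaultdict(set)
    for row in rows:
        by_doi[row["doi"]].add(row["resource"])
    records = [{"doi": doi, "resources": sorted(resources)}
               for doi, resources in by_doi.items() if len(resources) >= 2]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```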
We care about several key measures: