Home

Welcome to the Biodiversity Data Quality with OpenRefine wiki!

This wiki is under construction 🔨🔧

About SiB Colombia

Starting with Open Refine

No previous experience with OpenRefine is needed. To use the Biodiversity Data Quality scripts, you only need to install OpenRefine and learn how to upload your data.

Biodiversity Data Quality Scripts

Taxonomic Validation

Option 1, Automatic validation with GBIF Species Matching Web

Procedure:

Matches original data with species matching output (by id or scientific name)
Retrieves GBIF's rank and status allowing the user to evaluate the state of each name
Retrieves GBIF's higher taxonomy for all names
Compares GBIF's taxonomic suggestions with the original taxonomy

Conditions:

File obtained from the GBIF species matching tool, named 'normalized'
Dataset with a 'scientificName' column

Warnings:

The GBIF species matching web service accepts at most 6000 occurrences in a single query.
New data will be stored in columns at the beginning of the dataset
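
The batching and matching steps of Option 1 can be sketched outside OpenRefine as follows. This is a minimal illustration, not the wiki's script: the function names are hypothetical, and the `verbatimScientificName` column assumed for the 'normalized' file should be adjusted to whatever column your export actually uses for the submitted name.

```python
from itertools import islice

GBIF_BATCH_LIMIT = 6000  # per-query limit stated in the warning above


def batches(rows, size=GBIF_BATCH_LIMIT):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def join_by_name(original, normalized,
                 key="scientificName",
                 normalized_key="verbatimScientificName"):  # assumed column name
    """Attach the matching 'normalized' row (or None) to each original row."""
    lookup = {row[normalized_key]: row for row in normalized}
    return [{**row, "gbif": lookup.get(row[key])} for row in original]
```

Rows whose `gbif` value is `None` had no counterpart in the species matching output and need manual review.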

Option 2, Automatic validation with GBIF API

Procedure:

Matches the original scientificName with GBIF's taxonomic backbone
Retrieves GBIF's rank and status allowing the user to evaluate the state of each name
Retrieves GBIF's higher taxonomy for all names
Compares GBIF's taxonomic suggestions with the original taxonomy using a boolean descriptor (1,0)

Conditions:

Dataset with at least a 'scientificName' column
To validate the higher taxonomy, these columns are also required: 'kingdom', 'phylum', 'class', 'order', 'family', 'genus'

Important:

The definitions of the elements retrieved by GBIF's API may differ from those of the online species matching tool

scientificName: GBIF's scientific name matching the scientificName of the query
canonicalName: GBIF's canonicalName matching the scientificName of the query
species: GBIF's accepted name for the GBIF scientific name that matches the scientificName of the query

Conventions for the boolean descriptor

0-GBIF's suggested name DOES NOT match the original name
1-GBIF's suggested name matches the original name
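
The comparison behind the boolean descriptor can be sketched as below. This is an illustration under assumptions, not the wiki's script: it builds a request URL for GBIF's public species match endpoint and compares an already parsed JSON response with the original record. The function names and the `...Match` output keys are hypothetical, and treating a hit on either `scientificName` or `canonicalName` as a match is a choice made for this sketch.

```python
from urllib.parse import urlencode

# GBIF species match endpoint (name matching against the backbone taxonomy).
MATCH_URL = "https://api.gbif.org/v1/species/match"
HIGHER_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]


def match_url(scientific_name):
    """Build the request URL for one scientificName."""
    return MATCH_URL + "?" + urlencode({"name": scientific_name})


def compare(record, gbif):
    """Return 1/0 flags comparing the original record with a parsed GBIF response."""
    flags = {}
    name = record.get("scientificName", "").strip()
    # Assumption: a hit on either the full or the canonical name counts as a match.
    suggested = {gbif.get("scientificName"), gbif.get("canonicalName")}
    flags["scientificNameMatch"] = int(name in suggested)
    # Higher ranks are compared only when the original record provides them.
    for rank in HIGHER_RANKS:
        if rank in record:
            flags[rank + "Match"] = int(record[rank].strip() == gbif.get(rank))
    return flags
```

Fetching the URL and parsing the JSON body is left out on purpose, so the comparison can be checked against a saved response before any request is made.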

Warnings:

New data will be stored in columns at the beginning of the dataset
Taxonomy elements are reorganized to facilitate the taxonomic validation

Temporal Validation

Procedure:

Calls the Canadensys date parsing API
Cleans the output to obtain valid JSON
Extracts ISO Date as text

Conditions:

Dataset with a column named 'eventDate'

Warnings:

New data will be stored in columns at the beginning of the dataset
Review the output for nulls; Canadensys does not parse all date formats
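
The review step can be sketched as a check over the extracted ISO dates. This is a minimal illustration that does not call Canadensys: the `isoDate` column name and both function names are hypothetical, and the check accepts only full `YYYY-MM-DD` dates, so partial dates would also be flagged for review.

```python
import re
from datetime import date

ISO_DAY = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(text):
    """True if `text` is a valid calendar date written as YYYY-MM-DD."""
    if not text or not ISO_DAY.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)  # rejects impossible dates such as 2015-02-30
    except ValueError:
        return False
    return True


def rows_to_review(rows, column="isoDate"):  # hypothetical output column name
    """Return the indexes of rows whose parsed date is blank or not a valid ISO date."""
    return [i for i, row in enumerate(rows) if not is_iso_date(row.get(column))]
```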

Geographical Validation

Procedure:

Creates concatenated columns of geographic names
Matches single and concatenated columns with DIVIPOLA
Returns matched names when matching was possible

Conditions:

Dataset with columns 'stateProvince','county','municipality'
DIVIPOLA archive, latest version provided by SiB Colombia

Warnings:

New data will be stored in columns at the beginning of the dataset
Review rows where the output (spMatch, spcMatch, spcmMatch) is blank; those rows need to be fixed and standardized

Conventions:

spcm = stateProvince+county+municipality
spc = stateProvince+county
sp = stateProvince
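
The concatenate-and-match procedure can be sketched as follows. This is an illustration under assumptions, not the wiki's script: the reference rows are assumed to use the same three column names as the dataset (the real DIVIPOLA archive may be laid out differently), the accent-insensitive normalization and the `|` separator are choices made for this sketch, and all function names are hypothetical.

```python
import unicodedata

LEVEL_COLUMN = {"sp": "stateProvince", "spc": "county", "spcm": "municipality"}


def normalize(name):
    """Lowercase, trim and strip accents so 'Medellín' equals 'MEDELLIN'."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keys(row):
    """Build the sp, spc and spcm concatenated keys for one row."""
    sp = normalize(row.get("stateProvince", ""))
    c = normalize(row.get("county", ""))
    m = normalize(row.get("municipality", ""))
    return {"sp": sp, "spc": sp + "|" + c, "spcm": sp + "|" + c + "|" + m}


def build_reference(divipola_rows):
    """Map each concatenated key to the official name at that level."""
    ref = {"sp": {}, "spc": {}, "spcm": {}}
    for row in divipola_rows:
        for level, key in keys(row).items():
            ref[level].setdefault(key, row[LEVEL_COLUMN[level]])
    return ref


def match(row, ref):
    """Return the official name per level, or blank when no match was possible."""
    return {level + "Match": ref[level].get(key, "") for level, key in keys(row).items()}
```

A blank `spcmMatch` next to a filled `spcMatch` points to the municipality value as the one to fix.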
