camiplata edited this page Aug 13, 2018 · 2 revisions

Welcome to the Biodiversity Data Quality with OpenRefine wiki!

This wiki is under construction 🔨🔧

About SiB Colombia

Starting with OpenRefine

Previous experience with OpenRefine is not needed. To use the Biodiversity Data Quality scripts you only need to install OpenRefine and learn how to upload your data.

Biodiversity Data Quality Scripts

Taxonomic Validation

Option 1, Automatic validation with the GBIF Species Matching web tool

Procedure:

  1. Matches original data with species matching output (by id or scientific name)
  2. Retrieves GBIF's rank and status allowing the user to evaluate the state of each name
  3. Retrieves GBIF's higher taxonomy for all names
  4. Compares GBIF's taxonomic suggestions with original taxonomy
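
The matching step above can be pictured as a left join on the scientific name. The following is only a minimal Python sketch of that idea, not the OpenRefine script itself: the function name `attach_matches`, the output column names `gbif_rank` and `gbif_status`, and the column `verbatimScientificName` in the 'normalized' file are assumptions made for illustration.

```python
def attach_matches(records, normalized,
                   record_key="scientificName",
                   match_key="verbatimScientificName"):
    """Left-join species-matching rows onto the original records.

    `match_key` is an assumed column name in the 'normalized' file;
    adjust it to whatever column holds the name that was submitted.
    """
    by_name = {row[match_key]: row for row in normalized}
    joined = []
    for rec in records:
        match = by_name.get(rec.get(record_key), {})
        # New columns go first, mirroring "stored at the beginning".
        merged = {
            "gbif_rank": match.get("rank"),
            "gbif_status": match.get("status"),
        }
        merged.update(rec)
        joined.append(merged)
    return joined
```

Records without a match keep empty (None) values in the new columns, so they are easy to filter and review.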

Conditions:

  1. File obtained from GBIF speciesMatching, named 'normalized'
  2. Dataset with column 'scientificName'

Warnings:

  • The GBIF speciesMatching web service accepts at most 6000 occurrences in a single query.
  • New data will be stored in columns at the beginning of the dataset
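
For datasets above the 6000-occurrence limit, one possible workaround is to split the names into batches before submitting them. This is a small Python sketch under that assumption; the helper name `batches` is hypothetical and not part of the scripts.

```python
def batches(rows, size=6000):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

Each batch can then be submitted separately and the resulting files joined back together.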

Option 2, Automatic validation with GBIF API

Procedure:

  1. Matches original scientificName with GBIF's taxonomic Backbone
  2. Retrieves GBIF's rank and status allowing the user to evaluate the state of each name
  3. Retrieves GBIF's higher taxonomy for all names
  4. Compares GBIF's taxonomic suggestions with original taxonomy using a boolean descriptor (1,0)
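
As a rough illustration of steps 1 to 3, the sketch below builds a request URL for GBIF's species match endpoint and keeps the rank, status and higher taxonomy from a response. It is written in Python rather than in OpenRefine's expression language, and the helper names `match_url`, `fetch_match` and `pick_fields` are hypothetical; the actual scripts may build the request differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_MATCH = "https://api.gbif.org/v1/species/match"
FIELDS = ("scientificName", "canonicalName", "rank", "status",
          "kingdom", "phylum", "class", "order", "family", "genus", "species")

def match_url(scientific_name):
    """Build the match URL for one scientificName."""
    return GBIF_MATCH + "?" + urlencode({"name": scientific_name})

def fetch_match(scientific_name):
    """Call the API and return the decoded JSON (needs network access)."""
    with urlopen(match_url(scientific_name)) as response:
        return json.load(response)

def pick_fields(match, fields=FIELDS):
    """Keep only the fields of interest; missing ones become None."""
    return {name: match.get(name) for name in fields}
```

Names that GBIF cannot match come back without most of these fields, so `pick_fields` leaves them as None for later review.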

Conditions:

  • Dataset with minimum 'scientificName' column
  • To obtain a validation of higher taxonomy these elements are also required: 'kingdom','phylum','class','order','family','genus'

Important:

The definitions of the elements retrieved by GBIF's API may differ from those of the online tool SpeciesMatching:

  • scientificName: GBIF's scientific name matching the scientificName of the query
  • canonicalName: GBIF's canonicalName matching the scientificName of the query
  • species: GBIF's accepted name for the GBIF scientific name that matches the scientificName of the query

Boolean descriptor conventions:

  • 0-GBIF's suggested name DOES NOT match the original name
  • 1-GBIF's suggested name matches the original name
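
The 1/0 convention can be sketched as a per-rank comparison. This Python example is an illustration only: the function name `compare_taxonomy`, the `Match` suffix on the output keys, and the choice to ignore case and surrounding whitespace are assumptions, not documented behaviour of the scripts.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")

def compare_taxonomy(original, suggested, ranks=RANKS):
    """Return 1 where GBIF's suggestion equals the original value, else 0."""
    result = {}
    for rank in ranks:
        a = (original.get(rank) or "").strip().lower()
        b = (suggested.get(rank) or "").strip().lower()
        # An empty original value is counted as a mismatch (0).
        result[rank + "Match"] = 1 if a and a == b else 0
    return result
```

Rows with a 0 in any rank are the ones to inspect when deciding whether to accept GBIF's suggestion.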

Warnings:

  • New data will be stored in columns at the beginning of the dataset
  • Taxonomy elements are reorganized to facilitate the taxonomic validation

Temporal Validation

Procedure:

  1. Calls the Canadensys date parsing API
  2. Cleans the output to obtain valid JSON
  3. Extracts ISO Date as text
  3. Extracts ISO Date as text

Conditions:

  • Dataset with column name 'eventDate'

Warnings:

  • New data will be stored in columns at the beginning of the dataset
  • Review output for nulls; Canadensys will not read all date formats
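
The extraction step, together with the null check from the warnings, might look like the Python sketch below. The response shape is an assumption: the field name `iso8601` is a placeholder for wherever the parser puts the ISO date, and `iso_date` is a hypothetical helper, so check the real response before reusing this.

```python
import json

def iso_date(raw_response, field="iso8601"):
    """Return the ISO date as text, or None when it cannot be extracted.

    `field` is an assumed key name, not a documented one.
    """
    try:
        payload = json.loads(raw_response)
    except (TypeError, ValueError):
        return None  # the response was not valid JSON
    if not isinstance(payload, dict):
        return None
    value = payload.get(field)
    return str(value) if value else None
```

A None result marks a row whose eventDate was not parsed and needs manual review.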

Geographical Validation

Procedure:

  1. Creates concatenated columns of geographic names
  2. Matches single and concatenated columns with DIVIPOLA
  3. Returns matched names when matching was possible

Conditions:

  • Dataset with columns 'stateProvince','county','municipality'
  • DIVIPOLA archive, latest version provided by SiB Colombia

Warnings:

  • New data will be stored in columns at the beginning of the dataset
  • Review rows where spMatch, spcMatch or spcmMatch is blank; those rows need to be fixed and standardized

Conventions:

  • spcm = stateProvince+county+municipality
  • spc = stateProvince+county
  • sp = stateProvince
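
The concatenate-and-match procedure can be sketched in Python as follows. The `|` separator, the helper names `geo_keys` and `match_geo`, and the layout of the DIVIPOLA lookup (one set of keys per level) are assumptions for illustration; the real scripts and the DIVIPOLA archive may be organised differently.

```python
SEP = "|"  # assumed separator for concatenated names

def geo_keys(row):
    """Build the sp, spc and spcm keys from one record."""
    sp = (row.get("stateProvince") or "").strip()
    county = (row.get("county") or "").strip()
    municipality = (row.get("municipality") or "").strip()
    return {
        "sp": sp,
        "spc": SEP.join([sp, county]),
        "spcm": SEP.join([sp, county, municipality]),
    }

def match_geo(row, divipola):
    """Return the matched key per level, or "" when there is no match.

    `divipola` maps "sp", "spc" and "spcm" to sets of valid keys.
    """
    return {
        level + "Match": (key if key in divipola[level] else "")
        for level, key in geo_keys(row).items()
    }
```

For example, with a lookup holding only "Antioquia" and "Antioquia|Medellín", a record with an unknown municipality gets spMatch and spcMatch filled and spcmMatch blank, which flags it for correction.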