Extend basisOfRecord vocabulary #84

Open
adam-collins opened this issue May 31, 2021 · 13 comments

Comments

@adam-collins

Please consider the following additions to basisOfRecord vocabulary.

  1. Include subset of ALA terms.
  2. Alignment with current DwC.
| Proposed addition | Reasoning |
|---|---|
| Event | https://dwc.tdwg.org/list/#dwc_Event |
| Taxon | https://dwc.tdwg.org/list/#dwc_Taxon |
| EnvironmentalDNA | example data resource https://collections.ala.org.au/public/show/dr10487 |
| GenomicDNA | example data resource https://collections.ala.org.au/public/show/dr658 |
| NomenclaturalChecklist | example data resource https://collections.ala.org.au/public/show/dr647 |

Can I get clarification on the gbif-api-0.49 basisOfRecord terms Unknown and Observation? I read them as equivalent in the document https://rs.gbif.org/vocabulary/dwc/basis_of_record.xml. Is there a relationship between these and the dwc:basisOfRecord example Occurrence?

Reference
https://rs.gbif.org/vocabulary/dwc/basis_of_record.xml
http://rs.tdwg.org/dwc/terms/basisOfRecord
https://support.ala.org.au/support/solutions/articles/6000197141-what-is-the-basis-of-record-

| gbif-api-0.49 | dwc:basisOfRecord | ALA |
|---|---|---|
| PreservedSpecimen | PreservedSpecimen | PreservedSpecimen |
| FossilSpecimen | FossilSpecimen | FossilSpecimen |
| LivingSpecimen | LivingSpecimen | LivingSpecimen |
| MaterialSample | MaterialSample | MaterialSample |
| HumanObservation | HumanObservation | HumanObservation |
| MachineObservation | MachineObservation | MachineObservation |
| Literature | | Literature |
| | Event | |
| | Taxon | |
| Observation (same as Occurrence?) | Occurrence | |
| Unknown (same as Occurrence?) | | |
| | | Germplasm |
| | | Image |
| | | NomenclaturalChecklist |
| | | RegionalChecklist |
| | | Sound |
| | | Video |
| | | GenomicDNA |
| | | EnvironmentalDNA |

@timrobertson100
Member

timrobertson100 commented May 31, 2021

Thanks @adam-collins

I've asked @tucotuco to comment on this from a Darwin Core perspective. Some of the proposals in here mix up the original intention for basisOfRecord and type (e.g. Sound, MovingImage (video)) and others are really a dataset type (e.g. RegionalChecklist). BasisOfRecord is a terribly overloaded term, and the original intended use is not really what people expect of it.

There is a related thread here proposing to bring in richer dataset categories, which would be carried over to occurrences to aid in filtering. That approach was proposed to avoid breaking BOR for others (e.g. those who rely on it to infer the class definitions). My own feeling is that this is the better way to accommodate the intention I assume is behind this request.

Can I get clarification on gbif-api-0.49 basisOfRecord terms Unknown and Observation.

Observation is the superset of machine and human observations (i.e. observed but no physical evidence taken), while unknown and occurrence are effectively equivalent, dating back to the early DwC days (pre-2009 edition). Literature is also a leftover from very old data.
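
The relationships described above can be sketched as a small filter helper. This is only an illustration under stated assumptions: the names `LEGACY_EQUIVALENTS`, `OBSERVATION_SUBTYPES`, and `matches_filter` are hypothetical and are not taken from any GBIF library.

```python
# Hypothetical helper, not GBIF code: treats Unknown as equivalent to
# Occurrence, and Observation as the superset of the two observation types.
LEGACY_EQUIVALENTS = {"Unknown": "Occurrence"}
OBSERVATION_SUBTYPES = {"HumanObservation", "MachineObservation"}


def matches_filter(filter_term: str, record_term: str) -> bool:
    """Return True if a record's basisOfRecord satisfies the filter term."""
    wanted = LEGACY_EQUIVALENTS.get(filter_term, filter_term)
    actual = LEGACY_EQUIVALENTS.get(record_term, record_term)
    if wanted == "Observation":
        # Filtering on the superset also matches its subtypes.
        return actual == "Observation" or actual in OBSERVATION_SUBTYPES
    return wanted == actual
```

With this sketch, a filter on Observation would match HumanObservation records, but a filter on HumanObservation would not match a record that only says Observation.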

@mdoering
Member

mdoering commented May 31, 2021

For NomenclaturalChecklist and RegionalChecklist GBIF uses the DatasetSubtype vocabulary. We do not apply BoR to Taxon records, even though DwC places BoR at the record level and thus allows doing so.

@tucotuco

From the DwC perspective, BoR was originally meant to designate which of the Darwin Core classes was the primary perspective upon which a view was based - the one in a one-to-many relationship between csv-encoded tables. The primary view of interest was with Occurrence as the core because specimens were the first record type shared with the proto-Darwin Core.

Nevertheless, it was anticipated that the view could be "inverted" (e.g., Event-centered Occurrences with Event as the Core) or partial (e.g., a gazetteer with Location as the Core, or a nomenclatural checklist with Taxon as the Core). In order to distinguish Occurrences where vouchers existed from those where they didn't, subtypes of Occurrence (PreservedSpecimen, FossilSpecimen, and LivingSpecimen) were created.

At the same time it seemed useful to also have subtypes for the evidence that remained from observation-based Occurrences. These already existed and were borrowed from the Dublin Core type vocabulary (StillImage, MovingImage, Sound) as concrete subtypes of an abstract Observation class, all in a Darwin Core type vocabulary with namespace dwctype: as a formal controlled vocabulary for basisOfRecord. All of these were part of a formal vocabulary similar to the DCMI vocabulary for dc:type.

One of the problems with the type vocabulary was that we were incorrectly mixing the type vocabularies for dctype: and dwctype:. A second problem was the recognition from the outset that there would be community pressure to diversify the basisOfRecord values for ever more specific categories. We remain with yet a third problem, which is the tendency to use the basisOfRecord as the "Evidence" for Occurrences, when most of the time the evidence falls into many categories.

To avoid the first of the issues above and pave the way for a solution to the second problem, the Darwin Core type vocabulary was deprecated and classes were created in the dwc: namespace for those that didn't already exist. dc:type and dwc:basisOfRecord were both included in the Darwin Core list of terms, where dc:type was properly controlled by dctype: classes and basisOfRecord was like other terms in Darwin Core insofar as it now had the recommendation to use a controlled vocabulary, specifically consisting of the Darwin Core classes, also reflected in the Examples given.

So, now we have dc:type to contain the Dublin Core type values (StillImage, MovingImage, Sound, PhysicalObject, Event, and Text) and dwc:basisOfRecord to contain the Darwin Core type values (PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, and Occurrence). No one has yet done anything where the basisOfRecord is any of the other existing Darwin Core classes, though a new class MaterialCitation is on the verge of being born.

Because basisOfRecord is not actually controlled by a controlled vocabulary as it once was (it is only a recommendation), the door is open for using other values. However, I would urge against doing so without good reason, and without creating new terms for the same thing, and without ultimately adding them to the standard. Here is how I would map the terms across all four categories.

I hope this helps somehow.

| ALA | GBIF | dwc:basisOfRecord | dc:type |
|---|---|---|---|
| PreservedSpecimen | PreservedSpecimen | PreservedSpecimen | PhysicalObject |
| FossilSpecimen | FossilSpecimen | FossilSpecimen | PhysicalObject |
| LivingSpecimen | LivingSpecimen | LivingSpecimen | PhysicalObject |
| MaterialSample | MaterialSample | MaterialSample | PhysicalObject |
| HumanObservation | HumanObservation | HumanObservation | dcmitype:Event |
| MachineObservation | MachineObservation | MachineObservation | dcmitype:Event |
| Sound | MachineObservation | MachineObservation | Sound |
| Image | MachineObservation | MachineObservation | StillImage |
| Video | MachineObservation | MachineObservation | MovingImage |
| Germplasm | LivingSpecimen | LivingSpecimen | PhysicalObject |
| GenomicDNA | MaterialSample | MaterialSample | PhysicalObject |
| EnvironmentalDNA | MaterialSample | MaterialSample | PhysicalObject |
| NomenclaturalChecklist | Taxon [1] | Taxon | None |
| RegionalChecklist | Taxon [1] | Occurrence [2] | dcmitype:Event |
| Literature | Literature | MaterialCitation | Text |
| | Observation | Occurrence | dcmitype:Event |
| | Unknown | Occurrence | dcmitype:Event |
| | | Occurrence | dcmitype:Event |
| | | Organism | PhysicalObject |
| | | dwc:Event [3] | dcmitype:Event |
| | | Location | None |
| | | GeologicalContext | None |
| | | Identification | None |
| | | ChronometricAge | None |

[1] As @mdoering said, the Taxon Core definition does not include a basisOfRecord term; the type is designated in the dataset metadata.
[2] Regional Checklists are actually data sets about the existence of Taxa at a particular place and time, and are therefore actually Occurrences, not Taxa.
[3] The Event Core definition does not include a basisOfRecord term; the type is designated in the dataset metadata. There is pressure from multiple fronts to include either an eventType in Darwin Core, or to include basisOfRecord in the Event Core definition and add new Classes that are more specific than dwc:Event.
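
For anyone wanting to apply the ALA rows of this mapping programmatically, a minimal sketch might look like the following. The values are copied from the table above (with the `dcmitype:` prefix dropped); the names `TermMapping`, `ALA_TO_DWC`, and `map_ala_term` are hypothetical and are not part of any GBIF or ALA library.

```python
from typing import NamedTuple, Optional


class TermMapping(NamedTuple):
    basis_of_record: str      # suggested dwc:basisOfRecord value
    dc_type: Optional[str]    # suggested dc:type value, None where the table says None


# ALA term -> suggested Darwin Core values, as given in the table above.
ALA_TO_DWC = {
    "PreservedSpecimen": TermMapping("PreservedSpecimen", "PhysicalObject"),
    "FossilSpecimen": TermMapping("FossilSpecimen", "PhysicalObject"),
    "LivingSpecimen": TermMapping("LivingSpecimen", "PhysicalObject"),
    "MaterialSample": TermMapping("MaterialSample", "PhysicalObject"),
    "HumanObservation": TermMapping("HumanObservation", "Event"),
    "MachineObservation": TermMapping("MachineObservation", "Event"),
    "Sound": TermMapping("MachineObservation", "Sound"),
    "Image": TermMapping("MachineObservation", "StillImage"),
    "Video": TermMapping("MachineObservation", "MovingImage"),
    "Germplasm": TermMapping("LivingSpecimen", "PhysicalObject"),
    "GenomicDNA": TermMapping("MaterialSample", "PhysicalObject"),
    "EnvironmentalDNA": TermMapping("MaterialSample", "PhysicalObject"),
    "NomenclaturalChecklist": TermMapping("Taxon", None),
    "RegionalChecklist": TermMapping("Occurrence", "Event"),
    "Literature": TermMapping("MaterialCitation", "Text"),
}

# Case-insensitive index, since source data may vary in capitalisation.
_LOOKUP = {key.lower(): value for key, value in ALA_TO_DWC.items()}


def map_ala_term(term: str) -> Optional[TermMapping]:
    """Return the suggested mapping for an ALA term, or None if unmapped."""
    return _LOOKUP.get(term.strip().lower())
```

Returning `None` for unmapped terms is a design choice of this sketch: it leaves the caller to decide what to do with them.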

@timrobertson100
Member

timrobertson100 commented Jun 1, 2021

Thank you for clarifying the expectations based on DwC @tucotuco.

Where does this leave you @adam-collins, please? I assume you have these BORs to allow users to filter in/out data from e.g. eDNA studies, which would not be accommodated by strictly following DwC. That is also why I started the dataset categorization thread.

Is the ALA already committed to the BOR values you have, please? (e.g. for backwards compatibility reasons)

@javier-molina

Hi @timrobertson100

At this stage we are evaluating changing some of our datasets (eDNA) to be accommodated within the existing fields and vocabularies, pretty much the table that tucotuco suggests above subject to the ability to search/filter. @peggynewman is working out a possible mapping strategy for it.

Our preference is not to add backwards compatibility, as that would deviate from the GBIF pipelines implementation already provided.

@elywallis

Chiming in with a couple of points specifically on eDNA
@timrobertson100 you're correct that this (for me anyway) is all about allowing users to quickly and easily filter in/out eDNA records.

  • currently the ALA BasisOfRecord environmentalDNA is being processed to UNKNOWN in GBIF (also GenomicDNA). Using UNKNOWN is very unhelpful. If the term environmentalDNA has to be processed out...
  • could this please be changed to align with @tucotuco's suggestion that environmentalDNA maps to MaterialSample?

Mapping to materialSample still creates issues for ALA - user perspective is:

  • currently 1.17M records have BoR = environmentalDNA
  • 175K records have BoR = materialSample
    These 175K records are (all but a few) for tissue samples provided by museums, intended eventually to pass through to GGBN
  • the 175K tissue samples have entries in Preparations e.g. skin, muscle, heart, lung etc
  • the 1.17M eDNA records have no value in Preparations
  • Preparations does not appear in the ALA facet list at the moment, so further dev work will need to be undertaken to allow users to do that two-stage faceting to get just tissue samples, or just eDNA records
  • currently it's easy to filter using the BoR terms to pick either the tissue samples or the eDNA in one go
  • if eDNA records are now mapped to BoR MaterialSample then it will be impossible to use a single filter/facet to distinguish between eDNA and tissue samples. From a user perspective this is a poor outcome, even if it might be a logical thing to choose to do from a standards perspective.
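
The two-stage faceting described in the list above can be sketched as follows. It assumes records are plain dictionaries with `basisOfRecord` and `preparations` keys, and it relies on the observation in this thread that tissue samples carry a Preparations value while eDNA records do not. The function names are hypothetical and this is not ALA code.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def is_material_sample(record: Record) -> bool:
    """Stage 1: facet on basisOfRecord."""
    return record.get("basisOfRecord") == "MaterialSample"


def has_preparations(record: Record) -> bool:
    """Stage 2: facet on whether Preparations has a non-empty value."""
    return bool((record.get("preparations") or "").strip())


def tissue_samples(records: Iterable[Record]) -> List[Record]:
    """MaterialSample records that list a preparation (e.g. skin, muscle)."""
    return [r for r in records if is_material_sample(r) and has_preparations(r)]


def edna_candidates(records: Iterable[Record]) -> List[Record]:
    """MaterialSample records without any preparation value.

    This is only a heuristic: a tissue sample with a missing Preparations
    value would be misclassified as eDNA.
    """
    return [r for r in records if is_material_sample(r) and not has_preparations(r)]
```

The heuristic needs two fields and an assumption about missing values, whereas a dedicated BoR value needs a single facet. That is the usability cost being raised here.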

I think that occurrence data derived from eDNA sampling may massively increase in the next few years. I also know that there are plenty of other discussions going on about BasisOfRecord. I would just like to advocate for users here and emphasise that any of the processing that's so far been proposed will make it more difficult for users to find the data they want, not easier.

@dagendresen

If MaterialSample maybe becomes a superclass for both eDNA and specimens (see tdwg/dwc#314), then a "child" class for eDNA samples (different from the potential "parent" term MaterialSample) might be desired?

@timrobertson100
Member

Thanks, @elywallis - we'll tackle the first request and you can watch progress here

I also feel we need to focus effort on those looking to extract subsets of data from indexes like GBIF and ALA. My worry is that today you need to understand the idiosyncrasies across several terms, while I suspect most just want a set of checkboxes to filter in/out data at broad categories (e.g. has preserved evidence and originates from an eDNA study). We also need to provide general metrics at that level. Could you please look at the dataset categories here to see if this aligns with the use cases you see, noting that these would be multivalue options? Dataset category is one way we might approach this, but we could also look to other options.
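
As a rough illustration of the checkbox-style filtering described above, the sketch below assumes each dataset carries a set of category labels (multivalue) that is applied to its occurrences at query time. The category labels, the `datasetKey` field, and the function name are all hypothetical and are not drawn from the GBIF API.

```python
from typing import Dict, FrozenSet, Iterable, List


def filter_by_dataset_category(
    occurrences: Iterable[Dict[str, str]],
    dataset_categories: Dict[str, FrozenSet[str]],
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> List[Dict[str, str]]:
    """Keep occurrences whose dataset has all `include` categories
    and none of the `exclude` categories."""
    required = set(include)
    forbidden = set(exclude)
    selected = []
    for occurrence in occurrences:
        # Categories live on the dataset and are carried over to its records.
        categories = dataset_categories.get(occurrence["datasetKey"], frozenset())
        if not required <= categories:
            continue
        if forbidden & categories:
            continue
        selected.append(occurrence)
    return selected
```

Because the categories sit on the dataset rather than in basisOfRecord, a filter like this would work the same way whatever value each record's BoR ends up with.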

@timrobertson100
Member

If MaterialSample maybe becomes a superclass for both eDNA and specimens (see tdwg/dwc#314), then a "child" class for eDNA samples (different from the potential "parent" term MaterialSample) might be desired?

Thanks @dagendresen. I don't disagree with this but I do have two thoughts

  1. The first is just a practical one: 314 is unlikely to be resolved anytime soon, will require a TDWG task group, and even then may remain controversial given the disagreement over lumping preserved specimens and environmental material together as the same "thing". While this plays out, the users suffer.
  2. BasisOfRecord is a confusing concept when used with Darwin Core Archives, as the rowType is what defines the class of concept, while basisOfRecord aims to do something similar, which is why people look to it to provide finer-grained classification.

My own feeling is that any change to BOR is really a band-aid to the real problem, which is that we force everything through an occurrence/event model in a star schema. This is why I think focusing on the nature of datasets is worthwhile, since it is orthogonal to whatever happens in DwC and allows us to focus on the filtering needs. It's also why we're starting to consider more expressive models in GBIF (more on that shortly).

@dagendresen

Thanks @timrobertson100

I imagine a possible MaterialSample Core (or similar), and that dwc:PreservedSpecimen (etc.) and eDNA samples might become organized as distinct "things" under dwc:MaterialSample

... and thus that a new filter category label for eDNA samples might perhaps want to avoid using the label from dwc:MaterialSample, given the risk that this might acquire a superclass meaning...??

MattBlissett added a commit to gbif/parsers that referenced this issue Jun 3, 2021
@tucotuco

tucotuco commented Jun 3, 2021 via email

@dagendresen

dagendresen commented Jun 3, 2021

@tucotuco I do not think that a MaterialSample "extension" will be a full solution. I think that a MaterialSample "core" (in a DwC-A format) would be needed... (Specimens are NOT Occurrences) Maybe this is the same as you are saying?

@tucotuco

tucotuco commented Jun 3, 2021

@dagendresen I understand and agree. There is confusion in terminology involved. That "extension" is for a MaterialSample Core. You can see that the Occurrence, Event, and Taxon Cores are all called extensions as well (https://tools.gbif.org/dwca-validator/extensions.do).
