
Connect metadata field names and blocks to (de facto) standard ontologies #2357

Closed
bencomp opened this issue Jul 15, 2015 · 6 comments

@bencomp
Contributor

bencomp commented Jul 15, 2015

For interoperability with the world outside Dataverse, each metadata field should have a public definition outside the code or database. While humans have a general understanding of the meaning of 'title', 'publication date' and 'description', for instance, machines need a pointer to a machine-actionable definition in order to understand the relationship between a thing and the values of the metadata fields.

This principle powers the Semantic Web and Linked Data and automated agents working on it. DANS would like to make the DataverseNL dataset metadata available as Linked Open Data in the future, but that is not the main use case. This touches on exporting metadata via OAI-PMH (#813) or as BibTeX (#1013) and other formats (#2116), embedding it in web pages as Schema.org (#2243) or in meta tags (#1393) and registering metadata for persistent identifiers (e.g. #24). It could help with API development (#899 links various ontologies already, #1430 could definitely benefit).

Ontologies for describing scholarly works have existed for a long time. The Dublin Core Terms are very general but widely used. DDI is well known, but a bit more geared towards specific types of scholarly research. Datasets can be described in the DataCite metadata schema, DCAT or other ontologies. For metadata blocks specific to certain dataverses (e.g. #2310) I'm sure there either is an ontology available or one could be created with little effort in the same way a metadata block is created. For general metadata, ontologies definitely exist. For specific metadata, you don't want to come up with fields (descriptors, properties) that only have meaning within Dataverse. (Let me throw in #27 for a link to general/specific metadata.)

The de facto way of publishing ontologies on the web in a machine-actionable format is using RDF Schema and/or OWL. Each property in an ontology gets a URI for identification and using that URI, its meaning and domain (and other aspects) can be described. These URIs and the domain should be used by Dataverse. The domain is important, because some fields don't describe the dataset itself, but related things like creator, publications that cite the dataset or specific files.
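
To make that concrete, here is a minimal sketch (assuming the rdflib Python library and Dublin Core Terms/DCAT as the external ontologies; the dataset URI is hypothetical) of what it means for field values to hang off public property URIs rather than Dataverse-internal names:

```python
# Minimal sketch, not Dataverse code: describe a dataset with public property
# URIs from Dublin Core Terms and DCAT. Requires the rdflib package.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
dataset = URIRef("https://example.org/dataset/doi:10.5072/FK2/EXAMPLE")  # hypothetical PID

g = Graph()
g.add((dataset, RDF.type, DCAT.Dataset))
# "title" and "description" are no longer just local field names: each value is
# attached to a machine-actionable property URI with a published definition.
g.add((dataset, DCTERMS.title, Literal("Example survey data")))
g.add((dataset, DCTERMS.description, Literal("Responses collected in 2015.")))

print(g.serialize(format="turtle"))
```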

Although I'm not a fan of the way the TSV format is used to specify metadata blocks (which are essentially ontologies) and controlled vocabularies, you wouldn't need to get away from it to include fields' URIs and domain.
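
Purely as an illustration (the column names "termURI" and "domain" below are hypothetical, not the current TSV schema), the existing TSV format could carry the extra information like this:

```python
# Hypothetical sketch: a metadata-block TSV extended with "termURI" and "domain"
# columns; the column names are assumptions, not the current Dataverse format.
import csv
import io

tsv = (
    "name\ttitle\ttermURI\tdomain\n"
    "title\tTitle\thttp://purl.org/dc/terms/title\thttp://www.w3.org/ns/dcat#Dataset\n"
)

for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    # Each field keeps its TSV definition and gains a public URI plus a domain.
    print(f"{row['name']} -> {row['termURI']} (domain: {row['domain']})")
```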

The use of ontologies (a.k.a. metadata schemas) for blocks and fields is complementary to using (de facto) standard controlled vocabularies for values, which I mentioned before (#947, #434).

(This is a follow-up from #2243 (comment) and discussions with @scolapasta and @pdurbin at IQSS in June 2015)

@mercecrosas
Member

👍 I agree this is very relevant, and we had planned to do this with @posixeleni, who created useful spreadsheets to map Dataverse metadata to several standards. I'm assigning this issue to her. In particular, we need metadata export support for:
General metadata:
- DataCite (mapping already in place)
- DCAT
- Dublin Core Terms
- RDF

Domain-specific metadata:
- DDI (social sciences)
- ISA-Tab (life sciences)
- Virtual Observatory standards
- and more to come

We'll also export metadata in the native JSON format.

@pdurbin
Member

pdurbin commented Jul 15, 2015

Once a given field such as "title" is flagged as being part of various standards (DataCite, DDI, etc.), it would be nice to be able to see which standards it's part of with an API call. Perhaps we could add it to this existing (but undocumented) API endpoint that I added back when I wanted more information about a given field for search/indexing purposes:

$ curl -s http://localhost:8080/api/admin/datasetfield/title | jq .

{
  "status": "OK",
  "data": {
    "name": "title",
    "id": 1,
    "title": "Title",
    "metadataBlock": "citation",
    "fieldType": "TEXT",
    "allowsMultiples": false,
    "hasParent": false,
    "parentAllowsMultiples": "N/A (no parent)",
    "solrFieldSearchable": "title",
    "solrFieldFacetable": "title_s",
    "isRequired": true
  }
}
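
A rough sketch of what that could look like from a client's point of view (assuming the endpoint above, plus hypothetical "standards"/"termURI" keys that do not exist in the response yet):

```python
# Sketch only: query the existing admin endpoint and look for hypothetical keys
# ("standards", "termURI") reporting which external standards the field maps to.
import requests

resp = requests.get("http://localhost:8080/api/admin/datasetfield/title")
field = resp.json()["data"]

print(field["name"])
print("standards:", field.get("standards", "not flagged yet"))
print("termURI:", field.get("termURI", "no URI yet"))
```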

Also, I wanted to make sure people reading this issue know about the nice metadata reference at http://guides.dataverse.org/en/latest/user/appendix.html

Thanks for creating this ticket, @bencomp. I plan to send it around to at least a few people who attended the API breakout meeting at the community meeting where we discussed some of this.

@bencomp
Contributor Author

bencomp commented Jul 15, 2015

I'm glad you appreciate this input. I'll happily discuss with you and @posixeleni to make sure we understand each other's goals and requirements.

@mcrosas let me stress that RDF is a model. You need one or more ontologies to take terms from in order to express properties of datasets (i.e. metadata). I like RDF a lot, even though I know it has limitations.
JSON is widely used for APIs, but interoperability with JSON APIs needs very good documentation of what the JSON represents.

@pdurbin as above, the value of JSON comes with documentation of inputs and outputs. If you want inline context, try JSON-LD :)
The metadata reference is a start - saying "Citation fields are compliant with DataCite and DDI" is not the same as "the metadata element 'title' has the same meaning as the element 'title' in the DataCite Metadata Schema v. 3.1".
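
For example (a hypothetical export fragment, not something Dataverse produces today), a JSON-LD @context is exactly the place where "title" would be declared to mean dcterms:title:

```python
# Hypothetical JSON-LD fragment built in Python: the @context maps local field
# names onto ontology property URIs, so "title" explicitly means dcterms:title.
import json

record = {
    "@context": {
        "title": "http://purl.org/dc/terms/title",
        "dateOfDeposit": "http://purl.org/dc/terms/dateSubmitted",
    },
    "@id": "https://example.org/dataset/doi:10.5072/FK2/EXAMPLE",  # hypothetical PID
    "title": "Example survey data",
    "dateOfDeposit": "2015-07-15",
}

print(json.dumps(record, indent=2))
```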

When I was at IQSS, @scolapasta explained the DatasetField and DatasetFieldType system, and I got the impression that it's pretty flexible in defining compound fields and such, but it only supports metadata fields that exist for a single dataset. This model is, with due respect, limiting. If, for example, the producerLogoURL that is used consistently across 100+ datasets changed (templates!), I would have to update all of those datasets, even though the change has nothing to do with the datasets themselves.
This comes back to (external) controlled vocabularies (mentioned in #947), which require a different metadata model and handling of external metadata that is (just) beyond the scope of this issue.

@mercecrosas
Member

@bencomp yes, I understand that RDF is a model, not the same as metadata standards such as Dublin Core Terms or the DataCite schema. It's still important to support it.

Thanks for all the thorough descriptions in this issue; it will be very useful as we work on this.

@pdurbin
Member

pdurbin commented Jun 29, 2017

I think this is a good idea, but this issue hasn't attracted many comments since it was opened two years ago. Closing. If anyone out there is still interested in this, please open a fresh issue and I'll try to remember to link it back to this one so the conversation we had here isn't lost.

@qqmyers
Member

qqmyers commented Nov 11, 2020

Reviewing issues and thought I'd add a note. Metadata blocks now support specifying the URI for terms (and/or a default base URI for the block), which is then used in the creation of the OAI-ORE metadata export file, which is in turn added to the Bags created for archiving. https://github.com/GlobalDataverseCommunityConsortium/dataverse/tree/IQSS/6497-semantic_api also has an updated metadata API that would allow submitting metadata in JSON-LD format (i.e. the same format as in the OAI-ORE export).
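
For anyone finding this later, a hedged sketch of what submitting JSON-LD to such an API might look like (the endpoint path, persistent identifier, and API token below are assumptions, not the confirmed API of the linked branch):

```python
# Sketch only: PUT a small JSON-LD payload to a semantic metadata endpoint.
# The URL, persistent identifier, and API key are placeholders/assumptions.
import requests

payload = {
    "@context": {"title": "http://purl.org/dc/terms/title"},
    "title": "Example survey data",
}

resp = requests.put(
    "http://localhost:8080/api/datasets/:persistentId/metadata",
    params={"persistentId": "doi:10.5072/FK2/EXAMPLE"},
    headers={"X-Dataverse-key": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"},
    json=payload,
)
print(resp.status_code, resp.text)
```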
