RDFization Guide
The linked data that forms part of Bio2RDF subscribes to a simple set of modeling patterns that permit our different datasets to interoperate syntactically. The best practices presented here have been inspired by the Banff Manifesto, Tim Berners-Lee's design principles and the collective experience of our community. This document provides a clear set of guidelines to help Bio2RDF users and contributors understand how to use and create Bio2RDF-compatible linked data. Comments and suggestions are always welcome; join our mailing list to get more involved!
All URIs that form part of Bio2RDF linked data follow a normalized pattern.
Normalized URI: http://bio2rdf.org/public_database:private_identifier
Consider, for example, the UniProt record P26838. The proposed URI for this record is the same as its URL:
http://bio2rdf.org/uniprot:P26838
Blank nodes should be avoided like the plague.
Every resource must contain the following metadata with their corresponding predicates:
The first step of the RDFization process is to adopt a consistent identifier scheme. Data providers such as NCBI and EBI use unique identifiers to refer to the entities that they host. The linked data that forms part of Bio2RDF distinguishes between identifiers that refer to the original hosted entities and any auxiliary identifiers used in the creation of the linked data graph.
For every unique entity corresponding to a record, Bio2RDF identifiers are given by the following URI pattern:
http://bio2rdf.org/''namespace'':''identifier''
where the namespace is a short name listed in our dataset registry that uniquely identifies the source (dataset/database). The identifier is the (alpha)numeric string assigned to identify that entity. For instance, the gene identified by the number 15275 in the NCBI EntrezGene Database (namespace = geneid) has the following identifier:
<code>http://bio2rdf.org/''geneid'':''15275''</code>
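The URI pattern above can be sketched as a small Python helper. This is only an illustration, not part of any Bio2RDF tooling: the function name `bio2rdf_uri` and the non-empty check are assumptions made for this example.

```python
BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a Bio2RDF URI of the form http://bio2rdf.org/namespace:identifier.

    `namespace` is the short name from the dataset registry and
    `identifier` is the string the source database assigns to the entity.
    """
    namespace = namespace.strip()
    identifier = identifier.strip()
    # Hypothetical sanity check: both parts are required by the pattern.
    if not namespace or not identifier:
        raise ValueError("namespace and identifier must be non-empty")
    return f"{BIO2RDF_BASE}{namespace}:{identifier}"


if __name__ == "__main__":
    print(bio2rdf_uri("geneid", "15275"))
    print(bio2rdf_uri("uniprot", "P26838"))
```

Called with the namespace `geneid` and the identifier `15275`, the helper returns the URI shown above.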
The Bio2RDF URI scheme is applied not just to data entries, but also to the vocabulary (types and relations) used to describe these entries.
<code>http://bio2rdf.org/''namespace''_term:''term''</code>
For example, the gene identified by geneid:15275 is a kind of Gene, as defined by Entrez Gene.
<code>http://bio2rdf.org/''geneid''_term:''Gene''</code>
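A vocabulary URI can be built the same way, by appending `_term` to the namespace. The sketch below assumes nothing beyond the pattern given above; the function name `bio2rdf_term_uri` is hypothetical.

```python
def bio2rdf_term_uri(namespace: str, term: str) -> str:
    """Build a vocabulary URI of the form http://bio2rdf.org/namespace_term:term."""
    namespace = namespace.strip()
    term = term.strip()
    # Hypothetical sanity check: both parts are required by the pattern.
    if not namespace or not term:
        raise ValueError("namespace and term must be non-empty")
    return f"http://bio2rdf.org/{namespace}_term:{term}"


if __name__ == "__main__":
    print(bio2rdf_term_uri("geneid", "Gene"))
```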
Each resource should contain the following annotations:
<code>http://purl.org/dc/terms/title</code> a human readable title as it appears in the source data.
<code>http://purl.org/dc/terms/identifier</code> a string that contains the identifier using the following pattern <namespace>:<identifier>
<code>rdfs:label</code> a Bio2RDF-generated label containing the title followed by the identifier, in the form "title [ns:id]".
By convention, most RDF browsers use this label to render the name of a resource instead of its URI.
Taken together,
<code> geneid:15275 rdfs:label "Hk1 [geneid:15275]" ; dc:title "Hk1" ; dc:identifier "geneid:15275" ; rdf:type geneid_term:Gene . </code>
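The annotations above can be generated mechanically from a namespace, an identifier, a title and a type. The following Python sketch emits them in the same Turtle-style shorthand used in this guide. The function names are hypothetical, prefix declarations are omitted, and the literal escaping is a minimal assumption rather than a full Turtle serializer.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for use inside a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def describe_resource(namespace: str, identifier: str, title: str, type_term: str) -> str:
    """Return the label, title, identifier and type statements for one resource."""
    curie = f"{namespace}:{identifier}"   # dc:identifier value, e.g. geneid:15275
    label = f"{title} [{curie}]"          # rdfs:label pattern: "title [ns:id]"
    lines = [
        f'{curie} rdfs:label "{escape_literal(label)}" ;',
        f'    dc:title "{escape_literal(title)}" ;',
        f'    dc:identifier "{curie}" ;',
        f"    rdf:type {namespace}_term:{type_term} .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_resource("geneid", "15275", "Hk1", "Gene"))
```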
We recognize at least three kinds of entities found in biological information resources: physical entities, records and datasets.
1. Record
Records are information objects that contain a set of statements, primarily about the subject.
<code> namespace_record:identifier bio2rdf_term:has-primary-subject namespace:identifier . </code>
<code> namespace:identifier bio2rdf_term:is-described-by namespace_record:identifier . </code>
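The two statements above are inverses of each other, so they can be produced together. A minimal Python sketch follows, assuming triples are represented as plain tuples of prefixed names. The function name `record_links` is hypothetical.

```python
def record_links(namespace: str, identifier: str) -> list[tuple[str, str, str]]:
    """Return the pair of statements linking a record to its primary subject."""
    subject = f"{namespace}:{identifier}"          # the entity being described
    record = f"{namespace}_record:{identifier}"    # the information object about it
    return [
        (record, "bio2rdf_term:has-primary-subject", subject),
        (subject, "bio2rdf_term:is-described-by", record),
    ]


if __name__ == "__main__":
    for s, p, o in record_links("geneid", "15275"):
        print(f"{s} {p} {o} .")
```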
2. Dataset
Datasets are collections of records.
<code> bio2rdf_dataset:namespace bio2rdf_term:has-item namespace_record:identifier . </code>
Since datasets can be versioned, we identify each version with its own URI and link it to the parent dataset:
<code> bio2rdf_dataset:namespace.version dc:hasVersion "13" ; dc:partOf bio2rdf_dataset:namespace . </code>
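As an illustration, the versioning statements can be generated from a namespace and a version string. This Python sketch reads `namespace.version` in the pattern above as the namespace, a dot, and the version value, which is an assumption. The predicates are copied from the example as written, and the function name `dataset_version_triples` is hypothetical.

```python
def dataset_version_triples(namespace: str, version: str) -> list[tuple[str, str, str]]:
    """Return the statements describing one version of a dataset."""
    dataset = f"bio2rdf_dataset:{namespace}"
    # Assumption: the versioned dataset is named "<namespace>.<version>".
    versioned = f"{dataset}.{version}"
    return [
        (versioned, "dc:hasVersion", f'"{version}"'),  # version as a plain literal
        (versioned, "dc:partOf", dataset),             # link back to the parent dataset
    ]


if __name__ == "__main__":
    for s, p, o in dataset_version_triples("geneid", "13"):
        print(f"{s} {p} {o} .")
```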
This section is about how to create mappings from your dataset-specific vocabulary to SIO.