Adding @id for external URIs #74

Merged
merged 6 commits into from
May 28, 2020
162 changes: 142 additions & 20 deletions docs/1.1-DRAFT/index.md
Expand Up @@ -46,7 +46,7 @@
See <https://w3id.org/ro/crate> for further details about RO-Crate.

<small>
This specification is Copyright 2017-2019 University of Technology Sydney,
This specification is Copyright 2017-2020 University of Technology Sydney,
The University of Manchester UK and the RO-Crate contributors.
</small>

Expand Down Expand Up @@ -99,7 +99,7 @@ _Contextual Entity_: A JSON-LD representation of an entity associated with a _Da

_Linked Data_: A data structure where properties, types and resources are identified with _URIs_, which if retrieved over the Web, further describe or provide the identified property/type/resource.

URI: A _Uniform Resource Identifier_ as defined in [RFC 3986], for example `http://example.com/path/file.html` - commonly known as _URL_. In this document the term _URI_ includes _IRI_, which also permit international Unicode characters.
_URI_: A _Uniform Resource Identifier_ as defined in [RFC 3986], for example `http://example.com/path/file.html` - commonly known as _URL_. In this document the term _URI_ includes _IRI_, which also permit international Unicode characters.

_URI Path_: The relative _path_ element of an _URI_ as defined in [RFC3986 section 3.3](https://tools.ietf.org/html/rfc3986#section-3.3), e.g. `path/file.html`

Expand Down Expand Up @@ -152,7 +152,7 @@ A valid _RO-Crate JSON-LD_ graph MUST describe:

It is RECOMMENDED that any referenced _contextual entities_ are also described in the _RO-Crate Metadata File_ with the same identifier. Similarly it is RECOMMENDED that any _contextual entity_ in the _RO-Crate Metadata file_ is linked to from at least one of the other entities using the same identifier.

The appendix [RO-Crate JSON-LD](#ro-crate-json-ld) details the general structure of the JSON-LD that is expected in the _RO-Crate Metadata File_. In short, the rest of this specification describe the different types of entities that can be added as `{}` objects to the _RO-Crate JSON-LD_ `@graph` below:
The appendix [RO-Crate JSON-LD](#ro-crate-json-ld) details the general structure of the JSON-LD that is expected in the _RO-Crate Metadata File_. In short, the rest of this specification describes the different types of entities that can be added as `{}` objects to the _RO-Crate JSON-LD_ `@graph` array below:

```json
{ "@context": "https://w3id.org/ro/crate/1.1-DRAFT/context",
Expand Down Expand Up @@ -194,7 +194,7 @@ If present in the root directory, `ro-crate-preview.html` MUST:

### Payload files and directories

These are the actual files and directories that make up the dataset being described.
These are the actual files and directories that make up the dataset being described.

The base RO-Crate specification makes no assumptions about the presence of any specific files or folders beyond the reserved RO-Crate files described above. Payload files may appear directly in the _RO-Crate Root_ alongside the _RO-Crate Metadata File_, and/or appear in sub-directories of the _RO-Crate Root_. Each file and directory MAY be represented as Data Entities in the _RO-Crate Metadata File_.

Expand All @@ -203,12 +203,13 @@ The base RO-Crate specification makes no assumptions about the presence of any s

A minimal RO-Crate is a directory containing a single _RO-Crate Metadata File_.

At the basic level, an RO-Crate is a collection of files represented as a schema.org [Dataset], that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The _RO-Crate Metadata File_ describes the RO-Crate, and MUST be stored in the _RO-Crate Root_. Self-containment is a core principle of RO-Crate, i.e. that all _Dataset_ files and relevant metadata SHOULD, as far as possible, be contained by the RO-Crate, rather than referring to external resources. However the RO-Crate MAY also reference external resources which are stored or accessed separately, via URIs, e.g. because these cannot be included for practical or legal reasons.
At the basic level, an RO-Crate is a collection of files and resources represented as a schema.org [Dataset], that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The _RO-Crate Metadata File_ describes the RO-Crate, and MUST be stored in the _RO-Crate Root_.

It is important to note that the _RO-Crate Metadata File_ is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that _do_ have full inventories / manifest and integrity checks, e.g. by using checksums, such as [BagIt] and Oxford Common File Layout [OCFL Object]s.
While RO-Crate is well suited to describing a _Dataset_ as files and relevant metadata that are _contained_ by the RO-Crate, in the sense of living within the same root directory, RO-Crates can also reference external resources which are stored or accessed separately, via absolute URIs. This is particularly recommended where some resources cannot be co-hosted for practical or legal reasons, or if the RO-Crate itself is primarily web-based.

The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as [BagIt], [OCFL] and [git].
It is important to note that the _RO-Crate Metadata File_ is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing a sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that _do_ have full inventories / manifests and integrity checks, e.g. by using checksums, such as [BagIt] and Oxford Common File Layout [OCFL Object]s.

The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as [BagIt], [OCFL] and [git]. An RO-Crate can also be hosted on the web or mainly refer to web resources, although extra care to ensure persistence and consistency should be taken for archiving such RO-Crates.

## RO-Crate Metadata

Expand Down Expand Up @@ -253,7 +254,7 @@ Generally, the standard keys from [schema.org] should be used. However, RO-Crate
* `File` is mapped to <http://schema.org/MediaObject> which was chosen as a compromise as it has many of the properties that are needed to describe a generic file. Future versions of schema.org or a research data extension may re-define `File`.
* `Journal` is mapped to <http://schema.org/Periodical>.

Note that JSON-LD examples given on <http://schema.org/> website may not be in _flattened_ form; any nested entities in _RO-Crate JSON-LD_ SHOULD be described as separate contextual entities in the flat `@graph` list.
Note that JSON-LD examples given on the <http://schema.org/> website may not be in _flattened_ form; any nested entities in _RO-Crate JSON-LD_ SHOULD be described as separate contextual entities in the flat `@graph` list.

To simplify processing and avoid confusion with string values, the _RO-Crate JSON-LD Context_ requires URIs and entity references to be given in the form `"author": {"@id": "http://example.com/alice"}`, even where [schema.org] otherwise permits shorter forms like `"author": "http://example.com/alice"` for some properties.
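
As a minimal sketch of this rule, a bare URI string can be wrapped into the reference form as shown here. The helper name `as_reference` is hypothetical, and the caller is assumed to know which properties hold references rather than plain text such as `name`:

```python
def as_reference(value):
    """Wrap a bare URI string as a JSON-LD reference object.

    Values that are already objects (or anything else) are returned unchanged.
    """
    if isinstance(value, str):
        return {"@id": value}
    return value


# Illustrative entity using the short form that schema.org tolerates
entity = {"@id": "./", "author": "http://example.com/alice"}
entity["author"] = as_reference(entity["author"])
```

Applying this only to properties that are known to hold references avoids turning ordinary strings into identifiers by accident.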

Expand All @@ -280,6 +281,7 @@ The keys `RepositoryObject` and `RepositoryCollection` were chosen to avoid coll
From [Dublin Core Terms](http://purl.org/dc/terms/) RO-Crate uses:
- `conformsTo` mapped to <http://purl.org/dc/terms/conformsTo>


### Summary of Coverage

RO-Crate is simply a way to make metadata assertions about a set of files and folders that make up a _Dataset_. These assertions can be made at three levels:
Expand Down Expand Up @@ -358,7 +360,7 @@ the _Root Data Entity_ by following this algorithm:

1. For each entity in `@graph` array
2. ..if the `conformsTo` property is a URI that starts with `https://w3id.org/ro/crate/`
3. ....from this entity's `about` object keep the `@id` URI as variable _root_
3. ....from this entity's `about` object keep the `@id` URI as variable _root_
4. For each entity in `@graph` array
5. .. if the entity has an `@id` URI that matches _root_ return it
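
The steps above can be sketched in Python roughly as follows. This is an illustration rather than a reference implementation; the function name `find_root_entity` is an assumption, and `graph` is taken to be the already parsed `@graph` list:

```python
RO_CRATE_PREFIX = "https://w3id.org/ro/crate/"


def find_root_entity(graph):
    """Return the Root Data Entity from a parsed @graph list, or None."""
    root_id = None
    # Steps 1-3: locate the metadata file descriptor and remember what it is about
    for entity in graph:
        conforms_to = entity.get("conformsTo")
        if not isinstance(conforms_to, dict):
            continue
        if str(conforms_to.get("@id", "")).startswith(RO_CRATE_PREFIX):
            about = entity.get("about")
            if isinstance(about, dict):
                root_id = about.get("@id")
    if root_id is None:
        return None
    # Steps 4-5: return the entity whose @id matches the remembered root
    for entity in graph:
        if entity.get("@id") == root_id:
            return entity
    return None


example_graph = [
    {
        "@type": "CreativeWork",
        "@id": "ro-crate-metadata.jsonld",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1-DRAFT"},
        "about": {"@id": "./"},
    },
    {"@id": "./", "@type": "Dataset", "name": "Example dataset"},
]
root = find_root_entity(example_graph)
```

Because the root is found through `about` rather than by assuming `./`, the same lookup keeps working if the root `@id` is some other value ending in `/`.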

Expand All @@ -377,14 +379,13 @@ The _RO-Crate Metadata File Descriptor_ MAY contain information such as
licensing for the _RO-Crate Metadata File_ so metadata can be licensed
separately from Data.


The table below outlines the properties that the _Root Data Entity_ MUST have to be minimally valid and additionally highlights properties required to meet other common use-cases, including the minimum metadata necessary to mint a DataCite DOI:

#### Direct properties of the Root Data Entity

The _Root Data Entity_ MUST have the following properties:
* `@type`: MUST be [Dataset]
* `@id`: MUST be a a string of ‘./’
* `@id`: MUST end with `/` and SHOULD be the string `./`
* `name`: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
* `description`: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
* `datePublished`: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.
Expand Down Expand Up @@ -420,24 +421,28 @@ The following _RO-Crate Metadata File_ represents a minimal description of an _R
"description": "Palliative care planning for nursing home residents with advanced dementia ...",
"license": {"@id": "https://creativecommons.org/licenses/by-nc-sa/3.0/au/"}
},
{
{
"@id": "https://creativecommons.org/licenses/by-nc-sa/3.0/au/",
"@type": "CreativeWork",
"description": "This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Australia License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/au/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.",
"identifier": "https://creativecommons.org/licenses/by-nc-sa/3.0/au/",
"name": "Attribution-NonCommercial-ShareAlike 3.0 Australia (CC BY-NC-SA 3.0 AU)"
}
}
]
}
```

### Examples of referencing _Data Entities_ (files and folders) from the _Root Data Entity_
### Referencing _Data Entities_ (files and folders) from the _Root Data Entity_

Where files and folders are represented as _Data Entities_ in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the [hasPart] property. Directory hierarchies MAY be represented with nested Dataset _Data Entities_, or the Root Dataset MAY refer to files anywhere in the hierarchy using [hasPart].

_Data Entities_ representing files MUST be of `@type: File`, which is an RO-Crate alias for <http://schema.org/MediaObject>
_Data Entities_ representing files MUST be of `@type: File`, which is an RO-Crate alias for <http://schema.org/MediaObject>. The term _file_ here is liberal, and includes "downloadable" resources where `@id` is an absolute URI.

_Data Entities_ representing directories MUST be of `@type: Dataset`. The term _directory_ here includes HTTP file listings where `@id` is an absolute URI; however, "external" directories SHOULD have a programmatic listing of their content (e.g. another RO-Crate).

_Data Entities_ representing directories MUST be of `@type: Dataset`.
_Data Entities_ can also be of other types, for instance an online database. These SHOULD be of `@type: CreativeWork` and typically have an `@id` which is an absolute URI.

In all cases, `@type` MAY be an array in order to also specify a more specific type, e.g. `@type: [File, Workflow]`

_Note: as indicated above, there is no requirement to represent every file and folder in an RO-Crate as Data Entities in the RO-Crate JSON-LD._
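
The linking requirement in this section (every described file or folder is reachable from the root through `hasPart`, directly or indirectly) could be checked along these lines. The function name `unlinked_data_entities` is hypothetical, and for simplicity the sketch only treats `File` and `Dataset` entities as _Data Entities_, ignoring other types such as `CreativeWork`:

```python
def unlinked_data_entities(graph, root_id="./"):
    """List @id values of File/Dataset entities not reachable from the root via hasPart."""
    by_id = {e["@id"]: e for e in graph if "@id" in e}

    # Walk hasPart links starting at the root, following nested Datasets
    seen = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parts = by_id.get(current, {}).get("hasPart", [])
        if isinstance(parts, dict):  # tolerate a single reference instead of a list
            parts = [parts]
        for part in parts:
            if isinstance(part, dict) and "@id" in part:
                stack.append(part["@id"])

    def is_data_entity(entity):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        return "File" in types or "Dataset" in types

    return sorted(
        entity_id
        for entity_id, entity in by_id.items()
        if is_data_entity(entity) and entity_id not in seen
    )


sample_graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "a.csv"}, {"@id": "dir/"}]},
    {"@id": "a.csv", "@type": "File"},
    {"@id": "dir/", "@type": "Dataset", "hasPart": [{"@id": "dir/b.txt"}]},
    {"@id": "dir/b.txt", "@type": ["File", "Workflow"]},
    {"@id": "orphan.txt", "@type": "File"},
]
missing = unlinked_data_entities(sample_graph)
```

An empty result means every described file and directory is linked from the root; files that are simply not described in the graph are not reported, in line with the note above.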

Expand Down Expand Up @@ -558,21 +563,138 @@ If there is no web-accessible description for a file format it SHOULD be describ

### Core Metadata for _Data Entities_

The table below outlines the properties that Data Entities, when present, MUST have to be minimally valid .
The table below outlines the properties that Data Entities, when present, MUST have to be minimally valid.

#### File Data Entity

A `File` _Data Entity_ MUST have the following properties:
* `@type`: MUST be `File`, or an array where `File` is one of the values.
* `@id` MUST be a _URI Path_ relative to the _RO Crate root_

* `@id` MUST be either a _URI Path_ relative to the _RO Crate root_, or an absolute URI.

#### Directory File Entity

A `Dataset` (directory) _Data Entity_ MUST have the following properties:
* `@type` MUST be `Dataset` or an array where `Dataset` is one of the values.
* `@id` MUST be a _URI Path_ relative to the _RO Crate root; SHOULD end with `/`
* `@id` MUST be either a _URI Path_ relative to the _RO Crate root_, or an absolute URI. The id SHOULD end with `/`
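
A rough sketch of how a tool might apply the `@id` and `@type` rules above is given here. The helper names `is_absolute_uri` and `check_data_entity` are assumptions for illustration; the absolute/relative distinction relies on `urllib.parse.urlsplit` reporting a scheme, which approximates rather than fully validates RFC 3986:

```python
from urllib.parse import urlsplit


def is_absolute_uri(entity_id):
    """True if the identifier carries a URI scheme such as https."""
    return bool(urlsplit(entity_id).scheme)


def check_data_entity(entity):
    """Return a list of human-readable problems for a File or Dataset entity."""
    problems = []
    types = entity.get("@type", [])
    if isinstance(types, str):
        types = [types]
    entity_id = entity.get("@id", "")

    if "File" not in types and "Dataset" not in types:
        problems.append("@type MUST include File or Dataset")
    if not entity_id:
        problems.append("@id MUST be present")
    elif not is_absolute_uri(entity_id) and entity_id.startswith("/"):
        # A leading slash is not relative to the RO-Crate root
        problems.append("@id MUST be relative to the RO-Crate root or an absolute URI")
    if "Dataset" in types and entity_id and not entity_id.endswith("/"):
        problems.append("Dataset @id SHOULD end with '/'")
    return problems
```

For example, `check_data_entity({"@id": "data", "@type": "Dataset"})` reports only the missing trailing slash, which is a SHOULD rather than a MUST.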

### Web-based Data Entities

While one use-case of RO-Crates is to describe _files_ contained within the _RO-Crate root_ directory, RO-Crates can also gather resources on the web, identified using _absolute URIs_ instead of relative _URI paths_.

Using Web-based data entities can be important particularly where a file can't be included in the _RO-Crate root_ because of licensing concerns, large data sizes, privacy, or where it is desirable to link to the latest online version.

Example of an RO-Crate where some of the _File Data Entities_ are external to the _RO-Crate root_:

```json
{ "@context": "https://w3id.org/ro/crate/1.1-DRAFT/context",
"@graph": [
{
"@type": "CreativeWork",
"@id": "ro-crate-metadata.jsonld",
"conformsTo": {"@id": "https://w3id.org/ro/crate/1.1-DRAFT"},
"about": {"@id": "./"}
},
{
"@id": "./",
"@type": [
"Dataset"
],
"hasPart": [
{
"@id": "survey-responses-2019.csv"
},
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf"
}
]
},
{
"@id": "survey-responses-2019.csv",
"@type": "File",
"contentSize": "26452",
"description": "Survey responses",
"encodingFormat": "text/csv"
},
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
"@type": "File",
"contentSize": "310691",
"description": "RO-Crate specification",
"encodingFormat": "application/pdf"
}
]
}
```

Additional care SHOULD be taken to improve persistence and longevity of web resources included
in an RO-Crate, as they can be more difficult to archive or move along with the _RO-Crate root_, and
may change intentionally or unintentionally leaving the RO-Crate with partial or outdated information.

File Data Entities with an `@id` URI outside the _RO-Crate Root_ SHOULD at the time of RO-Crate creation be directly downloadable by a simple retrieval (e.g. HTTP GET), permitting redirections and HTTP authentication. For instance in the example above, <https://zenodo.org/record/3541888> and <https://doi.org/10.5281/zenodo.3541888> cannot be used as `@id` because retrieving these URLs gives an HTML landing page rather than the desired PDF indicated by `encodingFormat`.

As files on the web may change, the timestamp property `sdDatePublished` SHOULD be included to indicate when the absolute URL was accessed and when derived metadata like `encodingFormat` and `contentSize` were considered to be representative:

```json
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
"@type": "File",
"contentSize": "310691",
"encodingFormat": "application/pdf",
"sdDatePublished": "2020-04-09T13:09:21+01:00"
}
```
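
One way a tool might produce such an entity is sketched below. The function name `web_file_entity` and its parameters are hypothetical; the size and media type are assumed to have been observed by the caller at the time given in `accessed`:

```python
from datetime import datetime, timezone


def web_file_entity(url, content_size, encoding_format, accessed=None):
    """Build a File Data Entity for a web resource, stamped with its access time."""
    if accessed is None:
        # Default to the current time in UTC when no access time is supplied
        accessed = datetime.now(timezone.utc)
    return {
        "@id": url,
        "@type": "File",
        "contentSize": str(content_size),
        "encodingFormat": encoding_format,
        # ISO 8601 timestamp carrying an explicit UTC offset
        "sdDatePublished": accessed.isoformat(timespec="seconds"),
    }


pdf_entity = web_file_entity(
    "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
    310691,
    "application/pdf",
    accessed=datetime(2020, 4, 9, 12, 9, 21, tzinfo=timezone.utc),
)
```

Passing a timezone-aware `datetime` keeps the offset in the output, so the timestamp stays unambiguous when the crate is read elsewhere.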

#### Embedded data entities that are also on the web

File Data Entities may already have a corresponding web presence, for instance a landing page that describes the file, or persistent identifiers (e.g. a DOI) that resolve to an intermediate HTML page instead of the downloadable file directly.

These can be included for File Data Entities as additional metadata, regardless of whether the File is included in the _RO-Crate Root_ directory or exists on the Web, by using the properties:

* `identifier` for a formal identifier string such as a DOI
* `url` with a string URL for a *download* link (if not available, a download landing page) for this file
* `subjectOf` to a `CreativeWork` (or `WebPage`) that mentions this file or its content (but also other resources).
* `mainEntityOfPage` to a `CreativeWork` (or `WebPage`) that primarily describes this file (or its content)

```json
{
"@id": "survey-responses-2019.csv",
"@type": "File",
"encodingFormat": "text/csv",
"url": "http://example.com/downloads/2019/survey-responses-2019.csv",
"subjectOf": {"@id": "http://example.com/reports/2019/annual-survey.html"}
},
{
"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
"@type": "File",
"encodingFormat": "application/pdf",
"identifier": "https://doi.org/10.5281/zenodo.3541888",
"url": "https://zenodo.org/record/3541888"
}
```

#### Directories on the web; dataset distributions

A _Directory File Entity_ or `Dataset` identified by an absolute URL on the web can be harder to download than a `File` because it consists of multiple resources. It is RECOMMENDED that such directories have a complete listing of their content in `hasPart`, enabling download traversal.

Alternatively a common mechanism to provide downloads of a reasonably sized directory is as an archive file in formats like `.zip` or `.tar.gz`, described as a [DataDownload].

```json
{
"@id": "lots_of_little_files/",
"@type": "Dataset",
"description": "This directory contains many small files, that we're not going to describe in detail.",
"name": "Too many files",
"distribution": {"@id": "http://example.com/downloads/2020/lots_of_little_files.zip"}
},
{
"@id": "http://example.com/downloads/2020/lots_of_little_files.zip",
"@type": "DataDownload",
"encodingFormat": "application/zip",
"contentSize": "82818928"
}
```

Similarly, the _RO-Crate root_ entity may also provide a `distribution` URL, in which case the download SHOULD be an archive that contains the _RO-Crate Metadata file_.

In all cases, consumers should be aware that a `DataDownload` is a snapshot that may not reflect the current state of the `Dataset` or RO-Crate.
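
The two download routes described in this section could be combined by a consumer roughly as follows. This is only a sketch: the function name `download_plan` is hypothetical, and preferring the archive over the `hasPart` listing is a design assumption, not something the specification mandates:

```python
def download_plan(graph, dataset_id):
    """Pick URLs to fetch for a Dataset: its distribution archive, else its listed parts."""
    by_id = {e["@id"]: e for e in graph if "@id" in e}
    dataset = by_id.get(dataset_id, {})

    # Route 1: a single archive described as a DataDownload
    distribution = dataset.get("distribution")
    if isinstance(distribution, dict) and "@id" in distribution:
        return ("archive", [distribution["@id"]])

    # Route 2: fall back to the entries listed directly in hasPart
    parts = dataset.get("hasPart", [])
    if isinstance(parts, dict):
        parts = [parts]
    return ("parts", [p["@id"] for p in parts if isinstance(p, dict) and "@id" in p])


crate_graph = [
    {
        "@id": "lots_of_little_files/",
        "@type": "Dataset",
        "distribution": {"@id": "http://example.com/downloads/2020/lots_of_little_files.zip"},
    },
    {
        "@id": "http://example.com/downloads/2020/lots_of_little_files.zip",
        "@type": "DataDownload",
        "encodingFormat": "application/zip",
    },
]
plan = download_plan(crate_graph, "lots_of_little_files/")
```

Since the archive is a snapshot, a consumer that needs the current state may prefer the `hasPart` route even when a `distribution` is present.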

## Representing _Contextual Entities_
The _RO-Crate JSON-LD_ @graph SHOULD contain additional information about _Contextual Entities_ for the use of both humans (in `ro-crate-preview.html`) and machines (in `ro-crate-metadata.jsonld`). This also helps to maximise the extent to which an _RO-Crate_ is self-contained and self-describing, in that it reduces the need for the consumer of an RO-Crate to refer to external information which may change or become unavailable over time.
Expand Down