RO-Crate "license" field should be URI #456

Open
stain opened this issue Nov 3, 2020 · 3 comments
stain commented Nov 3, 2020

As pointed out in #183 (comment), SEEK treats the Workflow RO-Crate license field as a text field (e.g. "MIT"). However, both https://www.researchobject.org/ro-crate/1.1/contextual-entities.html#licensing-access-control-and-copyright and https://schema.org/license say it should be a URL, meaning {"@id": "https://spdx.org/licenses/MIT"} or similar.

The Workflow RO-Crate license field should be valid and ideally also consistent with the full URIs in the generated schema.org annotations.

stain commented Nov 3, 2020

The below tries to capture a Skype conversation on 2020-11-03 between @stain @fbacall @stuzart @alaninmcr

This would improve consistency with the schema.org generation, which currently uses https://github.com/seek4science/seek/blob/master/lib/seek/license.rb to parse https://github.com/seek4science/seek/blob/master/public/od_licenses.json, taken from https://licenses.opendefinition.org/licenses/groups/all.json (aka https://opendefinition.org/licenses/api/).

Although opendefinition.org and opensource.org use SPDX identifiers from https://spdx.org/, they carry additional information, such as whether a license is suitable for data or software, which is important in SEEK.

In Open Source, SPDX is considered the gold standard for license information, particularly the short SPDX identifiers embedded inside source code files as comments (thus allowing different files to have different licenses):

# SPDX-License-Identifier: GPL-2.0-or-later

For packages or repositories, SPDX also has something called SPDX documents, which are either a simple yet unwieldy long text file or an RDF document. In these we find spdx:licenseId, which we could in theory use from RO-Crate as a way to equalize across @ids.

A potential gotcha here is that this would also allow license expressions, e.g. for dual licensing or exceptions, so these don't always map directly onto URLs like https://spdx.org/licenses/MIT.
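To illustrate the gotcha, here is a minimal Python sketch that only builds an https://spdx.org/licenses/ URL when the value looks like a single SPDX short identifier, and gives up on expressions. The function name spdx_url and the regular expression are assumptions made for this example; it does not check that the identifier actually exists in the SPDX license list.

```python
import re
from typing import Optional

# A single SPDX short identifier uses only letters, digits, "." and "-".
# Anything with spaces, parentheses or a trailing "+" is treated as an expression.
_SIMPLE_ID = re.compile(r"[A-Za-z0-9.-]+")
_OPERATORS = {"AND", "OR", "WITH"}


def spdx_url(value: str) -> Optional[str]:
    """Return an spdx.org URL for a single identifier, or None for an expression."""
    value = value.strip()
    if _SIMPLE_ID.fullmatch(value) and value.upper() not in _OPERATORS:
        return "https://spdx.org/licenses/" + value
    return None


print(spdx_url("MIT"))
print(spdx_url("GPL-2.0-or-later"))
print(spdx_url("MIT OR AGPL-3.0+"))
```

The last call returns None because a dual-license expression has no single license URL to point at.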

Many of the licenses have their own URIs as well, so we could have many potential inconsistencies:

Some licenses, like BSD 3-Clause, are templates that need to be completed with their own copyright. Thus it would be insufficient to say the license for SEEK itself is https://spdx.org/licenses/BSD-3-Clause, as that does not include the copyright holders. Rather, it needs to be https://github.com/seek4science/seek/blob/master/BSD-LICENSE or even https://raw.githubusercontent.com/seek4science/seek/master/BSD-LICENSE, or ideally the equivalent for a particular tag. This becomes tricky for workflows, as linking to the LICENSE on master of a versioned workflow would be a moving target. In that case the LICENSE should perhaps also be inside the RO-Crate and just be @id: LICENSE. In all cases, of course, this "custom" license would be hard to categorise as BSD 3-Clause, say for faceted browsing.

https://github.com/spdx/license-list-data provides more Linked Data information from SPDX, which in theory could be combined with the https://licenses.opendefinition.org/licenses/groups/all.json data using the common SPDX identifier, for instance using the "licenseId" field https://github.com/spdx/license-list-data/blob/master/jsonld/BSD-3-Clause.jsonld#L6, which we could also use in the RO-Crate in combination with an arbitrary @id. However, we don't really want arbitrary license IDs!

In short - the string license in Workflow RO-Crate is currently used to match the SEEK database text field, so we don't have to deal with the many URI variants - but it is inconsistent with the schema.org export (which uses long URIs like https://opensource.org/licenses/MIT) and would need to be cleaned on RO-Crate import/export against Seek::License.

However, it is not something we can lift into the main RO-Crate spec https://www.researchobject.org/ro-crate/1.1/contextual-entities.html#licensing-access-control-and-copyright, as we can't refer to the Seek::License database - we need a third-party source of truth (e.g. the Open Definition JSON).

To get consistency, the cleaner would need some kind of map of URIs aliased to SPDX identifiers. We should still document a list of "known" licenses by their identifier. We need something that maps the alternative URLs to the Open Definition API, which is what our license dropdown list is informed by.
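As a rough illustration of such a map, the Python sketch below normalises a few URI variants and looks them up in a hand-written alias table. The table contents, the normalisation rules and the names KNOWN_ALIASES and spdx_id_for are all assumptions for this example; a real cleaner in SEEK would presumably derive the table from the Open Definition JSON via Seek::License.

```python
from typing import Optional

# Hypothetical alias table: several URIs for the same license map to one SPDX identifier.
KNOWN_ALIASES = {
    "https://spdx.org/licenses/MIT": "MIT",
    "https://opensource.org/licenses/MIT": "MIT",
    "https://spdx.org/licenses/CC-BY-4.0": "CC-BY-4.0",
    "https://creativecommons.org/licenses/by/4.0/": "CC-BY-4.0",
}


def _canonical(uri: str) -> str:
    """Reduce trivial variants: http vs https, trailing slash, .html suffix, letter case."""
    uri = uri.strip()
    if uri.startswith("http://"):
        uri = "https://" + uri[len("http://"):]
    if uri.endswith(".html"):
        uri = uri[: -len(".html")]
    return uri.rstrip("/").lower()


_LOOKUP = {_canonical(uri): spdx_id for uri, spdx_id in KNOWN_ALIASES.items()}
_KNOWN_IDS = set(KNOWN_ALIASES.values())


def spdx_id_for(value: str) -> Optional[str]:
    """Return the SPDX identifier for a known identifier or URI alias, else None."""
    value = value.strip()
    if value in _KNOWN_IDS:
        return value
    return _LOOKUP.get(_canonical(value))


print(spdx_id_for("http://creativecommons.org/licenses/by/4.0"))
print(spdx_id_for("https://example.org/my-custom-license"))
```

Unknown URIs, such as a repository-specific LICENSE file, return None and would need separate handling.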

@stuzart says that as long as we do the lookups through the Seek::License class, we can change the underlying JSON, or use the SPDX Ruby gem, just by changing things in one place and being consistent.

stain commented Nov 3, 2020

In Ruby land, https://www.rubydoc.info/gems/spdx/3.0.1 can parse expressions like MIT OR AGPL-3.0+, but nothing else it seems.

https://www.rubydoc.info/gems/spdx-licenses/1.2.0 seems to be able to look up licenses from that JSON file, but does not provide any information except whether the license exists and is OSI compliant.

Our own https://github.com/seek4science/seek/blob/master/lib/seek/license.rb looks up and exposes elements from the Open Definition JSON.

stain commented Nov 3, 2020

Perhaps in RO-Crate http://schema.org/identifier can be used to give the SPDX value. As SPDX is "industry standard", it could go in directly as a string, with SPDX being the implied scheme:

{ "@id": "workflow.cwl",
  "@type": "SoftwareSourceCode",
  "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"}
},
{
  "@id": "https://creativecommons.org/licenses/by/4.0/",
  "@type": "CreativeWork",
  "name": "CC BY 4.0",
  "description": "Creative Commons Attribution 4.0 International License",
  "identifier": "CC-BY-4.0"
}

A more scoped one using PropertyValue identifiers could also cover license expressions for a local file, AND say that they are SPDX-License-Identifier expressions. However, it becomes a bit cumbersome if it also needs to be used for simple cases:

{ "@id": "dual-licensed.py",
  "@type": "SoftwareSourceCode",
  "license": {"@id": "LICENSE.txt"}
},
{ "@id": "LICENSE.txt",
  "@type": "CreativeWork",
  "name": "MIT or AGPL 3.0 (or later)",
  "description": "Dual-licensed as MIT or AGPL 3.0",
  "identifier": {"@id": "_:MIT-OR-AGPL"}
},
{
  "@id": "_:MIT-OR-AGPL",
  "@type": "PropertyValue",
  "name": "SPDX-License-Identifier",
  "propertyID": "https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/",
  "value": "MIT OR AGPL-3.0+"
}
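For what it's worth, the PropertyValue variant could be generated rather than written by hand. The Python sketch below builds the three entities for a file, linking them with {"@id": ...} references. The function name license_entities and the way the blank node id is derived from the expression are assumptions for this example, not part of RO-Crate or SEEK.

```python
import json
import re

# propertyID as used in the PropertyValue example in this comment.
SPDX_PROPERTY_ID = (
    "https://spdx.github.io/spdx-spec/"
    "appendix-V-using-SPDX-short-identifiers-in-source-files/"
)


def license_entities(file_id: str, license_id: str, name: str, spdx_expression: str) -> list:
    """Build file, license and PropertyValue entities for an SPDX expression."""
    # Hypothetical blank node naming: squash non-alphanumeric runs into "-".
    slug = re.sub(r"[^A-Za-z0-9]+", "-", spdx_expression).strip("-")
    value_id = "_:" + slug
    return [
        {"@id": file_id, "@type": "SoftwareSourceCode", "license": {"@id": license_id}},
        {
            "@id": license_id,
            "@type": "CreativeWork",
            "name": name,
            "identifier": {"@id": value_id},
        },
        {
            "@id": value_id,
            "@type": "PropertyValue",
            "name": "SPDX-License-Identifier",
            "propertyID": SPDX_PROPERTY_ID,
            "value": spdx_expression,
        },
    ]


entities = license_entities(
    "dual-licensed.py", "LICENSE.txt", "MIT or AGPL 3.0 (or later)", "MIT OR AGPL-3.0+"
)
print(json.dumps(entities, indent=2))
```

The returned list is meant to be appended to the crate's @graph alongside the other entities.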

stuzart added a commit that referenced this issue Apr 29, 2021
If logged out, takes you directly to the github issue tracker (configurable link).
If logged in, takes you to a page that directs the user to Github by preference, but also provides the feedback form as an alternative