Implement New Schema management system #3761
Comments
I like the approach being suggested here, however I do have some questions surrounding the proposed
Update - after further discussion I am thinking more and more that at least for initial rollout it makes more sense to just have a few basic schema names -- dataset and distribution, plus anything like data-dictionary which is internal. The addition of "class" seems too complex -- if necessary we can find a way to "alias" these core types if people really want them to appear differently in the API paths. I also don't think it's necessary to require "literals" to be flagged in a schemas file; simply having a schema be of type "string" should be enough. Will update the above plan/diagram when possible.
First pass at this - get the yaml in place and start using to compose schema.
The way schemas are defined and discovered in DKAN is still quite fragile, and not conducive to customization. Flexibility is the major value proposition of DKAN's unorthodox approach to metadata storage, so this is technical debt that would be very useful to pay off, in terms of stability, developer experience and ease of adoption.
The current schema system infers a set of schemas by simply looking at any JSON files that exist in one of a few possible filesystem locations. There is no way to differentiate between core, required schemas and user-defined schemas -- if the user has a custom schema directory, all core schemas will be ignored. As well, there is an assumption that schema filenames will match their machine names in the system and the property name they are referenced from, and that there will necessarily be a dataset and distribution schema with a few expected properties. Finally, there is no differentiation between the actual metastore schemas and the form-specific UI schemas; instead, we again rely on filename-based conditionals to choose one or the other when appropriate.
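To make the failure mode above concrete, here is a minimal Python sketch of filename-based discovery. It is not DKAN's actual code (which is PHP): the directory layout, the `.ui.json` suffix, and the names `discover_schemas` and `demo` are all assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def discover_schemas(core_dir: Path, custom_dir: Optional[Path] = None) -> dict:
    """Infer schemas purely from JSON filenames (illustrative only).

    If a custom directory exists it replaces the core directory outright,
    so any core schema the user did not copy over silently disappears.
    """
    source = custom_dir if custom_dir and custom_dir.is_dir() else core_dir
    schemas = {"metastore": {}, "ui": {}}
    for path in sorted(source.glob("*.json")):
        document = json.loads(path.read_text())
        # Filename-based conditional: "<name>.ui.json" (an assumed suffix)
        # is a form schema, anything else is a metastore schema whose
        # machine name is taken from the filename.
        if path.name.endswith(".ui.json"):
            schemas["ui"][path.name[: -len(".ui.json")]] = document
        else:
            schemas["metastore"][path.stem] = document
    return schemas


def demo():
    """Return metastore schema names without and with a custom directory."""
    with tempfile.TemporaryDirectory() as tmp:
        core = Path(tmp, "core")
        custom = Path(tmp, "custom")
        core.mkdir()
        custom.mkdir()
        for name in ("dataset.json", "distribution.json", "dataset.ui.json"):
            (core / name).write_text('{"type": "object"}')
        # The user only wants to customize the dataset schema.
        (custom / "dataset.json").write_text('{"type": "object"}')
        before = sorted(discover_schemas(core)["metastore"])
        after = sorted(discover_schemas(core, custom)["metastore"])
        return before, after
```

In this sketch, `demo()` finds `dataset` and `distribution` when only the core directory is used, but only `dataset` once the custom directory exists, which is the kind of silent loss the proposal is meant to eliminate.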
After thinking through a number of approaches I'm proposing what I believe is the best solution to this problem. It is less of a break from the current paradigm than other ideas in this direction (implementing schemas as PHP classes, or as Drupal entity bundles), and has what I think is a better separation of concerns between PHP code, declarative definitions in YAML, and schema documents in JSON. It gives us a unified way to abstract the non-negotiable parts of schemas in the DKAN business logic and allow everything else to be fully override-able. This solves some inconsistencies in the referencing and storing logic between datasets, referenced items like distributions and keywords, and "resources" (files).
This is accomplished through a standardized set of reference types and schema classes. The latter are equivalent to what we have been calling schema "behaviors" in previous planning documents. They allow us to create universal patterns in our DKAN methods to replace some very brittle conditionals that will begin to break down as soon as anyone significantly alters the default DCAT-US based metastore schemas.
Overview
A new schema system would define schemas in .yaml files in a module's root directory, similar to Drupal services and routes. No configuration is guessed/inferred simply from the presence or lack of certain files in the filesystem. The intention is for schemas to be tightly coupled to specific, focused modules.
The current admin pages where reference and trigger fields are set will be removed, as these are all covered in the new schemas file.
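As a rough illustration of explicit registration, the sketch below models a registry that each module adds its definitions to, so nothing is inferred from the filesystem and a custom module cannot accidentally hide core schemas. It is written in Python for brevity, not as proposed DKAN code. The class name `SchemaRegistry`, the `file` key, and the module and path names are all hypothetical, and the plain dicts stand in for parsed schemas files.

```python
class SchemaRegistry:
    """Collects schema definitions that modules register explicitly."""

    def __init__(self):
        self._definitions = {}

    def register(self, module: str, definitions: dict) -> None:
        """Add one module's definitions (standing in for a parsed schemas file).

        How name collisions should be resolved is not specified in the
        proposal; this sketch simply lets a later registration replace an
        earlier one, and records which module the definition came from.
        """
        for name, definition in definitions.items():
            self._definitions[name] = {"module": module, **definition}

    def get(self, name: str) -> dict:
        """Return the definition registered under a schema name."""
        return self._definitions[name]

    def names(self) -> list:
        """Return all registered schema names, sorted."""
        return sorted(self._definitions)


registry = SchemaRegistry()
# Core definitions (hypothetical paths).
registry.register("metastore", {
    "dataset": {"file": "schemas/dataset.json"},
    "distribution": {"file": "schemas/distribution.json"},
})
# A site-specific module adds a schema without touching the core ones.
registry.register("my_module", {
    "project": {"file": "schemas/project.json"},
})
```

After both calls, `registry.names()` contains the two core schemas and the custom one, and `registry.get("project")["module"]` reports which module registered it.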
This will of course be a significant upgrade of DKAN and will involve these changes:
Structure of a DKAN schemas file
Similar to Drupal's .services.yml files, a .schemas.yml file in a module's root directory will register schemas with DKAN and make them available to Drupal\metastore\SchemaRetriever. The default DCAT-US schemas can be described like this:
Schema definitions
catalog, dataset, distribution, dictionary, and literal.
uuid supported, which injects the internal item uuid as a value. Other types may be made available in the future.
catalog class.
file: Possible future reference type if we replace the resource system with DKAN's core file entities, which has been proposed a few times recently.
file.
datastore_import, which will re-import the datastores associated with this item, is available.
Classes
Schema classes are not references to PHP classes, but map roughly to classes from DCAT and related RDF vocabularies. DKAN can work with essentially any metadata schema as long as it maps to these basic DCAT concepts.
catalog: Schema used for custom catalog endpoint, usually data.json. Equivalent to dcat:Catalog. There may be only one catalog record in the
dataset: The main dataset schema in the metastore. Equivalent to dcat:Dataset.
distribution: A specific representation of the dataset. Usually, a file. Equivalent to dcat:Distribution.
dictionary: A schema with column-level metadata, such as a table schema or a shared data dictionary. For now, this is implemented in core and there is no support for standards other than the Frictionless Table Schema. The closest equivalent would be dct:Standard, though in DKAN this class is much more specific.
literal: A "schema" whose items are unstructured values, such as strings. Literals will be wrapped in a JSON-LD structure. For instance, a keyword "health" would be stored as a simple string in the database, but retrieved from the API with the JSON {"@value": "health"}. Equivalent to rdfs:Literal and similar RDF literal types.
Schemas may map to other DCAT classes (for instance, foaf:Organization in the case of the default organization schema) but they play no special role in DKAN's architecture and do not need to be assigned a class in the schemas file.
For the moment, there can only be one schema defined for any class other than literal. In future iterations we may want to make this more flexible to address use-cases where multiple types of datasets or distributions are supported -- for example, a catalog might allow different schemas for geospatial or financial metadata.
Typical class relationships in a DKAN catalog:
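Returning to two rules stated in the class descriptions above (literals are wrapped as `{"@value": ...}` on retrieval, and only `literal` may be assigned to more than one schema), here is a small Python sketch of how they could be expressed. The function names and the shape of the name-to-class mapping are assumptions for illustration, not part of the proposal.

```python
from collections import Counter

# The five classes named in the proposal.
SCHEMA_CLASSES = {"catalog", "dataset", "distribution", "dictionary", "literal"}


def wrap_literal(stored_value):
    """Wrap a stored literal in the JSON-LD value object returned by the API."""
    return {"@value": stored_value}


def unwrap_literal(api_value):
    """Recover the plain value to store from its JSON-LD value object."""
    return api_value["@value"]


def check_classes(schema_classes: dict) -> list:
    """Return problems found in a {schema name: class or None} mapping.

    None means the schema is not assigned a class, which the proposal
    allows (for example, the default organization schema).
    """
    problems = []
    for name, cls in schema_classes.items():
        if cls is not None and cls not in SCHEMA_CLASSES:
            problems.append(f"{name}: unknown class '{cls}'")
    # Count how many schemas use each known class.
    counts = Counter(cls for cls in schema_classes.values() if cls in SCHEMA_CLASSES)
    for cls, count in sorted(counts.items()):
        # Only "literal" may be shared by several schemas.
        if cls != "literal" and count > 1:
            problems.append(f"class '{cls}' is assigned to {count} schemas")
    return problems
```

For example, `wrap_literal("health")` gives `{"@value": "health"}`, and a mapping with two schemas both assigned the `dataset` class would be reported as a problem, while several `literal` schemas would not.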
Related changes
Speccing out this work has provided some clarity on how we could better handle identifiers and references in the API; I will document this in a second issue.