This repository has been archived by the owner on Jan 27, 2022. It is now read-only.

understanding stencila/schema architecture #62

Closed
100ideas opened this issue May 15, 2019 · 1 comment

Comments

@100ideas
Collaborator

Hey yall, I'm working on an experimental schema-autosuggest frontend interface that helps a user consume, transform, mashup, & remap data tables. I've been reviewing the source code of stencila & stencila/schema to see how I might implement something that will be as broadly useful as possible.

I am really torn about json-schema & json-ld. Ultimately I want to use both, like you are, to ensure data workflows can be serialized and reused in an unambiguous & repeatable manner. But my design intent is to allow the user to be as initially unconstrained as possible as they create and structure hierarchies of tabular data, nudging them towards standard / published schemas without requiring them to restructure their raw data before doing anything else. What I need to do - in the frontend experience - is help users explore various ways of mashing up, overriding, and fragmenting existing json-schema as they build a data processing workflow, then reconcile the resulting definitions with the preexisting ones in the most parsimonious (least redundant) way.

It seems like stencila/schema has an architecture designed to support modular, hierarchical, reusable schema definitions, and I'd like to know more about how this particular approach was developed and how the major parts of it work together.

I can roughly see that schemas are initially defined in lightweight yaml - that's nice - and then compiled into a hierarchical json-schema (and ts-definitions) at runtime. In particular schemas can [extend](https://github.com/stencila/schema/blob/master/CONTRIBUTING.md#the-extends-keyword) other schemas, setting up a class-like inheritance mechanism. Why did you decide to design it this way, and would it have been possible to use native json-schema $refs or json hyper-schema links instead? Were those options too verbose / user-unfriendly?

(I initially posted this over in the stencila gitter room but it got a bit involved so I think this might be a better place for it)

@nokome
Member

nokome commented Jun 14, 2019

@100ideas apologies for taking so long to respond to this. I got caught up with various other things and forgot to come back to it.

> an experimental schema-autosuggest frontend interface that helps a user consume, transform, mashup, & remap data tables.

That sounds super cool, and useful! I'd be interested in finding out more.

> I am really torn about json-schema & json-ld. Ultimately I want to use both, like you are, ...

I initially found it difficult to reconcile how these two technologies fit and complement each other. Personally, I think that schema.org is a poor choice of name, and that it adds to the confusion. Because JSON-LD is closely aligned to schema.org (although, of course, one can use another vocab), it's easy for ppl to think that JSON-LD is an alternative to JSON-Schema. It became clearer in my mind how they complemented each other when I started thinking about JSON-Schema being for data modelling and validation, JSON-LD as a mechanism for mapping between vocabularies, and schema.org as one possible vocabulary. Maybe that is more obvious at the outset for some people but it wasn't for me.

> It seems like stencila/schema has an architecture designed to support modular, hierarchical, reusable schema definitions, and I'd like to know more about how this particular approach was developed and how the major parts of it work together.

The approach evolved. It began with us having an implicit, poorly documented schema for transferring data (e.g. tabular, column-wise data) between languages. We realised that we were probably reinventing the wheel there, so we looked at other schemas such as Avro. We were also documenting our code execution API using OpenAPI (which uses JSON-Schema). We then realised that using JSON-Schema for everything in an executable document (like Jupyter Notebooks but with finer granularity, e.g. Heading, Paragraph etc nodes instead of Markdown strings) was a useful approach to model, validate and document our API.

We originally took a Typescript-first approach and were defining schemas using Typescript classes with decorators on properties and then generating JSON Schema from them. That was fine but we decided to invert the relationship to be language agnostic, and more data modelling and validation focused.

We use a custom @id property to generate the semantic mapping (i.e. the JSON-LD @context) between types and properties in the JSON-Schema definitions and types and properties in schema.org and other vocabularies. We try to use schema.org as much as possible, so many of the schema definitions here are simply a JSON-LD interpretation of a schema.org type (e.g. CreativeWork).
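To make that mapping a bit more concrete, here is a minimal sketch of how a JSON-LD `@context` could be derived from `@id` annotations. It is not the actual stencila/schema build code: the `schemas` dictionary, the `build_context` function, the `stencila` prefix and its `example.org` URL are all assumptions made up for illustration.

```python
# Hypothetical, heavily simplified schema definitions. The real
# stencila/schema YAML files are richer than this.
schemas = {
    "CreativeWork": {
        "@id": "schema:CreativeWork",
        "properties": {
            "authors": {"@id": "schema:author"},
            "title": {"@id": "schema:headline"},
        },
    },
    "Datatable": {
        "@id": "stencila:Datatable",
        "properties": {
            "columns": {"@id": "stencila:columns"},
        },
    },
}


def build_context(schemas):
    """Collect every @id on a type or property into one JSON-LD @context."""
    context = {
        "schema": "http://schema.org/",
        # Placeholder namespace, assumed for this sketch only.
        "stencila": "https://example.org/stencila/",
    }
    for type_name, schema in schemas.items():
        if "@id" in schema:
            context[type_name] = {"@id": schema["@id"]}
        for prop_name, prop in schema.get("properties", {}).items():
            if "@id" in prop:
                context[prop_name] = {"@id": prop["@id"]}
    return {"@context": context}


context = build_context(schemas)["@context"]
```

With this, a local name such as `authors` in a document resolves to `schema:author`, while the JSON-Schema side stays responsible for validating the shape of the data.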

> I can roughly see that schemas are initially defined in lightweight yaml - that's nice - and then compiled into a hierarchical json-schema (and ts-definitions) at runtime. In particular schemas can extend other schemas, setting up a class-like inheritance mechanism. Why did you decide to design it this way, and would it have been possible to use native json-schema $refs or json hyper-schema links instead? Were those options too verbose / user-unfriendly?

The decision to use a custom $extends property was not taken lightly. JSON-Schema doesn't (yet) support inheritance. There are ways to get close using allOf. We tried this approach, but found that the generated Typescript definitions were not very useful, largely because of the inability to set "additionalProperties": false (see comment linked to above).

Also, we want to make it really easy for people to understand and contribute to the schemas. We have found that using $extends is a more intuitive, more concise, and less error prone, way to represent the relationships between schemas than using allOf.
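As a rough illustration of the idea, the sketch below flattens an `$extends` chain into a plain JSON-Schema object. It is an assumption about how such a step might look, not the project's actual implementation: the `definitions` dictionary and the `resolve` function are hypothetical. Because parent and child properties end up in one object, `"additionalProperties": false` can be set on the result, which is awkward with `allOf` since each subschema there is evaluated on its own.

```python
import copy

# Hypothetical definitions using a custom "$extends" key.
definitions = {
    "Thing": {
        "properties": {"name": {"type": "string"}},
    },
    "CreativeWork": {
        "$extends": "Thing",
        "properties": {"authors": {"type": "array"}},
    },
    "Article": {
        "$extends": "CreativeWork",
        "properties": {"title": {"type": "string"}},
        "required": ["title"],
    },
}


def resolve(name, definitions):
    """Merge inherited properties into one flat JSON-Schema object."""
    schema = definitions[name]
    parent_name = schema.get("$extends")
    if parent_name is None:
        properties, required = {}, []
    else:
        parent = resolve(parent_name, definitions)
        properties = copy.deepcopy(parent["properties"])
        required = list(parent["required"])
    # Child properties override inherited ones with the same name.
    properties.update(copy.deepcopy(schema.get("properties", {})))
    for prop in schema.get("required", []):
        if prop not in required:
            required.append(prop)
    return {
        "type": "object",
        "properties": properties,
        "required": required,
        # Safe here because all inherited properties are listed explicitly.
        "additionalProperties": False,
    }


article = resolve("Article", definitions)
```

Here `article` lists `name`, `authors` and `title` side by side, so a generator working from it sees a single closed object type rather than an intersection of open ones.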

So in summary, the YAML-with-custom-extensions approach provides a lighter-weight, less intimidating way to write schemas (which ultimately get translated to JSON-Schema documents; analogous to authoring Markdown that gets translated to HTML, I suppose).

Hope that is of use. Again, apologies for the slow response.

@nokome nokome closed this as completed Jun 20, 2019