(Re)design Discussion #18

Open
jamesaoverton opened this issue May 27, 2021 · 9 comments

@jamesaoverton
Member

RDFTab Design Steps

  1. IRI, blank node, CURIE, label
  2. RDF subject predicate object
  3. really subject predicate object datatype
  4. mention stanza column
  5. add graph column
  6. RDF-JSON for objects
  7. RDF reification column
  8. OWL annotation column
  9. OFS for objects

We use RDF and OWL and SQL. How can we best use them together?

The elements of RDF are IRIs, blank nodes, and literals. IRIs are long and inconvenient, so let's use prefixes and CURIEs. We'll wrap IRIs in angle brackets to distinguish them from CURIEs. We'll say that an "ID" is one of a CURIE, IRI, or blank node.
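
A minimal sketch of how these three kinds of ID could be told apart and how a CURIE could be expanded. The function names and the `_:` prefix for blank nodes (borrowed from Turtle and N-Triples) are assumptions for illustration, not part of the proposal.

```python
def classify_id(term):
    """Return 'iri', 'blank', or 'curie' for an ID string.

    Assumed conventions: IRIs are wrapped in angle brackets, blank nodes
    start with '_:', and any other string containing a colon is a CURIE.
    """
    if term.startswith("<") and term.endswith(">"):
        return "iri"
    if term.startswith("_:"):
        return "blank"
    if ":" in term:
        return "curie"
    raise ValueError(f"not an ID: {term!r}")


def expand(term, prefixes):
    """Expand a CURIE to a wrapped IRI; return IRIs and blank nodes unchanged."""
    if classify_id(term) != "curie":
        return term
    prefix, local = term.split(":", 1)
    return f"<{prefixes[prefix]}{local}>"
```

For example, `expand("ex:a", {"ex": "http://example.com/"})` yields the wrapped IRI for `ex:a`, while a blank node such as `_:b1` passes through untouched.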

RDF triples consist of a subject, predicate, and object. The subject can be an ID. The predicate can be a CURIE or IRI. The object can be one of four things: an ID, a plain literal, a typed literal, or a language tagged literal. We'll define a "datatype" to be:

  • "id" for an ID
  • "plain" for a plain literal
  • a CURIE or IRI for a typed literal
  • "@" and a language tag for a language tagged literal
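
The four cases above could be derived from a Turtle-style object token roughly as follows. This is a sketch only: `split_object` is a hypothetical helper, and the regular expression ignores string escapes and long-string syntax, so it is not a real Turtle parser.

```python
import re

# A quoted literal, optionally followed by ^^datatype or @language.
LITERAL = re.compile(
    r'^"(?P<content>.*)"(?:\^\^(?P<dt>\S+)|@(?P<lang>[A-Za-z0-9-]+))?$',
    re.DOTALL,
)


def split_object(token):
    """Split an object token into a (datatype, object) pair."""
    match = LITERAL.match(token)
    if match is None:
        return ("id", token)  # not quoted, so treat it as an ID
    if match.group("dt"):
        return (match.group("dt"), match.group("content"))  # typed literal
    if match.group("lang"):
        return ("@" + match.group("lang"), match.group("content"))  # language tagged
    return ("plain", match.group("content"))
```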

In most RDF serializations the datatype comes after the literal content, e.g. "123"^^xsd:integer. But it's often better to read the datatype before you read the literal content, and while the 'object' is often a longish string, the datatype is usually short. So I think it's more convenient to put the datatype column before the object column.

Now we can represent triples in a table with four columns: subject, predicate, datatype, object. These cells will never be NULL.
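
As a sketch, the four-column table might look like this in SQLite. The table name `statement` and the sample rows are made up for illustration; the `NOT NULL` constraints encode the rule that these cells are never NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE statement (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        datatype  TEXT NOT NULL,
        object    TEXT NOT NULL
    )
""")

# One row per datatype case: ID, plain, typed, and language tagged.
rows = [
    ("ex:foo", "rdf:type", "id", "owl:Class"),
    ("ex:foo", "rdfs:label", "plain", "foo"),
    ("ex:foo", "ex:count", "xsd:integer", "123"),
    ("ex:foo", "rdfs:comment", "@en", "an example"),
]
conn.executemany("INSERT INTO statement VALUES (?, ?, ?, ?)", rows)

labels = conn.execute(
    "SELECT object FROM statement WHERE predicate = 'rdfs:label'"
).fetchall()
```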

We often want all the triples associated with some named subject. This is useful for term browsers and term extraction. But RDF includes linked lists and reification, and OWL includes nested structures and annotation axioms, all of which use a lot of blank nodes.

One way to keep these blank node structures together is to add a 'stanza' column which names the top-level subject for a set of triples. You can then select all the rows for a given stanza and have a subgraph with most of the triples relevant to your term. We won't need the stanza column if we use JSON structures mentioned below.
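
A sketch of the stanza idea, assuming a hypothetical `statement` table with an extra `stanza` column and `_:b1` as a blank node label. Selecting by stanza returns the named subject's own triples together with the blank node triples that hang off it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE statement (
        stanza    TEXT NOT NULL,
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        datatype  TEXT NOT NULL,
        object    TEXT NOT NULL
    )
""")
conn.executemany("INSERT INTO statement VALUES (?, ?, ?, ?, ?)", [
    ("ex:foo", "ex:foo", "rdfs:subClassOf", "id", "_:b1"),
    ("ex:foo", "_:b1", "rdf:type", "id", "owl:Restriction"),
    ("ex:foo", "_:b1", "owl:onProperty", "id", "ex:part-of"),
    ("ex:foo", "_:b1", "owl:someValuesFrom", "id", "ex:bar"),
    ("ex:bar", "ex:bar", "rdfs:label", "plain", "bar"),
])


def stanza_rows(conn, stanza):
    """Return every (subject, predicate, datatype, object) row in a stanza."""
    return conn.execute(
        "SELECT subject, predicate, datatype, object "
        "FROM statement WHERE stanza = ?",
        (stanza,),
    ).fetchall()


foo_rows = stanza_rows(conn, "ex:foo")
```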

RDF also includes named graphs. We can add a graph column with an ID or "default". Note that OWL does not support named graphs.

Blank nodes can be difficult to work with. One of the many advantages of Turtle syntax is that it hides blank nodes behind [], {}, and () syntax. We can do something similar using simple JSON structures. Let's represent an RDF object by a JSON object. Where we would write { ex:o1 } in Turtle, let's write this JSON [{"object": "ex:o1", "datatype": "id"}] and call it an "object set". Where we would write [ ex:p1 ex:o1 ] in Turtle, let's write this JSON {"ex:p1": [{"object": "ex:o1", "datatype": "id"}]} and call it a "predicate map". When the 'object' column holds a predicate map, then the datatype column must be 'predicate-map'.
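
The two structures can be built with plain dictionaries and lists. In this sketch the helper names `obj` and `predicate_map` are hypothetical; only the resulting JSON shapes come from the description above.

```python
import json


def obj(value, datatype="id"):
    """Build the JSON object that represents one RDF object."""
    return {"object": value, "datatype": datatype}


def predicate_map(pairs):
    """Group (predicate, object) pairs into a predicate map."""
    result = {}
    for predicate, o in pairs:
        result.setdefault(predicate, []).append(o)
    return result


object_set = [obj("ex:o1")]                         # an object set with one member
pmap = predicate_map([("ex:p1", obj("ex:o1"))])    # Turtle: [ ex:p1 ex:o1 ]
cell = json.dumps(pmap)                             # text to store in the 'object' column
```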

As mentioned above, RDF includes reification, which allows us to make statements about a triple. We could eliminate more blank nodes by keeping the RDF reification in the row of the target triple. So we will add a "metadata" column which will contain NULL or a predicate map JSON structure capturing zero or more triples with this triple as the subject. This is similar to writing RDF*. OWL has a similar "annotation axiom" system. We'll add an "annotation" column and handle it in the same way.
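
To illustrate, here is one row carrying an annotation, plus a hypothetical helper that unpacks a predicate-map cell back into the statements it makes about the row's triple. The row is shown as a Python dictionary for brevity, and the sample values are invented.

```python
import json


def statements_about(row, column):
    """Yield (predicate, datatype, object) for each entry in a predicate-map cell."""
    cell = row[column]
    if cell is None:  # NULL means nothing is said about this triple
        return
    for predicate, objects in json.loads(cell).items():
        for o in objects:
            yield (predicate, o["datatype"], o["object"])


row = {
    "subject": "ex:foo",
    "predicate": "rdfs:label",
    "datatype": "plain",
    "object": "foo",
    "metadata": None,
    "annotation": json.dumps(
        {"rdfs:comment": [{"object": "checked by hand", "datatype": "plain"}]}
    ),
}
annotations = list(statements_about(row, "annotation"))
```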

Finally, the RDF representation of OWL is hard to read. OWL Functional Syntax (OFN) is relatively easy to read, but we don't want to have to parse and render it when we're working in SQL. So let's use a JSON array and move the OFN keyword to the first element of the array, like an S-expression in Lisp. For example, ["ObjectSomeValuesFrom","ex:part-of","ex:bar"]. We'll call this an "OWL Functional S-Expression" (OFS). When the 'object' column holds an OFS expression, the datatype column must be 'OFS'.
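
Since the array keeps the OFN keyword first, rendering an OFS expression back to OFN text is a short recursive walk. This is a sketch with a hypothetical function name; it covers only nested keyword expressions with ID arguments, not literals or annotations.

```python
def ofs_to_ofn(expr):
    """Render an OFS expression (nested lists of strings) as OFN text."""
    if isinstance(expr, str):
        return expr  # a CURIE, IRI, or blank node is written as-is
    keyword, *args = expr
    inner = " ".join(ofs_to_ofn(arg) for arg in args)
    return f"{keyword}({inner})"


simple = ofs_to_ofn(["ObjectSomeValuesFrom", "ex:part-of", "ex:bar"])
nested = ofs_to_ofn(
    ["ObjectIntersectionOf", "ex:a", ["ObjectSomeValuesFrom", "ex:part-of", "ex:bar"]]
)
```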

What have we got? Seven columns: graph, subject, predicate, datatype, object, metadata, annotation. The ability to represent anything encoded in RDF 1.1. Very few blank nodes to worry about. A convenient syntax for OWL. RDF and OWL and SQL living together happily.

What are we missing? Mainly a way to reason "inside" the database.

Easier RDF

https://github.com/w3c/EasierRDF discusses ways to make RDF easier. I think this design addresses some of those things, but if you see something that's within easy reach, please mention it.

Design Decisions

  • datatype column before object column?
  • settle on a convention for the datatype keywords
  • omit "datatype": "id" from JSON as redundant?
  • use one-letter names? gspdoma
  • extend OFS beyond OFN:
    • RDF lists
    • anything? extensible?
  • graphs as tables instead of a column?

Feedback appreciated from anyone, especially @cmungall @beckyjackson @lmcmicu @ckindermann.

@cmungall
Contributor

Remind me - for the proposal to use JSON objects in place of blank node syntax, would these still be subjects in other statements?

E.g.

s p o
["ObjectSomeValuesFrom","ex:part-of","ex:bar"] rdf:type owl:Restriction
["ObjectSomeValuesFrom","ex:part-of","ex:bar"] owl:onProperty ex:part-of
["ObjectSomeValuesFrom","ex:part-of","ex:bar"] owl:someValuesFrom ex:bar

I really like being able to use views or otherwise get at existentials without JSON parsing, e.g.

SELECT
  sc.subject AS subclass, svf.in_property, svf.filler
FROM
  rdfs_subclass_of AS sc
  JOIN some_value_from AS svf ON (sc.object = svf.subject)

If this is still planned then in principle I don't care about the structure of the blank node / anonymous expression column values. Although I would like to check performance implications. I assume these are all interned.

@cmungall
Contributor

Re: graphs as tables vs columns. The same question could be asked of predicates. There is utility in having a table for rdfs:subClassOf etc.

However, I strongly prefer keeping the base generic, and allowing people to either make views or derived tables for different slices according to their use case. In fact it should be straightforward to write procedures in either a normal programming language or in something like plpgsql that auto-create views and tables for graph-perspectives and predicate-perspectives (you could do class perspectives too, e.g. select * from owl_class, or select * from obi_nnnnnn). But having the base be generic keeps things maximally simple and flexible.
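
A rough sketch of auto-creating predicate-perspective views, here in Python over SQLite rather than plpgsql. The generic table name `statement`, the helper names, and the view naming scheme are all assumptions for illustration.

```python
import re
import sqlite3


def view_name(predicate):
    """Derive a SQL-friendly view name, e.g. 'rdfs:label' -> 'rdfs_label'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", predicate).strip("_").lower()


def create_predicate_views(conn, table="statement"):
    """Create one view per distinct predicate and return the view names."""
    query = f"SELECT DISTINCT predicate FROM {table} ORDER BY predicate"
    names = []
    for (predicate,) in conn.execute(query).fetchall():
        name = view_name(predicate)
        # SQLite views cannot take bound parameters, so escape and inline the value.
        literal = predicate.replace("'", "''")
        conn.execute(
            f'CREATE VIEW IF NOT EXISTS "{name}" AS '
            f"SELECT subject, datatype, object FROM {table} "
            f"WHERE predicate = '{literal}'"
        )
        names.append(name)
    return names


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statement (subject TEXT NOT NULL, predicate TEXT NOT NULL, "
    "datatype TEXT NOT NULL, object TEXT NOT NULL)"
)
conn.executemany("INSERT INTO statement VALUES (?, ?, ?, ?)", [
    ("ex:foo", "rdfs:label", "plain", "foo"),
    ("ex:foo", "rdfs:subClassOf", "id", "ex:bar"),
])
views = create_predicate_views(conn)
```

The base table stays generic, and each view is just a filtered slice of it, so views can be dropped and regenerated whenever the set of predicates changes.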

@jamesaoverton
Member Author

I have not been planning to use JSON in the subject column. GCIs would still have blank nodes as subjects, for example.

There may be edge cases I’m missing, but the basic idea is just to collapse self-contained blank node structures into JSON, which should be equivalent to Turtle’s syntactic sugar.

@jamesaoverton
Member Author

Maybe I'm not understanding. In any case, these are the Thick Triples Examples we're working on.

@jamesaoverton
Member Author

I think there are practical benefits to keeping each graph in a separate table, mainly keeping indexes small for small graphs. Then you would JOIN the tables you want to query, or define a view over them. I guess if you always wanted to query over all graphs then one table would be better.

Of course it's better to measure than to guess.
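
One way the table-per-graph layout could still offer a single queryable surface is a view that unions the tables and supplies the graph name as a constant. This sketch uses invented table, index, and graph names, and makes no claim about which layout performs better.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table, with its own small index, per graph.
for table in ("graph_a", "graph_b"):
    conn.execute(
        f"CREATE TABLE {table} (subject TEXT NOT NULL, predicate TEXT NOT NULL, "
        "datatype TEXT NOT NULL, object TEXT NOT NULL)"
    )
    conn.execute(f"CREATE INDEX {table}_subject ON {table} (subject)")

conn.execute("INSERT INTO graph_a VALUES ('ex:foo', 'rdfs:label', 'plain', 'foo')")
conn.execute("INSERT INTO graph_b VALUES ('ex:bar', 'rdfs:label', 'plain', 'bar')")

# A view that restores the 'graph' column for queries across graphs.
conn.execute("""
    CREATE VIEW all_graphs AS
        SELECT 'ex:graph-a' AS graph, * FROM graph_a
        UNION ALL
        SELECT 'ex:graph-b' AS graph, * FROM graph_b
""")

rows = conn.execute(
    "SELECT graph, subject FROM all_graphs ORDER BY graph"
).fetchall()
```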

@cmungall
Contributor

> I have not been planning to use JSON in the subject column. GCIs would still have blank nodes as subjects, for example.
>
> There may be edge cases I’m missing, but the basic idea is just to collapse self-contained blank node structures into JSON, which should be equivalent to Turtle’s syntactic sugar.

OK, so it sounds like if we wanted to query ?x subClassOf ?r some ?b (i.e., the same as https://cmungall.github.io/semantic-sql/OwlSubclassOfSomeValuesFrom/) we would need to parse JSON?

@jamesaoverton
Member Author

Yes, you'd use SQLite's JSON operators.
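
As a sketch of what that query could look like, the example below stores an OFS array in the object column and pulls out the property and filler with `json_extract`. It assumes a hypothetical `statement` table and an SQLite build that includes the JSON functions (built in by default since SQLite 3.38).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statement (subject TEXT NOT NULL, predicate TEXT NOT NULL, "
    "datatype TEXT NOT NULL, object TEXT NOT NULL)"
)
conn.executemany("INSERT INTO statement VALUES (?, ?, ?, ?)", [
    ("ex:foo", "rdfs:subClassOf", "OFS",
     '["ObjectSomeValuesFrom","ex:part-of","ex:bar"]'),
    ("ex:foo", "rdfs:subClassOf", "id", "ex:baz"),
])

# ?x subClassOf ?r some ?b, read straight out of the OFS array.
rows = conn.execute("""
    SELECT subject AS subclass,
           json_extract(object, '$[1]') AS in_property,
           json_extract(object, '$[2]') AS filler
    FROM statement
    WHERE predicate = 'rdfs:subClassOf'
      AND datatype = 'OFS'
      AND json_extract(object, '$[0]') = 'ObjectSomeValuesFrom'
""").fetchall()
```

Filtering on the datatype column first keeps `json_extract` away from rows whose object is not JSON. The same SELECT could be wrapped in a view to get a join-free table of existential restrictions.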

Thanks for the feedback. I think we just need to set up a thorough comparison for a whole bunch of use cases.

@lmcmicu
Collaborator

lmcmicu commented May 31, 2021

This all seems good to me. Remind me about the reason for one letter column names ... is it just to save space (I can't imagine it would save all that much), or is there another reason?

@jamesaoverton
Member Author

Brevity was the only reason for the one-letter names, but clarity is more important, so I think we'll stick with the one-word column names.

I started work on this https://github.com/ontodev/tooling-comparison which I hope will be useful to make some comparisons and guide some design decisions.
