New Operation: report #205
Comments
Proposed name for this: the output product could be called a "report card" |
Should this command also produce standardized exports of things like tables of all classes and their labels, or is this overloading? Would that belong in |
I've used SPARQL for this. It's important to distinguish internal from external (imported) terms. I'm fine with including this in |
Some suggestions on potential output. For checks, these could be treated as informative for now. Working with obo-ops and the OBO community we could come up with varying levels of stringency. For some we can imagine different profiles, with the profile (e.g. obo-basic) declared in the header.

general reporting

checks
|
Currently the GO reports give a lot of reference violations deep in the import chain. We know this is the case, so we would rather have an option to suppress reporting reference violations. @dougli1sqrd |
An example of how this is working (all on #231 at the moment). Run report on an ontology:
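Presumably something like the following (a guess at the invocation; the draft's actual command name and flags may differ):

```sh
robot report --input go-edit.obo --output report.yaml
```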
@jamesaoverton - so far here is a sample of the report. We use the JSON subset of YAML so JSON is trivial too if you want the verbosity. It should also be easy to make nice HTML or Markdown from this, but this is a separate concern. report is in sections, first reference issues:
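The sample itself is not reproduced above; given the description (YAML, in sections, starting with reference issues), it presumably looked something like this sketch, with guessed field names and placeholder IDs:

```yaml
reference_issues:
  - subject: GO:0000001        # placeholder ID
    problem: dangling_reference
    referenced: GO:0000009
  - subject: GO:0000002
    problem: reference_to_obsolete_class
    referenced: GO:0000005
```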
|
Regarding The key question is chaining, I think. Do you (the user) want to run a chain of commands ending with I think an |
I tried this on OBI Core. It ran quickly and seemed to work correctly. Some thoughts and comments:
- Having a machine-readable report is good. Maybe we can have a simple "ignore" sheet.
- If we use IRIs everywhere then only a few of us can just read the reports without another layer. I got items like this:
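(The item is not reproduced above; per the next sentence it flagged BFO "entity" for a missing definition, so it was presumably along these lines, with a bare IRI and no label:)

```yaml
- rule: missing_definition    # guessed rule name
  subject: http://purl.obolibrary.org/obo/BFO_0000001
```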
It's true that BFO entity does not have a definition, but that's not my fault -- it only has an elucidation. I think that each rule should have a URL and documentation that explains what the problem is and how to fix it. |
At the top of the issue Chris presented two alternatives: declarative and procedural. PR #231 takes a procedural approach, with the rules written in Java. I'd prefer a declarative approach using SPARQL.

I have a bunch of reasons for preferring a declarative approach, but perhaps the most persuasive is that we want a low barrier of entry for reading and writing rules, so that we can get as many developers as possible using and contributing to the rule set. I think that more of our developers can work with SPARQL than Java. Another consideration is that SPARQL is much easier to re-use in other contexts than the Java in #231.

My biggest worry is that SPARQL performance will be much worse than Java performance.

@cmungall: Is it OK for @rctauber and me to build a SPARQL-based alternative to #231 so that we can compare the two approaches? I think we can do it this week.
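For a flavour of what a declarative rule can look like (an illustrative sketch, not a query from this thread): a check for classes lacking a definition (IAO:0000115) is only a few lines of SPARQL:

```sparql
PREFIX obo:  <http://purl.obolibrary.org/obo/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

# named classes with no definition annotation
SELECT DISTINCT ?cls WHERE {
  ?cls a owl:Class .
  FILTER NOT EXISTS { ?cls obo:IAO_0000115 ?def }
  FILTER (!isBlank(?cls))
}
```
|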
Simple cardinality checks should work fine in SPARQL. Some of the logic may be dependent on configuration, which is awkward. E.g. if the ontology is in the GO lineage, the definition axioms should be annotated; if in the OBI lineage, then a definition_editor field should be present. I suppose we can encourage this configuration to go as metadata in the ontology header so this is introspectable in the query, but I'd like this to work out of the box for all ontologies.

@dougli1sqrd and I started out going a pure SPARQL route for everything, but things like the obsolete reference checks are very difficult. Using extensions seems to defeat the purpose. SHACL/ShEx may be promising for some kinds of checks but this doesn't seem advanced enough at this stage.

I agree in principle that SPARQL is more declarative; I'm not totally sure that we'd attract more developers. I know @drseb and @pnrobinson, for example, have already written lots of Java code for procedural checks of HPO and are comfortable with this. Anyway, I think it's OK to have an alternative for a subset of checks. |
IRIs - yes, I'm being lazy and using generic Jackson pass-throughs to turn the Java objects into strings for now; none of my users would use this without the labels. We need to also make sure that MIREOTed entities are not checked - currently it assumes an import strategy. The logic here may get quite complex for what to check and when. As non-declarative as Java is, having the ability to define reusable methods/subroutines will help a lot here... |
We should also look at OOPS! But many of these patterns are under- or over-specified for us. E.g., responding to some individual OOPS pitfalls:
- Most OBOs don't include instances; these are assigned separately.
- We have a principled check for this.
- We should explicitly check for OBO naming conventions.
- Oops, every OBO fails this!
- We agree with this one! |
See drseb/hpo-obo-qc#1 and owlcs/owlapi#317. This is part of the robot report checks; see #205. |
@jamesaoverton @rctauber @dougli1sqrd @kltm in terms of documenting the checks, what about option A:
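Roughly, a single structured metadata record per rule (an illustrative sketch; the field names are guesses):

```yaml
- id: missing_definition
  severity: warning
  description: Every class should have a definition (IAO:0000115).
  purl: http://purl.obolibrary.org/obo/robot/report/missing_definition  # hypothetical PURL
```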
or option B:
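That is, one markdown document per rule with YAML frontmatter, in the style of the gorules repo linked below (again an illustrative sketch, not the original example):

```markdown
---
id: missing_definition
severity: warning
purl: http://purl.obolibrary.org/obo/robot/report/missing_definition
---
# Missing definition

Every class should have a definition (IAO:0000115).

## Examples
...
```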
The advantage of B is that we can treat this more as documentation, and include sections with headers, examples, etc. This would be very much like the gorules system: https://github.com/geneontology/go-site/tree/master/metadata/rules In either case we could have PURLs for the rules, so they could be embedded in RDF metadata descriptions of results, SPARQL constructs, etc. |
As I was tagged: I have a strong preference for "A", as it is an easier-to-parse and more standard format, and document extraction could still trivially be done on a single field. I am not a fan of the GO rules setup. |
I agree that we need all that metadata, that it should be structured, and that rules should have PURLs. I'm not sure about the file format. In a previous project I used plain SPARQL (and SQL) files for validation rules, and I put the structured metadata (which might as well be YAML) and a longer description in a comment at the top, so the focus was on having a file that you could just run as a query. Call that option C.

@rctauber is working on some code to compare with #231, as we discussed above, and it will work this way. I don't have a strong preference among A, B, and C at the moment. It will be easy enough to convert any of them to nice HTML pages. |
The decision between A/B/C in part depends on what the goal is here. Is it providing a specification, or an implementation+metadata? What I think we need here is a specification. Not everything can be implemented in SPARQL, and we may want different implementations for different contexts. Even with trivial SPARQL, the query may have to be modified, e.g. to run on a multi-ontology triplestore.

For a specification, things like formatting and ease of editing/PRs by a variety of stakeholders of varying degrees of technical expertise are of utmost importance. Requiring a microlibrary for parsing yamldown seems like a minimal imposition on developers. However, editing SPARQL inside any kind of embedding is super-annoying. The same is true to some extent for any complex nested YAML, but I think this would be fairly minimal flat YAML.

Another option would be decoupling, managing each rule as a folder/dir, each minimally containing a markdown description file and a metadata file:
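For instance (names illustrative):

```
checks/
  missing_definition/
    missing_definition.md   # human-readable description, examples, how to fix
    metadata.yaml           # id, severity, PURL, status
```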
This could be seamlessly dropped into the OBO Jekyll framework, but comes with its own annoyances. |
@cmungall I definitely had implementation+metadata in mind, not just a specification. That's part of the reason I've been pushing for SPARQL over Java: SPARQL can be used in more contexts. I could be convinced that we need a layer of specifications over a layer of implementations, but I worry that the specification will be vague and that its implementations will drift apart.

@rctauber pushed some work here: beckyjackson@e263306 The ReportOperation mostly just loads SPARQL queries, executes them, and reports problems. Each query has "title" and "severity" as structured metadata in comments. This is just a first draft, for discussion. The structured output in #231 is a good idea, but this draft doesn't implement that.

Since my main worry is performance, I'd like to compare this to #231. What would be a good test case?
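A sketch of what such a query file might look like, with metadata carried in comments (the exact comment keys used in the draft are assumptions here):

```sparql
# title: missing_label
# severity: error
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?entity WHERE {
  ?entity a owl:Class .
  FILTER NOT EXISTS { ?entity rdfs:label ?label }
  FILTER (!isBlank(?entity))
}
```
|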
Cool. Can we test it on a SPARQL query for checking for references to dangling or obsolete classes?
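Such a check might look roughly like this in SPARQL (an illustrative sketch; the actual query used in the comparison is not shown in this thread):

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# axioms that point at a deprecated (obsolete) term
SELECT DISTINCT ?subject ?predicate ?obsolete WHERE {
  ?subject ?predicate ?obsolete .
  ?obsolete owl:deprecated "true"^^xsd:boolean .
  FILTER (?subject != ?obsolete)
}
```
|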
@rctauber has run some initial performance comparisons. As expected, SPARQL is slower and takes more memory, but it's not as bad as I feared.
- For OBI, ECO, and GO, the SPARQL implementation takes about 2-3 times as long to run as the Java.
- With both OWLAPI and SPARQL representations in memory, we need about 100% more memory for OBI and ECO, but just 25% more memory for GO.

That's all with Jena/Fuseki. We're going to try Blazegraph now. |
You can ping @yy20716 if you want help optimizing queries
|
After a couple of hours trying things out, Blazegraph was consistently slower and used more memory than Jena/Fuseki for this task. That's a bit of a surprise for me, since I usually find Blazegraph performs better. But in this case we're doing a lot of writes and not so many reads, so maybe that explains the difference. @rctauber will push her Blazegraph version to her repo, so you can see if we made any obvious mistakes. Based on these tests, I think that the performance hit for SPARQL is acceptable. What does everyone else think? |
Do you have rough absolute times? I forget how long the Java takes. |
The times are an average of 3 executions via the command line. I killed the GO execution with Blazegraph after ~180 seconds.
The most recent change to Blazegraph is here (using Futures): beckyjackson/robot@70ba125 |
@yy20716 will take a look at this to see if there are obvious ways to improve the Blazegraph performance, and if there are obvious ways to optimize in general. |
@rctauber, if you don't mind, could you please let us know how we can reproduce your results in the table? I cloned your "reporting" branch but wonder how you switched among Java, SPARQL, and Blazegraph for testing this reporting feature. If you have a set of commands for testing these options, could you please let us know? Thank you. |
Hi @yy20716 - if you'd like me to send you the JARs, I'd be happy to! |
Hello @rctauber, it seems that the use of newCachedThreadPool() in getViolation leads to suboptimal performance in the case of Blazegraph. On my machine, the Blazegraph implementation later failed due to an OOM (after eating up 16GB of memory). I also tried with newFixedThreadPool(8), but it was actually faster when I limited the number of threads to 1, i.e., newFixedThreadPool(1).

I also observed that the use of nested FILTERs with NOT EXISTS causes issues. Here is one of the queries that take several minutes in Blazegraph (even if we only executed this query without using any threading). From my understanding, this query checks whether punctuation exists at the end of the object value of IAO_0000115 (definition).
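The query text is not reproduced above; per the description, its shape was roughly the following (a reconstruction, not the original), with the problematic NOT EXISTS block containing only a FILTER and no graph pattern:

```sparql
PREFIX obo: <http://purl.obolibrary.org/obo/>

# definitions that do not end in punctuation (problematic form)
SELECT DISTINCT ?entity ?def WHERE {
  ?entity obo:IAO_0000115 ?def .
  FILTER NOT EXISTS {
    FILTER (regex(?def, "\\.\\s*$"))
  }
}
```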
The problem is that the input graph pattern is not explicitly given for the NOT EXISTS clause, so most query parsers interpret it as applying the filter over all triples, which is an ineffective operation. These filtered triples that do not contain dots are then joined with triples whose property is obo:IAO_0000115; thus it is likely to take a significant amount of processing time. It seems that Jena somehow optimized these operations, but Blazegraph processed the query based on the initial interpretation. If my interpretation is correct, this query can be simplified as follows.
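Again reconstructing the shape from the description (not the original text of the fix), the simplification pulls the test into a single negated FILTER over the already-bound value:

```sparql
PREFIX obo: <http://purl.obolibrary.org/obo/>

# definitions that do not end in punctuation (simplified form)
SELECT DISTINCT ?entity ?def WHERE {
  ?entity obo:IAO_0000115 ?def .
  FILTER (!regex(?def, "\\.\\s*$"))
}
```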
There were other similar queries, so I modified them as well. This modification reduced the execution time to approx. 2 mins for processing all queries over go-edit.obo. I also modified other queries, such as reducing the number of UNION branches as much as I could (though I forgot to rewrite one of the queries). I made a pull request at your GitHub repository and would appreciate it if you could check whether the modification makes sense to you. Thank you. |
Thanks so much @yy20716 - I merged that PR and built with the changes. The modifications do make sense, and I appreciate the explanation. Here are the new (your changes) vs. the old Blazegraph times:

GO actually completed, which is a huge improvement. But Jena still processes in ~40% less time. |
I am really glad that it worked. Yes, it is well known that Jena is generally faster than other approaches, including Blazegraph, when the size of the input data is small, i.e. up to 100-200MB. Most ontologies are not that large (i.e. smaller than 200MB), but Jena's performance quickly becomes worse once the data gets larger. I am not sure, but maybe it would be safe to use Blazegraph if the processing time is still acceptable for you or other members of this project; if not, Jena would be a reasonable choice. Thank you. |
@cmungall Do you want to move forward with the Java approach (#231) or the SPARQL approach (https://github.com/rctauber/robot/tree/reporting)? |
SPARQL |
Nothing makes me happier than killing verbose Java code I have written. Closed the PR. Declarative all the way.

May want to copy this bit of docs back: "At the moment master is using some procedural checks to warn for dangling axioms that could lead to reasoner incompleteness. This could probably all be moved over later to the SPARQL tool." |
Do we want to leave this open to continue discussing the checks (since this feature is implemented now), or create a new discussion issue (or mailing list convo?)? If we moved discussion, it could subsume #217 as well. |
We should have more focused discussions on particular rules from now on. |
ROBOT performs logical checks using `reason` and can run a custom suite of SPARQL queries using `verify`. Some repos successfully use this in combination with Makefiles and a custom suite of queries to implement robust checks in their pipeline (e.g. GO, and also OSK).

We should make it easier for groups to do this out of the box. While customization is good, ROBOT should still provide an OBO-ish core set of queries and reports that could be used by OBO central for evaluation purposes. The groups could still customize and say, for example, "allow some classes to lack definitions".
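A minimal sketch of that Makefile pattern (target, directory, and file names are made up; check the `verify` flags against your ROBOT version):

```makefile
# run a suite of SPARQL checks against the edit file
REPORTS = build/reports

verify: my-ontology-edit.owl
	robot verify --input $< \
	  --queries checks/*.rq \
	  --output-dir $(REPORTS)
```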
In GO we are still very reliant on this script; it depends on obo format but it is a good model for a set of potential standard checks: https://github.com/geneontology/go-ontology/blob/master/src/util/check-obo-for-standard-release.pl

This could be implemented by ROBOT distributing some SPARQL in its resources folder (or a standard OBO place to fetch these) and/or procedural code (sometimes easier than SPARQL).