-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Turtle quickcheck -- DO NOT MERGE! #103
base: master
Are you sure you want to change the base?
Conversation
… serialization. Theses tests reveal a possible problem with the `Arbitrary` instance for a `RDF TList`: the same URI is occassionally used for more than one node type. I.e., a URI might be used for a subject and predicate or subject and object. The instance should be updated to not used duplicate URIs for different node types.
…cross Node types.
BNodes are now unique across subject, predicate and objects.
I've pushed some more commits that ensure that the same URI is not reused across subjects, predicates and objects. This is the case for both |
@robstewart57 Do you have any thoughts on this work? Let me know if you think changes are needed. |
-- Assert that the graph serialization (first parameter) contains | ||
-- exactly one instance of the given subject (second parameter). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmciver that seems to be subtly different from what we discussed in #99 (comment)
Your countMatches
function looks like the needle (the subject) must not appear anywhere in the bytestring haystack (all subject/predicate/object data).
That would mean the prop_SingleSubject
property would not allow:
<http://www.example.com/a> <http://www.example.com/b> <http://www.example.com/c>
<http://www.example.com/d> <http://www.example.com/e> <http://www.example.com/a>
<http://www.example.com/f> <http://www.example.com/g> <http://www.example.com/a>
Since, when calling assertSingleSubject
with the subject <http://www.example.com/a>
as its second argument, the needle <http://www.example.com/a>
is part of the haystack in the 2nd and 3rd triples.
Have I misunderstood the intention here? Or indeed, your code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robstewart57 You did not misunderstand the code: you're right, this test does not account for a subject node in object position. Good catch! Considering that, this property seems much more difficult to test than I thought.
As I think about this, one possible solution might be to use the Turtle parser on the input graph, count the number of times a subject node occurs in non-subject position and then add 1. That should give the number of times that URI should occur in the ttl file.
One issue that this approach suffers from is if that URI appears multiple times in predicate position and then predicates are grouped, then that calculation will be wrong. I think this situation is rare but I think it's possible.
Another option would be to create a special-purpose parser that counts the number of times a given URI appears in subject position. But this seems like a lot of work and could therefore have its own bugs.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmciver Hi,
An alternative option is to implement an Arbitrary
instance of a special kind of RDF graph whereby:
- A URI that appears in a subject position may appear in multiple subject positions.
- A URI that appears in a subject position must not appear in any predicate or object positions in the graph.
The QuickCheck API should permit these constraints, and the prop_SingleSubject
property would then be valid.
We could do with other property tests which tests for various identity functions, e.g. #99 (comment) and parsing, serialising then parsing again the same RDF data content for each of the given RDF formats that we so far support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rob,
A great idea! I've been pretty busy of late but will try to work on that this week.
@tmciver Hi Tim, sorry for the delayed response, work very busy just now. I've added a comment, above. |
@robstewart57 This work is intended to use QuickCheck to ensure that subject grouping is happening correctly in the Turtle serialization as was discussed in #99. I've run into an issue and wanted to get some feedback on the technique used. Basically the approach is naive: serialize a graph to Turtle format and check that the file has only one instance of each subject. This works for the most part but the QuickCheck tests occasionally fail and it turns out that they fail because the same URI is sometimes used for both subject and predicate, or subject and object. I think the solution is to update the
RDF TList
Arbitrary
instance to not use the same URI for more than a given type of node. Thoughts?