Content of Turtle and RDFa documents should be wholly and entirely preserved #342
Comments
There is no citation for you, @RubenVerborgh, because those MUSTs are not taken from RFC, REC, or the like. They do, however, flow from the understanding that I believe was commonly shared among the LDP WG when writing the LDP REC, upon which Solid at least claimed to be based at one time, even if Solid is now dis-claiming that basis. One of Solid's early claims, if not promises, was to be a filesystem-ish datastore, which could be backed by an RDF store and could support SPARQL over the stored data, but both SPARQL and RDF features were considered a heavy-lift, so these were not promised. As an author of RDFa and Turtle documents, I consider the HTML content in the former and the whitespace and comments in the latter to be important -- else I would not have created RDFa at all, nor added whitespace and comments to the latter to make human processing of these files easier. In other words, if I did not care about the non-RDF content of these documents, I would have chosen a more "pure" RDF media type that did not support comments and/or author-applied whitespace -- which might include JSON-LD, which certainly does not support comments, and does not promise to maintain whitespace. Still, even with more "pure" RDF documents, I would not expect them to be broken down into their component RDF and only that saved in an RDF store without clear warning and user choice. I would generally expect the documents to be preserved as documents, complete with all their metadata (creation date in particular, but also modification date, and any other metadata supported by and transported from their origin to the Solid store). To some of your specific comments...
There are no "RDF whitespace and comments". There are syntactically valid, and presentationally and contextually valuable, whitespace and comments in Turtle documents. There is syntactically valid, presentationally and contextually valuable, content (a/k/a data) in RDFa documents.
I, and I am quite sure others, would not be happy to have some of their data discarded without warning because you, and possibly others specifying and coding Solid and related tools, decided that their Turtle- or RDFa-contained data was not really data. To my knowledge, there is nothing about the IANA-maintained media type lists which says "this part of this media type's content is real data and SHOULD/MUST be retained, but that part of this media type's content is not 'real' data and MAY be discarded." Minified and unminified JSON (and JSON-LD) are trivially and losslessly transformable into each other. There are no comments to be retained or lost, because JSON doesn't support comments (as of today, that is; I believe an upcoming JSON version, already in progress, will add support for inline comments). Inline newlines are only supported as Again, I have no RFC, REC, nor other "standard" or "specification" citation to offer, in significant part because the standard-setting bodies working on relevant specs in which I've participated agreed with my assessment above -- that non-RDF content of RDF-encoding media types is just as important as the RDF content, and all should be preserved -- and acted and discussed other things with the preceding as foundational and self-evident. Had there been any disagreement, I would have made sure that explicit statements to that effect were included! |
I understand the desire, but I do not see that there has been a clear historical path here. LDP defines a Linked Data Platform RDF Source (LDP-RS) as "An LDPR whose state is fully represented in RDF, corresponding to an RDF graph.", which in very clear terms does not include whitespace, comments, etc. It then defines a Linked Data Platform Non-RDF Source (LDP-NR) as "An LDPR whose state is not represented in RDF. For example, these can be binary or text documents that do not have useful RDF representations." The class of resources that you want is clearly not included in either of these classes: it is a class of resources that have a useful RDF representation, but where the representation includes more than RDF, and that additional content is significant. If that understanding was present in the LDP WG, I don't think it was in any way adequately expressed in normative statements. I believe that we can satisfy this by introducing such a class, and perhaps we should (I certainly can feel the pain around HTML+RDFa, and I'm sure @csarven feels it even more). I just don't think that can be a requirement right now, even if NSS could support that.
For the umpteenth time: I was an active participant in the LDP WG. I am reasonably certain that I understand what we meant, frustrated that we failed to communicate that, and more frustrated that my clear statements of our intended meaning (which I believe have yet to be disputed by anyone else who participated in the LDP WG, leaving me with no feeling of having misunderstood our intention) to clarify others' misunderstandings and misinterpretations are discounted and/or ignored. Our simple error here lies in having left out one word from the LDP-NR definition: "fully". We did not intend to have three types of LDPR, with two identified and defined explicitly as noted above and one identified and defined only implicitly by the gap between those two. We intended to have two types of LDPR, as explicitly identified above, i.e., LDP-RS which are "fully represented in RDF", and LDP-NR which are "not [fully] represented in RDF". Nothing in the definition above says that an LDP-NR cannot include RDF as part of its payload, only that RDF is not all of its payload. |
We've been through this material and discussion several times. I've been content to assume that Ted is sharing knowledge of LDP in good faith for some time now. Unless the editors or authors of the LDP spec or LDP WG members can clarify the language, I suggest we take that as is. Debating whether the language in LDP is meaningful or intuitive against Ted's explanation of what's intended is not particularly useful at this point. (I've been down that road. We don't all have to.) Having said that, the Solid Protocol - or at least some of the original servers and clients - neither assumed nor required LDP compliance. (I know because I wrote a client that expected the servers to implement LDP but they didn't actually deliver despite the fuzzy claims.) NonRDFSource and RDFSource were not something adopted or deemed necessary for Solid. They seemed to create more concerns than practical benefit. From my perspective, this whole thing turned into moving the Solid Protocol forward without bringing the whole LDP baggage (YMMV). But we are not completely free, since we do expect containers in Solid to behave like LDP(B)C. Perhaps we just need solid:Resource/Container (semi-serious suggestion).
In early-ish days, Solid tied itself loosely to LDP, which was supposed to be a good thing because LDP servers were supposed to require minimal adjustment to also be Solid servers. I am content to let Solid break from that loose binding IFF it doesn't lead to silently broken expectations of document preservation. In my world, when I upload a document -- whether that be Turtle, HTML+RDFa, JPG, PNG, CERT, or otherwise; and whether or not the store to which I'm loading it knows how to extract information from that document -- to a document store of any kind, I expect that document to be retrievable in the same condition, with the same content, as when it was uploaded unless it was edited/changed by some process I explicitly either approved or initiated. That includes when I write a Turtle or HTML+RDFa document to a Solid pod. |
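The expectation described above, that a document comes back in the same condition in which it was uploaded, can be stated as a byte-for-byte round-trip check. What follows is only a minimal sketch under stated assumptions: `VerbatimStore`, its `put`/`get` methods, and the `/photos/about.ttl` path are hypothetical names invented for illustration, not part of any Solid server or of the Solid Protocol.

```python
import hashlib
from typing import Dict, Tuple


class VerbatimStore:
    """Hypothetical in-memory store that never reinterprets the bytes it is given."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[bytes, str]] = {}

    def put(self, path: str, body: bytes, content_type: str) -> str:
        # Keep the exact bytes; return a digest the client can compare later.
        self._docs[path] = (bytes(body), content_type)
        return hashlib.sha256(body).hexdigest()

    def get(self, path: str) -> Tuple[bytes, str]:
        return self._docs[path]


# A hand-edited Turtle document with a comment and column-aligned whitespace.
turtle = (
    b"# who took the photo\n"
    b"@prefix ex: <http://example.org/> .\n"
    b"\n"
    b"ex:photo1   ex:takenBy   ex:ted .   # aligned by hand\n"
)

store = VerbatimStore()
digest_on_upload = store.put("/photos/about.ttl", turtle, "text/turtle")
body, content_type = store.get("/photos/about.ttl")

# The round trip is lossless only if the bytes (and therefore the digest) match.
assert body == turtle
assert hashlib.sha256(body).hexdigest() == digest_on_upload
```

A store that keeps only the parsed RDF graph and re-serializes on request could not be expected to pass this check for a document like the one above, since comments and alignment are not part of the graph.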
"First, do no harm." Or more aptly, "First, lose no user data." Also, there's the whole "distribution of effort" credo (which I can never bring to mind by the name others use for it) that says users should have to think the least, in order, after deployers, after programmers, after specifiers... We (OpenLink) use a lot of Turtle in our line of business applications (like this document), among other things. If you take a look there, you'll find a fair amount of comment and whitespace content included to aid human comprehension and future text-editor-based revision.
The more unfamiliar the realm of discourse, the more important such refinements are. Now, users could describe one Turtle file with another, and annotate the first with line-number-based identifiers, instead of inline comments, but that puts a lot of work on the Solid user, which is not required by any other document repository ... but once your Solid server that's refusing to preserve Turtle documents in toto ingests the content of those two files, there is no guarantee that the line-number-based descriptions from the second will remain accurate for any Turtle file that is eventually output with the same RDF graph as the first, because there's no ORDER required by SPARQL, etc. |
I'd suspect the WG captured some of the background in the mailing list, meeting minutes or the W3C issue tracker. I suppose we weren't compelled enough to dig into that - but that shouldn't stop anyone from doing that now. LDP's constraints re "fully" are (unnecessarily complex) plumbing either way. If we need a class to distinguish RDF stuff from non-RDF stuff, we can use solid:RDFDocument or solid:RDFSource (based off rdf11-concepts) :) The rest is just media types, multiple representations, graph comparisons. Sure, that "loss of quality" is not about RDF and the request is specifically for a representation that's a lower quality (i.e., usually in one of the concrete RDF syntaxes other than RDFa). Shrug. If we are not talking about RDF graph comparison, "loss of quality" can be expected when going from any concrete RDF syntax to another.
@TallTed I certainly acknowledge your contribution to LDP and Solid. It is just that expectations aren't really what implementors base their code on. Given those assumptions, the difference from an implementation standpoint is large. For RDF sources, it is clearly legitimate to persist only the RDF graph, so you use technologies available for that, like a quad store, and serialize when people make requests. For non-RDF content, it is clearly legitimate to not persist the RDF graph, and so it creates a divide. For a third class, you'd have to do both, and so the specification needs to be abundantly clear about it. I am already quite convinced that we need it (I'm not as hard to convince as @RubenVerborgh), but still, there is a piece of work that needs to be done.
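The three persistence strategies described in this comment (graph only, bytes only, or both) can be sketched as a small data model. This is an illustrative assumption rather than how NSS, CSS, or any other server is implemented: `StoredResource`, `storage_class`, and the label `"document-with-graph"` for the proposed third class are all hypothetical names, and the triples are written by hand because parsing is out of scope here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class StoredResource:
    """Hypothetical record of what a server chose to persist for one resource."""
    content_type: str
    graph: Optional[FrozenSet[Triple]] = None   # parsed RDF, e.g. in a quad store
    verbatim: Optional[bytes] = None            # the document exactly as uploaded


def storage_class(resource: StoredResource) -> str:
    """Name the persistence strategy implied by what was kept."""
    has_graph = resource.graph is not None
    has_bytes = resource.verbatim is not None
    if has_graph and has_bytes:
        return "document-with-graph"   # the proposed third class: persist both
    if has_graph:
        return "rdf-source"            # graph only; serialized again on request
    if has_bytes:
        return "non-rdf-source"        # opaque bytes; no graph persisted
    raise ValueError("resource has neither a graph nor verbatim bytes")


def can_return_original_bytes(resource: StoredResource) -> bool:
    """Only resources that kept their bytes can be returned unchanged."""
    return resource.verbatim is not None


source = b"# hand-written note\n<http://example.org/a> <http://example.org/b> <http://example.org/c> .\n"
triples = frozenset({("http://example.org/a", "http://example.org/b", "http://example.org/c")})

graph_only = StoredResource("text/turtle", graph=triples)
both = StoredResource("text/turtle", graph=triples, verbatim=source)
```

Under this model, `graph_only` can answer queries but cannot reproduce the comment in `source`, while `both` can do either at the cost of keeping two copies consistent, which is the extra work the comment refers to.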
@RubenVerborgh @kjetilk -- Am I not, as a Solid user, able to store arbitrary documents in my pod, as well as application-generated data files? I know, everybody's first thoughts are about pictures and movies, but textual documents are also among the commonly shared sorts of thing, and there's no reason why I, as an early Solid adopter, might not hand-edit some Turtle about those pictures and movies, or about some stuff that's not now (and might soon or might never be) in my pod.... What you're telling me here is that I must not be a tinkerer, I must not hand-edit my Turtle to please me, because only the app authors are allowed to do this. |
@TallTed you most certainly are able to store anything. As for tinkering with files, you can do that too with a Solid server that supports that, like NSS and CSS with certain backends. But you'd also have to be aware that if you're not editing through means provided by the protocol, some things are going to be hard: you need to make sure that containers are updated if you create a file on the file system, that ETags and last-modified times are updated, that you protect the data from unauthorized access, etc. The protocol doesn't exclude that kind of assumption, but it allows servers to make the assumption that they persist the RDF graph.
When did that become the case? You've always been able to store "text/turtle" by hand to a Solid Pod. A Solid Pod provides a filesystem-like experience to Agents (users or machines). I don't understand your response here, please clarify. |
@RubenVerborgh -- I would be delighted to not need to keep harping on this issue, but where comments by you and others in other issues demonstrate the validity of the assertions I've made here, which have yet to be accepted here as valid by you and others, I think it is valid to say so there. Your own sample JSON data incorporated comments which made that sample invalid as JSON, and when I flagged that, you noted that you could have used Turtle where your inline comments would have been valid. Are you now asserting that those comments were not important enough to retain? If so, I have to wonder why they were important enough to include in the first place ... and further, how other readers of your JSON snippet were meant to fully comprehend it, as the content of those comments was not replicated outside of the JSON snippet. |
You are making presumptions about potential readers with which I disagree. There is no certainty that "everyone on that page" knows any particular thing, including the disparate (lack of) support for comments in JSON and Turtle. It is my feeling that you want to exile any comments about full preservation of documents which you don't think need to be fully preserved, and gloss over even your own comments elsewhere which support my position and undercut yours. I would have thought it clear that I am not trying to sneak anything in anywhere. Rather, I am speaking in full view, where it seems relevant to me to do so. I will also thank you not to deride my comments as mere "vent[ing] about #342", which suggests that there is no merit whatsoever to my position, which you have justified simply by the fact that my position is not yours. If Solid is going to go forward without promising to preserve the full content of documents which Solid has decided are not worthy of full preservation, then I believe Solid will die a deservedly painful death, and has the potential to ruin many users' days along the way, as they discover the loss of their data. I would prefer that neither of these fates come to pass, and rather that Solid preserve the content of my, and their, documents. |
The most straightforward step is to warn the user when document content -- whether it's HTML within RDFa, or comments and/or whitespace within Turtle, or something else -- is going to be lost through that I believe there's been general agreement that HTML+RDFa documents should be preserved, and so they might not be The requirement to update resources via My proposed requirement is not to "preserve comments" but to "preserve all document content other than that which is explicitly and intentionally and informedly altered or deleted, with user consent, by user action" which might take the form of a confirmation dialog regarding a Turtle document, e.g., "Your |
I believe the structure to which you refer is inherent to JSON[-LD], and that that structure may be transformed in any direction multiple times without loss of any information, such that any structure may be retrieved/regenerated from any other structure. I don't think JSON[-LD] permits arbitrary whitespace similar to that permitted by Turtle. If it does, then that JSON[-LD] whitespace should also be preserved. I know that JSON includes JSON-LD (i.e., JSON-LD documents comprise a subclass of JSON documents); I do not think the reverse is so. In other words, if there are properties in a JSON document that do not map to RDF, that document is not properly treated as nor considered to be JSON-LD, even if it is named
When in doubt, preserve.
End users receiving that alert may decide to use a different client tool and/or server to edit the document, such as one that does not use They might also choose to sacrifice their comments, whitespace, etc.
Applications which are maintaining their own documents would not be restricted in such maintenance; those documents would almost certainly not have human-friendly whitespace, etc. My main concern is with documents the user has chosen to store in their Solid Pod, where they should have no expectation of other tools messing with their documents' content.
"A power-user wants to store a Turtle document alongside a set of photos, with manually edited Turtle descriptions of those photos. They want to preserve indentation and other semantically invisible but syntactically visible whitespace, such as column-aligned predicates and objects, to ease future edits of this Turtle document. They also want the content of this document to be ingested (but not maintained) by the gallery tool (which may not exist as of this writing), such that |
You're correct on JSON-LD with un-URI'd terms; my bad. That said, "ignore" is not the same as "delete" or "drop" or "blackhole". Which is, it seems to me, another argument in favor of what I'm arguing for: i.e., PRESERVE THE CONTENT YOU DON'T UNDERSTAND OR RECOGNIZE. (Remember that this is also how HTML-based web browsers "fail elegantly" with HTML tags they don't recognize -- not by deleting nor by not rendering the content within such unrecognized tags, just by treating those tags as if they weren't present.)
I have long been arguing against allowing If there is no such thing as a "document which is entirely represent[ed/able] in RDF", then there is no such thing as an All that said... I do not find it acceptable for Solid (nor any other service or server) to destroy data in my documents of any format or media type without at bare minimum telling me that's about to happen (e.g., "This document is about to go through a lossy transformation, from HTML+RDFa to N3, retaining only the RDF triples found in the original document"), and waiting for my approval to do so, before doing it. That's not going to change. |
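The "tell me first, then wait for my approval" behavior asked for above could look roughly like the following. This is a sketch under loud assumptions: the scanner only recognizes `#` comments outside `<...>` IRIs and double-quoted literals, it does not detect hand-applied whitespace, and `find_turtle_comments`, `approve_lossy_write`, and the `ask_user` callback are hypothetical names, not an existing Solid client or server API.

```python
from typing import Callable, List


def find_turtle_comments(text: str) -> List[str]:
    """Return '#' comments that appear outside IRIs and double-quoted strings.

    Simplified scanner for illustration; it is not a full Turtle tokenizer.
    """
    comments: List[str] = []
    in_iri = False
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "\\":          # skip the escaped character
                i += 2
                continue
            if ch == '"':
                in_string = False
        elif in_iri:
            if ch == ">":
                in_iri = False
        elif ch == '"':
            in_string = True
        elif ch == "<":
            in_iri = True
        elif ch == "#":
            end = text.find("\n", i)
            if end == -1:
                end = len(text)
            comments.append(text[i:end])
            i = end
            continue
        i += 1
    return comments


def approve_lossy_write(text: str, ask_user: Callable[[List[str]], bool]) -> bool:
    """Return True if a graph-only write may proceed.

    If comments would be discarded, the decision is delegated to the user.
    """
    comments = find_turtle_comments(text)
    if not comments:
        return True
    return ask_user(comments)


doc = (
    '@prefix ex: <http://example.org/ns#> .\n'
    '# photo taken in 2019\n'
    'ex:p1 ex:label "tag #1" . # inline note\n'
)
found = find_turtle_comments(doc)
```

Here `ask_user` stands in for whatever confirmation dialog a client would show; a caller that passes `lambda comments: False` models a user who declines the lossy transformation, in which case the write should not go ahead.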
@TallTed would a dedicated auxiliary resource preserving the verbatim representation satisfy your requirements? I think one would just need to anticipate that the RDF in it can get stale, but at least that would provide some way to retain the formatted version that was provided at some point. There would still be a lot of problems to answer, but I thought we could try brainstorming a bit.
@elf-pavlik auxiliary resources can only be LDP-RSs at present, which have the same issues as non-auxiliary RDF resources in preserving comments, whitespace, etc.
As I understand it, most Solid end-users will likely never hand-edit RDF files, so I do think it makes sense to not preserve comments and whitespaces by default. However, instead of using This would allow applications to still see RDF files that need such non-RDF data preserved with their proper media type, but still be able to take into account the fact that such non-RDF data must be preserved, so that operations such as PATCH may have to be applied differently (or are impossible). For example: Content type:
Content type:
There may be reasons to preserve comments, even if users don't hand-edit them. As things stand, we can't use Solid to just store and retrieve an arbitrary folder of source code with Turtle configuration files, and edit them as code, with valuable comments. There are times when files are used not just as a backend to store information, but as files themselves. Using a profile seems a better way forward.
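To make the profile idea a little more concrete, here is a sketch of how a server or client might read a `profile` parameter from a `Content-Type` header and use it to decide whether the bytes must be kept verbatim. The profile URI `https://example.org/ns/verbatim` is a placeholder invented for this example (the thread does not settle on one), and `parse_content_type` and `must_preserve_verbatim` are hypothetical helper names. Only the Python standard library's `email.message.Message` is used for parsing.

```python
from email.message import Message
from typing import Optional, Tuple

# Placeholder profile URI, assumed for illustration only.
VERBATIM_PROFILE = "https://example.org/ns/verbatim"


def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header into (media type, profile parameter or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_param("profile")


def must_preserve_verbatim(header_value: str) -> bool:
    """True when the sender flagged the document as one whose bytes matter."""
    _media_type, profile = parse_content_type(header_value)
    return profile == VERBATIM_PROFILE


flagged = 'text/turtle; profile="https://example.org/ns/verbatim"'
plain = "text/turtle"
```

With this shape, applications still see `text/turtle` as the media type, and a server could refuse or adapt operations such as PATCH when `must_preserve_verbatim` is true.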
And in cases like these, where we use ttl files as part of source code or config files, and persist them in Solid with the same dc:format, we may have to think about whether these files should contribute to the resultant knowledge graph of a pod. Thus we may(?) have to distinguish between those RDF resources which contribute to the knowledge graph (store), and resources which may have a MIME type of turtle|json-ld|etc. but are just opaque documents, which don't contribute to the knowledge graph (store). Source-code files are examples of the second kind. The intention of storing them in Solid is just to host them, not to add to the knowledge graph of the pod. As mentioned by @csarven, usage of two different
Just to think aloud: the first kind can be RDFSources (near to LDP-RS), which contribute to the knowledge graph. Their essence is just knowledge, and the representations used in transit for them are just manifestations of that knowledge. These resources are RDF-patchable with whatever patch format the Solid spec recommends. The second kind are Non-RDF Sources (near to LDP-NR), which can now have any MIME type, including ttl, but they don't contribute to the resultant knowledge graph. Their essence is their bytes themselves. Lossy conneg can be allowed. And they are not patchable through knowledge-patching mechanisms. They may be patchable by custom standards of file-diff/json-diff/etc., if required.
This direction, it seems to me, has great potential for further exploration. A few questions that come to my mind:
Based on @elf-pavlik's earlier comment, |
I think requiring users to set a profile on a media type, or to force them to set As to lossiness... A lossy A lossy A lossy To a fair extent, these issues flow from Solid's declared usability as both an application substrate and a human-accessible filestore. Drop the latter, and many things become permitted that I believe are not acceptable and therefore should not be permitted if the latter role is to be maintained. |
I think if someone uses some low-level client directly we can expect them to deal with profiles. If someone uses higher-level application it's up to the developers of that application to implement support for discussed profile or not. Any user can always choose an application that provides the features they need.
I find it ok if some application fails the user once - where it just loses some formatting/comments. When that happens, users can look for an application that preserves them. From the spec perspective, it seems reasonable to provide a straightforward way for applications to offer such a feature, given that every user having appropriate access to update the resource can choose their preferred application. It may be worth dedicating some time to think about supporting version control, so that a given storage can preserve all the history of every verbatim representation; everything else will probably always lead to losing some information.
The Solid definition of "RDF documents" is vital here, and apparently must be revisited in all related discussions, because it continues to be a moving target, to my eyes. The The current Solid definition of "RDF documents" as used here is becoming clear to me as "any document that includes RDF, where the Solid team considers the non-RDF content to be unimportant". That will lead me to minimize my use of Solid servers, no matter how much I agree with the base premises that the Web should be Writeable as well as Readable, and that my data is mine and should be under my control.
That is not the message that was communicated to me by any interview with @timbl nor any early conversations about Solid's promises, regardless of what was actually implemented in early Solid servers (of which I have understood Node Solid Server to be the primary "reference implementation" focus of 0.9 spec development, the result of which I have understood to be not a forward-looking prescriptive spec, but a backward-looking descriptive spec, with some vague hopes of shifting to a forward-looking prescriptive spec for 1.0 if not later). So. Optimizing for the 80% of use cases, as you characterize them, is not unreasonable IFF the effect of such optimization is clearly communicated, and that includes warning users who try to store something which content is not going to be wholly preserved, such as Turtle with comments or with long sequences of whitespace (the sniffing out of which I would think more complicated than simply retaining the original document, but what do I know?). Now, you've characterized "retaining the entirety of Turtle [and HTML-RDFa, and other |
The media type of Do you not comment your code, where the language allows for such? Do you expect those comments to be retained or dropped, when you store the code somewhere? I keep raising HTML+RDFa because it is a parallel use case. Yes, HTML+RDFa starts with HTML and adds RDF, and Turtle starts with RDF and adds comments/formatting -- but the result in both cases is a document with both human-targeted and machine-targeted content. It's a different situation if the RDF-containing media type does not support comments or machine-invisible, human-visible formatting (a/k/a whitespace). |
Comments (and indents) in C are not compiled, but they are retained in the C language documents! Just as I want my comments and indents in Turtle to be retained in my Turtle documents! How can this be so hard to understand?
Your decision loses data. This is, or should be, an absolutely unacceptable non-starter for any system that presents itself as usable as a document store, regardless of the flavor of the documents it stores. @timbl -- What say you? Do you not want and expect your inline comments to be preserved in your Turtle documents, unless you do explicitly consent to the serialized, materialized Turtle being transformed into abstract RDF and/or loaded into an RDF store, thereby dropping your comments and any whitespace formatting into the bitbucket? If this is just "the default" way of NSS (and whatever other Solid Server implementations), then there should be a switch somewhere -- and I don't much care where, except that it should be adjustable by the user, not only by the SS admin who is just as likely to say "this is the default and we've always done it this way and we're not changing anything for you", and it should be trivially accessible, minimally brought to the user's attention upon their attempt to store a document from which data will be lost if it is stored with the settings in that position. The default on a new instance should be to preserve data -- i.e., to preserve comments, indents, etc. -- and I think that if this is too difficult to make the default on upgrade instances (i.e., making the setting for existing users to not preserve data), then those existing users will just have to get used to the new behavior -- because discarding data without explicit user consent is not and should not be acceptable. |
Originally discussed by @RubenVerborgh, @kjetilk, and @TallTed in #301 (comment) and preceding
(@TallTed Should you want to continue the discussion, could you please open another issue as per @kjetilk's suggestion?)