Expressing metadata in multiple scripts/languages #124
Comments
@murata2makoto please comment on behalf of JEPA |
Expressing each piece of metadata (e.g., titles and author names) in multiple scripts (CJK ideographs plus a phonetic script such as Kana) is a must for Japanese. This is because some names have multiple, equally reasonable readings. For example, 智子 might be pronounced as either Tomoko or Satoko. We really do not know. |
Thank you @murata2makoto for this feedback. I think that we need to carefully reconsider what's in our infoset. Here are a few notes:
I'll also open a separate issue about reading direction, this is a related problem but it's more difficult to solve IMO. |
My approach would be:
I think we should include the consideration of direction in this issue right from the start. |
RWPM has a slightly different approach:
In the RWPM context we're currently using a language map, which works fine when we only define a language but can't carry a direction as well (each value MUST be a string or an array of strings). Here is an example for a title:
"title": {
  "fr": "Vingt mille lieues sous les mers",
  "en": "Twenty Thousand Leagues Under the Sea",
  "ja": "海底二万里"
}
We could of course adopt a different approach instead of a language map, for example:
"title": [
  {
    "@value": "Vingt mille lieues sous les mers",
    "@language": "fr"
  },
  {
    "@value": "Twenty Thousand Leagues Under the Sea",
    "@language": "en"
  }
]
This would allow the inclusion of an additional |
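To illustrate how the two shapes above relate, here is a minimal Python sketch that turns a language map into the array-of-objects form. The function name `language_map_to_values` is hypothetical and not part of RWPM or JSON-LD; the sketch only assumes that a language map entry is a string or an array of strings, as stated above.

```python
def language_map_to_values(language_map):
    """Convert {"fr": "..."} into [{"@value": "...", "@language": "fr"}, ...]."""
    values = []
    for language, text in language_map.items():
        # A language map entry may be a single string or an array of strings.
        texts = [text] if isinstance(text, str) else list(text)
        for item in texts:
            values.append({"@value": item, "@language": language})
    return values


title_map = {
    "fr": "Vingt mille lieues sous les mers",
    "en": "Twenty Thousand Leagues Under the Sea",
    "ja": "海底二万里",
}
title_values = language_map_to_values(title_map)
```

The object form leaves room for extra keys on each entry, which the language map cannot offer because its values are bare strings.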
@HadrienGardeur yes, this is roughly what I had in mind, and I believe the two forms you wrote are equivalent, but adding the direction may lead to some problems. Can we put this on hold? I will look at it later this week. |
@HadrienGardeur
is invalid JSON-LD. Which means that, e.g., the JSON-LD playground rejects it. We could come up with a hack. Another possibility is that we wait for the JSON-LD WG to be formed and raise an issue. Yet another is that we raise an issue with the CG that delivers JSON-LD 1.1. In any case, I would feel bad coming up with some sort of a hack ourselves... |
This is why I like HTML. 😸 |
Well, actually... you gave me an idea. We may go in this direction (though not necessarily the way you think :-). The problem with directionality arises when things get mixed up, i.e., when the language itself is not enough for a proper interpretation of the characters and the BIDI algorithm needs some extra "help". The texts of @r12a, like bidirectional text in HTML or [the bidi algorithm description](https://www.w3.org/International/articles/inline-bidi-markup/uba-basics), explain this much better than I will ever be able to. The situation may come up in titles. The problem, if we use JSON-LD, is that RDF strings have no built-in constructs to handle that. The only thing you can do is assign a language or (and that is an exclusive "or") a special datatype. On the other hand, HTML gives all the tools necessary to describe these intricacies which, let us face it, are not the majority of our usages. But... RDF (and therefore JSON-LD) offers a hack: the rdf:HTML datatype. Essentially, it says "consider this text an HTML fragment, and interpret it accordingly". So it is perfectly possible to do the following in JSON-LD (using an example from Richard's text):
This is perfectly fine JSON-LD (and RDF). It feels and smells like a hack, but may give a direction (sic!) for thoughts... (I was wondering about a slightly different direction, namely to define some parametrized datatypes that would combine a language tag and a writing direction, but we would end up reproducing the HTML semantics...) Cc: @r12a (Note that JSON-LD 1.1 has a type-based indexing just like the indexing along language tags, as used in the example of @HadrienGardeur. We may have a use case for a 'datatype' indexing for the evolution of JSON-LD 1.1) |
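To make the rdf:HTML idea concrete, here is a minimal Python sketch of what a consumer would have to do with such a value. The title literal is a hypothetical stand-in, not the example from Richard's text, and `FragmentReader` and `read_html_literal` are illustrative names; only the standard-library `html.parser` module and the rdf:HTML datatype IRI are real.

```python
from html.parser import HTMLParser

RDF_HTML = "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"

# Hypothetical title: an HTML fragment that carries its own language and direction.
title = {
    "@value": '<span lang="he" dir="rtl">שלום</span> world',
    "@type": RDF_HTML,
}


class FragmentReader(HTMLParser):
    """Collects the plain text and any dir attributes found in a fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.directions = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "dir":
                self.directions.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def read_html_literal(literal):
    """Return (plain_text, directions) for a value typed as rdf:HTML."""
    if literal.get("@type") != RDF_HTML:
        raise ValueError("not an rdf:HTML literal")
    reader = FragmentReader()
    reader.feed(literal["@value"])
    reader.close()  # flush any text still buffered by the parser
    return "".join(reader.text_parts), reader.directions


plain_text, directions = read_html_literal(title)
```

Even this small reader has to run an HTML parser over the string before it can display or sort the title, which is the cost of carrying direction this way.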
I really dislike this solution, it will make things extremely complicated for User Agents and I don't think it's making things easier for authoring either. Forcing UAs to implement on every string:
... is not my definition of a good solution for a super specific problem. We've already solved the issue of multiple languages/scripts with the language map in RWPM, I'd rather wait for the UTF-8, RDF or JSON-LD community to solve this issue with reading direction than implement a hack that will deeply impact everything we do. |
@iherman interesting approach. We didn't quite go that far with Web Annotation, but certainly spec'd out how one would express HTML in JSON-LD as well as There are several different ways to model this for moving HTML around inside JSON-LD. However, I think the whole thing has a bad smell. What's regrettable is that we are re-recording information which is also likely to have an HTML representation. Because of the needs of i18n (which are legitimate needs!), we're now likely to need to put HTML into JSON... Consequently, I think we need to get the JSON out of the way, and reappraise the needs of an "infoset" serialization. HTML is far more expressive, human readable, extensible, displayable, and has far more multi-lingual work done for it. JSON has nearly none of those features, and is likely simply to be used as a "transport" format ultimately to be put back into some HTML-based UI. I'd like us to reconsider the decision made in #7, because I'm hopeful the reasoning behind using HTML (rather than JSON) is increasingly clear to more folks. |
@HadrienGardeur I did not say I like it:-) But I do not see, at this moment, any better solution within the existing specifications. As I mentioned, my other option was to define a number of datatypes and use those, but that would require an extra specification work and to get at least some of the RDF environments to accept the datatypes. If we use JSON but not JSON-LD, then the problem does not arise, in fact. We can easily add a direction to any structure, and the only complication for a JSON parser would be to accept, for a key, either a string or an object that includes a string value with additional information about it. A pretty standard way of operating in the JSON world. |
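The "either a string or an object" handling described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the key names `value`, `language` and `direction` and the function `normalize_metadata` are hypothetical, chosen to match the plain-JSON style discussed in this thread rather than any agreed serialization.

```python
def normalize_metadata(value, default_language=None, default_direction=None):
    """Return a list of {"value", "language", "direction"} dicts."""
    if isinstance(value, str):
        # Bare string: fall back to the publication-level defaults.
        return [{
            "value": value,
            "language": default_language,
            "direction": default_direction,
        }]
    if isinstance(value, dict):
        # Object: a string value plus optional extra information about it.
        return [{
            "value": value["value"],
            "language": value.get("language", default_language),
            "direction": value.get("direction", default_direction),
        }]
    if isinstance(value, list):
        # Array: normalize every entry and flatten the result.
        entries = []
        for item in value:
            entries.extend(
                normalize_metadata(item, default_language, default_direction)
            )
        return entries
    raise TypeError(f"unsupported metadata value: {value!r}")


plain = normalize_metadata(
    "Twenty Thousand Leagues Under the Sea", default_language="en"
)
rich = normalize_metadata(
    {"value": "海底二万里", "language": "ja", "direction": "ltr"}
)
```

After normalization the rest of a reading system only ever sees one shape, so the extra parsing stays confined to a single function.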
BTW, I've looked at what we have in EPUB right now, and while it partially solves the problem for language, it doesn't handle the issue completely for direction:
<package dir="ltr">
  <metadata>
    <dc:creator opf:alt-rep-lang="ja" opf:alt-rep=" 樹春上村">Haruki Murakami</dc:creator>
  </metadata>
</package>
I can't express the direction of the It does provide a little more flexibility than our infoset though, since a number of elements allow |
@iherman curious how you see this being that much "cleaner" in JSON vs. JSON-LD. |
I can do in JSON what I cannot do in JSON-LD, see #124 (comment). What stands in the way are the JSON-LD restrictions or, to be more precise, the RDF restrictions... |
@iherman that looks a lot like the format we made for Web Annotation (which is JSON-LD): https://www.w3.org/TR/annotation-model/#example-4
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type": "Annotation",
  "body": {
    "type" : "TextualBody",
    "value" : "<p>j'adore !</p>",
    "format" : "text/html",
    "language" : "fr"
  },
  "target": "http://example.org/photo1"
}
What am I missing here? 😃 |
@iherman you're 100% right that this is a JSON-LD issue rather than a JSON issue. Here's an example in pure JSON:
"title": [
  {"language": "fr", "value": "Vingt mille lieues sous les mers"},
  {"language": "en", "value": "Twenty Thousand Leagues Under the Sea"},
  {"language": "ja", "value": "海底二万里", "direction": "ltr"}
]
This example goes beyond what EPUB 3.x supports:
we're not limited to 2 languages, we can include as many alt representations as we need
the direction can be expressed on the alt representation as well
We just need to figure out how we could avoid parsing direction in the JSON-LD context. |
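As a rough illustration of how a reading system might consume the pure JSON form above, here is a short Python sketch that picks the best entry for a reader's preferred languages. The function `pick_title` and its fallback rule (first entry wins when nothing matches) are assumptions made for this example, not behavior defined by RWPM or the WP drafts.

```python
import json

manifest_fragment = json.loads("""
{
  "title": [
    {"language": "fr", "value": "Vingt mille lieues sous les mers"},
    {"language": "en", "value": "Twenty Thousand Leagues Under the Sea"},
    {"language": "ja", "value": "海底二万里", "direction": "ltr"}
  ]
}
""")


def pick_title(entries, preferred_languages):
    """Return the first entry matching the preferences, else the first entry."""
    for language in preferred_languages:
        for entry in entries:
            if entry.get("language") == language:
                return entry
    # Assumed fallback: nothing matched, so use the first listed entry.
    return entries[0] if entries else None


japanese = pick_title(manifest_fragment["title"], ["ja", "en"])
fallback = pick_title(manifest_fragment["title"], ["de"])
```

Because each entry is an object, the direction travels with the string it applies to and needs no separate lookup.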
@iherman I just tried the following example in JSON-LD playground and it works fine:
{
  "@context": {"title": "http://schema.org/name"},
  "title": [
    {"@language": "fr", "@value": "Vingt mille lieues sous les mers"},
    {"@language": "en", "@value": "Twenty Thousand Leagues Under the Sea"},
    {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}
  ]
}
It ignored |
@iherman <https://github.com/iherman> you're 100% right that this is a JSON-LD issue rather than a JSON issue.
Here's an example in pure JSON:
"title": [
{"language": "fr", "value": "Vingt mille lieues sous les mers"},
{"language": "en", "value": "Twenty Thousand Leagues Under the Sea"},
{"language": "ja", "value": "海底二万里", "direction": "ltr"}
]
This example goes beyond what EPUB 3.x supports:
we're not limited to 2 languages, we can include as many alt representations as we need
the direction can be expressed on the alt representation as well
We just need to figure out how we could avoid parsing direction in the JSON-LD context.
The only way I see now (and I would be happy to be proven wrong) is to define the terms 'value', 'language', and 'direction' in our own "namespace", so to speak, as terms defined in our own `@context`, and ignore their native JSON-LD/RDF meaning. But that would not be a really good direction either, I guess...
|
Wrong button @iherman. 😄 Also, introducing two disparate processing models is likely a Bad Thing. @HadrienGardeur there is a JSON-LD WG in the offing (we hope!), so now would be a great time for stating the need for text direction expression to the JSON-LD Community Group. Specifically, send an email to this mailing list https://lists.w3.org/Archives/Public/public-linked-json/ Beyond what might be available, the way the Web Annotation WG did it still seems viable, and could do with some proper consideration. |
Wrong button indeed, sorry :-( |
I am not sure that would work, worth a try with the json-ld playground. I think redefining "@value" is just cosmetics in this sense, it would not make us avoid the problem in #124 (comment). |
@BigBlueHat two separate processing models? Do you mean string, array or object for each metadata like in RWPM? It does introduce a bit of extra processing, but nothing compared to the processing of the default reading order (I'm working on this currently for the draft and it's much much worse than anything we're discussing here). @iherman I tried both options (keeping "@value" and "@language" as is, or redefining them in the context) and they both work fine in the JSON-LD playground. |
@HadrienGardeur: redefining "@value" and "@language" of course works. But the following does not:
on playground this leads to the error message:
|
@iherman don't include anything about the direction in the "@context" and it works fine in JSON-LD playground:
{
  "@context": {
    "title": "http://schema.org/name",
    "value": "@value",
    "language": "@language"
  },
  "title": [
    {"@language": "fr", "@value": "Vingt mille lieues sous les mers"},
    {"@language": "en", "@value": "Twenty Thousand Leagues Under the Sea"},
    {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}
  ]
}
|
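The reason this passes can be pictured with a toy Python model. It is emphatically not the JSON-LD expansion algorithm; it only assumes, as reported in this thread, that a key with no mapping in `@context` is silently dropped. The names `VALUE_KEYWORDS` and `drop_unmapped_keys` are made up for the illustration.

```python
# Keywords allowed on a value object in this toy model.
VALUE_KEYWORDS = {"@value", "@language", "@type", "@index"}

context = {
    "title": "http://schema.org/name",
    "value": "@value",
    "language": "@language",
}


def drop_unmapped_keys(value_object, context):
    """Keep keywords and context aliases of keywords; drop everything else."""
    kept = {}
    for key, val in value_object.items():
        if key in VALUE_KEYWORDS:
            kept[key] = val
        elif context.get(key) in VALUE_KEYWORDS:
            # An alias such as "value" is rewritten to its keyword.
            kept[context[key]] = val
        # Anything else, such as "dir", has no mapping and is discarded.
    return kept


expanded = drop_unmapped_keys(
    {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}, context
)
```

The direction survives in the JSON document for plain JSON consumers but is lost to anything that works from the JSON-LD or RDF view of the data.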
Please see https://w3c.github.io/string-meta/ and coordinate needs with i18n rather than invent something new in isolation here (apologies if that conversation is already happening). |
@plinss thanks for pointing that out, we should definitely reach out to them. I just did a quick scan of the document and one of the best-practice examples is almost exactly what I've proposed: https://w3c.github.io/string-meta/#bestPractices @iherman what would be the best way to coordinate with that group? |
Oh yes, you are right, I forgot about this trick. However, it is a trick which means that the resulting RDF metadata will not be proper. But we may have no other solution. Personally, I am fine if, for the time being, this is the way we go, but I would think that @BigBlueHat and I will have to raise this issue at the (hopefully upcoming) JSON-LD WG to see if there is a better solution. |
@plinss I think all the approaches we have been discussing here are in line with that document. The problem is that the document you refer to does not deal with the fact that direction cannot be represented in RDF, which means it cannot (properly) be done in JSON-LD either. That being said, the RDF issue should indeed be solved outside this group, too. |
I've extracted a few points from the best practice section of https://w3c.github.io/string-meta/
The document also considers that the best practice is to use a language map + Localizable dictionary, which IMO is a little problematic:
|
@HadrienGardeur, I have raised an issue in the JSON-LD CG and also commented on the string-meta document. I would propose that, at this point:
|
I also created an additional issue at w3c/string-meta#13 |
I will also update the lifecycle branch to include the |
There is still an aspect that is badly covered by Unicode (bidi controls), which is mixing ltr and rtl scripts in a single string in a data format like JSON. I suggest that each piece of metadata (even the title) be expressed in a single language and direction. More complex information can be expressed as HTML in the content itself, which seems to be largely sufficient. |
Propose closing: this is now part of the latest draft (per #129). The JSON serialization may be tricky, but this should be looked at, when the time comes, via separate issues... |
The current WP infoset is fairly consistent with the WAM:
This is quite different from EPUB 3.x where each metadata can be expressed in multiple scripts/languages. Here's an example from the 3.1 spec:
Readium and the RWPM also provide support for multiple scripts/languages per property:
Since the Japanese publishing industry (ping @frivoal) told us multiple times in the past that this is very important for them, I'm wondering if the current direction for WP is on purpose or not.