Expressing metadata in multiple scripts/languages #124

Closed
HadrienGardeur opened this issue Jan 24, 2018 · 37 comments

@HadrienGardeur
Member

The current WP infoset is fairly consistent with the WAM:

  • there's a single language element meant to indicate the language of all properties expressed in the manifest
  • same thing for the direction

This is quite different from EPUB 3.x, where each metadata element can be expressed in multiple scripts/languages. Here's an example from the 3.1 spec:

<dc:creator opf:alt-rep-lang="ja" opf:alt-rep="村上 春樹">
    Haruki Murakami
</dc:creator>

Readium and the RWPM also provide support for multiple scripts/languages per property:

"author": {
  "name": {
    "ru": "Михаил Афанасьевич Булгаков",
    "en": "Mikhail Bulgakov",
    "fr": "Mikhaïl Boulgakov"
  }
}

Since the Japanese publishing industry (ping @frivoal) told us multiple times in the past that this is very important for them, I'm wondering if the current direction for WP is on purpose or not.

@TzviyaSiegman
Contributor

@murata2makoto please comment on behalf of JEPA

@murata2makoto

Expressing each piece of metadata (e.g., titles and author names) in multiple scripts (CJK ideographs and a phonetic script such as Kana) is a must for Japanese. This is because some names have multiple, equally reasonable readings. For example, 智子 might be pronounced either Tomoko or Satoko. From the written form alone, we really do not know.

@HadrienGardeur
Member Author

Thank you @murata2makoto for this feedback.

I think that we need to carefully reconsider what's in our infoset. Here are a few notes:

  • the requirements for Japanese are incompatible with the design of the WAM as well
  • RWPM allows more than one "alternative representation" which should be a perfect fit (see example above)

I'll also open a separate issue about reading direction, this is a related problem but it's more difficult to solve IMO.

@iherman
Member

iherman commented Jan 30, 2018

My approach would be:

  • It is o.k. to have the current language+direction setting as a default for the metadata. This can cover many of the use cases, at least in the US and Europe.

  • (This is still hazy.) Wherever we would have an essentially textual value for an information item, we should instead allow a structure consisting of a value, a language tag, and a direction. This is fine in simple JSON; I still have to see how this can properly be done in JSON-LD (maybe with some JSON-LD 1.1 features).

I think we should include the consideration on directions into this issue right from the start.

@HadrienGardeur
Member Author

HadrienGardeur commented Jan 30, 2018

RWPM has a slightly different approach:

  • at a manifest level, the language is meant to express the language of the overall publication rather than its metadata (this is useful to preload a dictionary or handle search better for instance)
  • unless another language is specifically indicated per element, this language is assumed to apply to that element as well

In the RWPM context we're currently using a language map, which works fine when we only define a language but can't also carry a direction (each value MUST be a string or an array of strings).

This is an example on a title:

"title": {
  "fr": "Vingt mille lieues sous les mers",
  "en": "Twenty Thousand Leagues Under the Sea",
  "ja": "海底二万里"
}
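
A minimal Python sketch of how a consumer might read such a language map. The helper name `pick_localized` and the exact-match fallback policy are assumptions made for illustration; neither RWPM nor the WP draft defines them, and real language negotiation (BCP 47 range matching) is more involved.

```python
def pick_localized(value, preferred, default_language="und"):
    """Return (language, text) for the best match in a language map.

    `value` is either a plain string or a dict keyed by language tag.
    `preferred` is an ordered list of language tags (exact match only).
    """
    # A plain string carries no language of its own: use the default.
    if isinstance(value, str):
        return default_language, value
    for lang in preferred:
        if lang in value:
            return lang, value[lang]
    # None of the preferred languages is present: fall back to the first entry.
    lang = next(iter(value))
    return lang, value[lang]


title = {
    "fr": "Vingt mille lieues sous les mers",
    "en": "Twenty Thousand Leagues Under the Sea",
    "ja": "海底二万里",
}
```

Note that nothing in this shape leaves room for a direction, which is the limitation discussed in this thread.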

We could of course adopt a different approach instead of a language map, for example:

"title": [
  {
    "@value": "Vingt mille lieues sous les mers", 
    "@language": "fr"
  },
  {
    "@value": "Twenty Thousand Leagues Under the Sea", 
    "@language": "en"
  }
]

This would allow the inclusion of an additional direction as well (ignored by RDF parsers). Are these two semantically the same from an RDF output perspective @iherman?
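
To make the relationship between the two shapes concrete, here is a small Python sketch that rewrites a language map as a list of value objects. The function name is hypothetical, and this only illustrates the data shapes; whether both yield the same RDF is the question asked above.

```python
def language_map_to_value_objects(language_map):
    """Rewrite {"fr": "..."} as [{"@value": "...", "@language": "fr"}]."""
    objects = []
    for lang, texts in language_map.items():
        # A language map entry is a string or an array of strings.
        if isinstance(texts, str):
            texts = [texts]
        for text in texts:
            objects.append({"@value": text, "@language": lang})
    return objects


titles = language_map_to_value_objects({
    "fr": "Vingt mille lieues sous les mers",
    "en": "Twenty Thousand Leagues Under the Sea",
})
```

Unlike the language map, each object in the list form has room for an extra key such as a direction.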

@iherman
Member

iherman commented Jan 30, 2018

@HadrienGardeur yes, this is roughly what I had in mind, and I believe the two forms you wrote are equivalent, but adding the direction may lead to some problems. Can we keep this on hold? I will look at it later this week.

@iherman
Member

iherman commented Jan 31, 2018

@HadrienGardeur
The problem is that the following JSON-LD:

"title": {
  "@value": "Vingt mille lieues sous les mers",
  "@language": "fr",
  "dir": "ltr"
}

is invalid JSON-LD. Which means that, e.g., the JSON-LD playground rejects it.

We could come up with a hack. Another possibility is that we wait for the JSON-LD WG to be formed and raise an issue. Yet another is that we raise an issue with the CG that delivers JSON-LD 1.1. In any case, I would feel bad coming up with some sort of a hack ourselves...

@BigBlueHat
Member

This is why I like HTML. 😸

@iherman
Member

iherman commented Jan 31, 2018

Well, actually... you gave me an idea. We may go in this direction (but not necessarily the way you think :-).

The problem with directionality arises when things get mixed up, i.e., when the language alone is not enough for a proper interpretation of the characters and the BIDI algorithm needs some extra "help". The texts of @r12a, like bidirectional text in HTML or the bidi algorithm description (https://www.w3.org/International/articles/inline-bidi-markup/uba-basics), explain this much better than I will ever be able to. The situation may come up in titles.

The problem, if we use JSON-LD, is that RDF strings do not have built-in constructs to handle that. The only thing you can do is to assign a language or (and that is an exclusive "or") a special datatype. On the other hand, HTML gives all the tools that are necessary to describe these intricacies which, let us face it, are not the majority of our usages.

But... RDF (and therefore JSON-LD) offers a hack: the rdf:HTML datatype. Essentially, it says "consider this text an HTML fragment, and interpret it accordingly". It is then perfectly possible to do the following in JSON-LD (using an example from Richard's text):

{
  "@context": {
    "rHTML" : "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML",
    "ex" : "http://example.org/"
  },
  
  "ex:title" : {
    "@value" : "<p>The title is <cite dir=\"rtl\">مدخل إلى <span dir=\"ltr\">C++</span></cite> in Arabic.</p>",
    "@type":"rHTML"
  }
}

This is perfectly fine JSON-LD (and RDF). It feels and smells like a hack, but may give a direction (sic!) for thoughts...

(I was wondering about a slightly different direction, namely to define some parametrized datatypes that would combine a language tag and a writing direction, but we would end up reproducing the HTML semantics...)

Cc: @r12a


(Note that JSON-LD 1.1 has a type-based indexing just like the indexing along language tags, as used in the example of @HadrienGardeur. We may have a use case for a 'datatype' indexing for the evolution of JSON-LD 1.1)

@HadrienGardeur
Member Author

I really dislike this solution, it will make things extremely complicated for User Agents and I don't think it's making things easier for authoring either.

Forcing UAs to implement on every string:

  • HTML parsing
  • white listing HTML elements
  • HTML entities decoding
  • most likely a sanitize helper to get rid of all sorts of things
  • plus mapping HTML to whatever the native platform is capable of

... is not my definition of a good solution for a super specific problem.
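
For a sense of scale, here is a rough Python sketch of only the first steps in that list, using the standard library's `html.parser`. The whitelist and the class and function names are assumptions for illustration. The sketch reports non-whitelisted tags but does not remove their text, and it does no mapping to a native text API, so a real user agent would need considerably more.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "cite", "span", "bdi", "em"}  # hypothetical whitelist


class MetadataTextExtractor(HTMLParser):
    """Collect text, dir attributes, and non-whitelisted tags from a fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.directions = []
        self.rejected = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.rejected.append(tag)
            return
        direction = dict(attrs).get("dir")
        if direction in ("ltr", "rtl"):
            self.directions.append(direction)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(fragment):
    """Return (plain_text, dir_values_in_order, rejected_tag_names)."""
    parser = MetadataTextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts), parser.directions, parser.rejected
```

Even this much flattens the nested `dir` information into a list, so rendering the string correctly would still need further work.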

We've already solved the issue of multiple languages/scripts with the language map in RWPM, I'd rather wait for the UTF-8, RDF or JSON-LD community to solve this issue with reading direction than implement a hack that will deeply impact everything we do.

@BigBlueHat
Member

@iherman interesting approach. We didn't quite go that far with Web Annotation, but certainly spec'd out how one would express HTML in JSON-LD as well as textDirection for non-HTML content, etc:
https://www.w3.org/TR/annotation-model/#embedded-textual-body

There are several different ways to model this for moving HTML around inside JSON-LD. However, I think the whole thing has a bad smell.

What's regrettable is that we are re-recording information which is also likely to have HTML representation. Because of the needs of i18n (which are legitimate needs!), we're now likely to need to put HTML into JSON...

Consequently, I think we need to get the JSON out of the way, and reappraise the needs of an "infoset" serialization.

HTML is far more expressive, human readable, extensible, displayable, and has far more multi-lingual work done for it. JSON has nearly none of those features, and is likely simply to be used as a "transport" format ultimately to be put back into some HTML-based UI.

I'd like us to reconsider the decision made in #7, because I'm hopeful that the reasoning behind using HTML (rather than JSON) is increasingly clear to more folks.

@iherman
Member

iherman commented Jan 31, 2018

@HadrienGardeur I did not say I like it :-) But I do not see, at this moment, any better solution within the existing specifications. As I mentioned, my other option was to define a number of datatypes and use those, but that would require extra specification work, and we would need to get at least some of the RDF environments to accept the datatypes.

If we use JSON but not JSON-LD, then the problem does not arise, in fact. We can easily add a direction to any structure, and the only complication for a JSON parser would be to accept, for a key, either a string or an object that includes a string value with additional information about it. A pretty standard way of operating in the JSON world.
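
A short Python sketch of the "string or object" pattern described above. The key name `value` and the function name are assumptions chosen for illustration; nothing here is defined by the draft.

```python
def as_localizable_list(raw):
    """Normalize a metadata value into a list of {"value": ...} objects.

    Accepts a string, an object with a string "value", or a list of either.
    """
    items = raw if isinstance(raw, list) else [raw]
    result = []
    for item in items:
        if isinstance(item, str):
            # Bare string: wrap it so later code handles a single shape.
            result.append({"value": item})
        elif isinstance(item, dict) and isinstance(item.get("value"), str):
            # Copy the object so extra keys (language, direction) are preserved.
            result.append(dict(item))
        else:
            raise TypeError(f"unsupported metadata value: {item!r}")
    return result
```

After this step, consumers can look for optional keys such as a language or a direction without caring which form the author used.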

@HadrienGardeur
Member Author

BTW, I've looked at what we have in EPUB right now, and while it partially solves the problem for language, it doesn't handle the issue completely for direction:

<package dir="ltr">
  <metadata>
    <dc:creator opf:alt-rep-lang="ja" opf:alt-rep=" 樹春上村">Haruki Murakami</dc:creator>
  </metadata>
</package>

I can't express the direction of the opf:alt-rep if it's different from package or dc:creator.

It does provide a little more flexibility than our infoset though, since a number of elements allow dir for their text node (but not their attributes).

@BigBlueHat
Member

@iherman curious how you see this being that much "cleaner" in JSON vs. JSON-LD.

@iherman
Member

iherman commented Jan 31, 2018

I can do in JSON what I cannot do in JSON-LD, see #124 (comment). What stands in the way are the JSON-LD restrictions or, to be more precise, the RDF restrictions...

@BigBlueHat
Member

@iherman that looks a lot like the format we made for Web Annotation (which is JSON-LD): https://www.w3.org/TR/annotation-model/#example-4

{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "http://example.org/anno5",
  "type": "Annotation",
  "body": {
    "type" : "TextualBody",
    "value" : "<p>j'adore !</p>",
    "format" : "text/html",
    "language" : "fr"
  },
  "target": "http://example.org/photo1"
}

What am I missing here? 😃

@HadrienGardeur
Member Author

@iherman you're 100% right that this is a JSON-LD issue rather than a JSON issue.

Here's an example in pure JSON:

"title": [
  {"language": "fr", "value": "Vingt mille lieues sous les mers"},
  {"language": "en", "value": "Twenty Thousand Leagues Under the Sea"},
  {"language": "ja", "value": "海底二万里", "direction": "ltr"}
]

This example goes beyond what EPUB 3.x supports:

  • we're not limited to 2 languages, we can include as many alt representations as we need
  • the direction can be expressed on the alt representation as well

We just need to figure out how we could avoid parsing direction in the JSON-LD context.

@HadrienGardeur
Member Author

@iherman I just tried the following example in JSON-LD playground and it works fine:

{
  "@context": {"title": "http://schema.org/name"},
  "title": [
    {"@language": "fr", "@value": "Vingt mille lieues sous les mers"},
    {"@language": "en", "@value": "Twenty Thousand Leagues Under the Sea"},
    {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}
  ]
}

It ignored dir but the RDF output looks fine.

@iherman
Member

iherman commented Jan 31, 2018 via email

@iherman iherman closed this as completed Jan 31, 2018
@HadrienGardeur
Member Author

@iherman we already redefine "@id" in the RWPM default context, but that's entirely for cosmetics.

We could also use "@id", "@value" and "@language" as-is and everything would work fine.

Have you closed this issue on purpose?

@BigBlueHat
Member

Wrong button @iherman. 😄

Also, introducing two disparate processing models is likely a Bad Thing.

@HadrienGardeur there is a JSON-LD WG in the offing (we hope!), so now would be a great time for stating the need for text direction expression to the JSON-LD Community Group. Specifically, send an email to this mailing list https://lists.w3.org/Archives/Public/public-linked-json/

Beyond what might be available, the way the Web Annotation WG did it still seems viable, and could do with some proper consideration.

@iherman
Member

iherman commented Jan 31, 2018

Wrong button indeed, sorry :-(

@iherman
Member

iherman commented Jan 31, 2018

> @iherman we already redefine "@id" in the RWPM default context, but that's entirely for cosmetics.
>
> We could also use "@id", "@value" and "@language" as-is and everything would work fine.

I am not sure that would work, worth a try with the json-ld playground. I think redefining "@value" is just cosmetics in this sense, it would not make us avoid the problem in #124 (comment).

@HadrienGardeur
Member Author

@BigBlueHat two separate processing models? Do you mean string, array or object for each metadata like in RWPM?

It does introduce a bit of extra processing, but nothing compared to the processing of the default reading order (I'm working on this currently for the draft and it's much much worse than anything we're discussing here).

@iherman I tried both options (keeping "@value" and "@language" as is, or redefining them in the context) and they both work fine in the JSON-LD playground.

@iherman
Member

iherman commented Jan 31, 2018

@HadrienGardeur: redefining "@value" and "@language" of course works. But the following does not:

{
  "@context": {
    "language": "@language",
    "value": "@value",
    "direction": "http://ex.org/direction",
    "title": "http://ex.org/title"
  },
  "title": {
    "value": "something",
    "language": "en",
    "direction": "ltr"
  }
}

on playground this leads to the error message:

jsonld.SyntaxError: Invalid JSON-LD syntax; an element containing "@value" may only have an "@index" property and at most one other property which can be "@type" or "@language".

@HadrienGardeur
Member Author

@iherman don't include anything about the direction in the "@context" and it works fine in JSON-LD playground:

{
  "@context": {
    "title": "http://schema.org/name", 
    "value": "@value", 
    "language": "@language"
  },
  "title": [
    {"@language": "fr", "@value": "Vingt mille lieues sous les mers"},
    {"@language": "en", "@value": "Twenty Thousand Leagues Under the Sea"},
    {"@language": "ja", "@value": "海底二万里", "dir": "ltr"}
  ]
}
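
The different outcomes of the two playground experiments can be modeled roughly as follows. This is a deliberately simplified Python sketch, not the JSON-LD expansion algorithm: it assumes that a key survives only if it is a keyword or a term defined in the context, and then applies the rule quoted in the playground error message. The function name is hypothetical, and features such as `@vocab` or compact IRIs are ignored.

```python
VALUE_KEYWORDS = {"@value", "@language", "@type", "@index"}


def expand_value_object(obj, context):
    """Simplified model of how a processor treats the keys of a value object."""
    expanded = {}
    for key, val in obj.items():
        # Keywords pass through; other keys are looked up in the context.
        target = key if key in VALUE_KEYWORDS else context.get(key)
        if target is None:
            continue  # unmapped key (e.g. "dir"): silently dropped
        expanded[target] = val
    if "@value" in expanded:
        extra = set(expanded) - {"@value", "@index"}
        # Besides "@index", at most one of "@type" or "@language" is allowed.
        if not (extra <= {"@type", "@language"}) or len(extra) > 1:
            raise ValueError("invalid value object: " + ", ".join(sorted(extra)))
    return expanded
```

Under this model, mapping `direction` to an IRI makes it a real property and triggers the error, while an unmapped `dir` is dropped before the check, which also means it never reaches the RDF output.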

@plinss
Member

plinss commented Jan 31, 2018

Please see https://w3c.github.io/string-meta/ and coordinate needs with i18n rather than invent something new in isolation here (apologies if that conversation is already happening).

@HadrienGardeur
Member Author

@plinss thanks for pointing that out, we should definitely reach out to them.

I just did a quick scan of the document, and one of the best-practice examples is almost exactly what I've proposed: https://w3c.github.io/string-meta/#bestPractices

@iherman what would be the best way to coordinate with that group?

@iherman
Member

iherman commented Feb 1, 2018

@HadrienGardeur

> @iherman don't include anything about the direction in the "@context" and it works fine in JSON-LD playground

Oh yes, you are right, I forgot about this trick. However, it is a trick which means that the resulting RDF metadata will not be proper. But we may have no other solution.

Personally, I am fine if, for the time being, this is the way we go, but I would think that @BigBlueHat and I will have to raise this issue at the (hopefully upcoming) JSON-LD WG to see if there is a better solution.

@iherman
Member

iherman commented Feb 1, 2018

@plinss I think all the approaches we have been discussing here are in line with that document. The problem is that the document you refer to does not deal with the fact that direction cannot be represented in RDF, which means it cannot (properly) be done in JSON-LD either.

That being said, the RDF issue should be indeed solved outside this group, too.

@HadrienGardeur
Member Author

I've extracted a few points from the best practice section of https://w3c.github.io/string-meta/

  • Each localizable string of metadata should be turned into a "Localizable" dictionary that can contain a language and/or a direction
  • We may provide a default language and direction
  • The name @language is RECOMMENDED as the name of the default language value and @dir as the default direction value.
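
As a rough Python illustration of these points, the sketch below resolves a "Localizable" item against document-level defaults. The item keys `value`, `language` and `direction` follow the pure JSON example earlier in this thread, and `@language`/`@dir` are the default names mentioned above; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Localizable:
    value: str
    language: Optional[str] = None
    direction: Optional[str] = None


def resolve(item, defaults):
    """Fill in a missing language or direction from the defaults."""
    return Localizable(
        value=item["value"],
        language=item.get("language", defaults.get("@language")),
        direction=item.get("direction", defaults.get("@dir")),
    )


defaults = {"@language": "en", "@dir": "ltr"}
title = resolve({"value": "海底二万里", "language": "ja"}, defaults)
```

Here the item keeps its own language and inherits only the direction from the defaults.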

The document also considers that the best practice is to use a language map + Localizable dictionary, which IMO is a little problematic:

  • there's redundancy between the key of the language map and the language in the object
  • I don't see how the example that they provide can work with JSON-LD; it returns an error in the JSON-LD Playground

@iherman
Member

iherman commented Feb 1, 2018

@HadrienGardeur, I have raised an issue in the JSON-LD CG and also commented on the string-meta document.

I would propose that, at this point:

  1. We should modify the draft to make clear that every metadata item should be "localizable", to use the terminology of the string-meta document. (That needs a PR on the draft, which I will come up with at some point.)
  2. If we do use JSON-LD, we temporarily use the trick you showed above, in the hope that JSON-LD 1.1 will eventually make it official.

@BigBlueHat @plinss

@HadrienGardeur
Member Author

I also created an additional issue at w3c/string-meta#13

@HadrienGardeur
Member Author

I will also update the lifecycle branch to include the Localizable dictionary in WebIDL, but this means that we can't rely on the ES to WebIDL dictionary algorithm for metadata.

@llemeurfr
Contributor

There is still an aspect that is poorly covered by Unicode (bidi controls): mixing ltr and rtl scripts in a single string in a data format like JSON.
Such a representation is rare, and it is difficult to author, to manage in a database, and to display in applications (e.g., using native code). But it's fairly easy to manage as HTML, as we said.

I suggest that each piece of metadata (even the title) be expressed in a single language and direction. A more complex expression of the information can be given as HTML in the content itself, which should be largely enough.
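
To gauge how often the mixed case actually occurs in a catalogue, a tool could flag strings that contain strongly typed characters of both directions. The Python sketch below uses `unicodedata.bidirectional` from the standard library; the function names are hypothetical, and the check looks only at strong character classes rather than running the full bidi algorithm.

```python
import unicodedata


def strong_directions(text):
    """Return the set of strong directions ("ltr", "rtl") found in text."""
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            found.add("ltr")
        elif bidi_class in ("R", "AL"):
            found.add("rtl")
        # Neutral and weak classes (spaces, digits, punctuation) are ignored.
    return found


def mixes_directions(text):
    """True when a single string contains both ltr and rtl strong characters."""
    return strong_directions(text) == {"ltr", "rtl"}
```

Strings flagged this way are the ones that a single language and direction per item would not fully describe.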

@iherman
Member

iherman commented Mar 2, 2018

Propose closing: this is now part of the latest draft (per #129). The JSON serialization may be tricky, but this should be looked at, when the time comes, via separate issues...

@iherman
Member

iherman commented Mar 13, 2018

@iherman iherman closed this as completed Mar 13, 2018