Setting the (text) direction #220
Comments
See also the separate discussion on the JSON-LD 1.1 CG: json-ld/json-ld.org#583 |
Another reference: https://w3c.github.io/string-meta/ |
Do you mean Unicode Bidi? http://unicode.org/reports/tr9/ |
More specifically, see json-ld/json-ld.org#583 (comment) : which referred to this. The following discussion (which was, as far as I am concerned, inconclusive) gave some pros and cons of that approach. Note that the JSON-LD CG decided to defer that issue to the JSON-LD WG, which has just been formed; I hope that the discussion will restart with some more people involved (e.g., Schema.org people as well). We may want to defer this issue to see where that discussion goes. |
@TzviyaSiegman reminded me that there is another approach that is perfectly viable, namely use the HTML datatype. What this means in practice is that, if a text has bidirectional issues, it could use HTML syntax and the result would be considered to be a string of an HTML datatype in RDF parlance. Here is what it would mean in JSON-LD:
(The trick is to ensure that the character '5' appears on the right-hand side of the Hebrew text. If the span is not used, the number will be treated as if it were part of the Hebrew text and will appear on the left of it!) From an internationalization point of view, that is much better, because it gives better control. We could therefore say that, for example for the term
|
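A minimal sketch of what an HTML-typed literal could look like, built as a Python dict and serialized to JSON-LD. The `name` property, the Hebrew phrase (borrowed from the validator example in this thread), and the exact markup are illustrative assumptions, not the example originally posted; only `@value`, `@type`, and the `rdf:HTML` datatype IRI come from JSON-LD and RDF themselves.

```python
import json

# Datatype IRI for HTML literals in RDF.
RDF_HTML = "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"

# Hypothetical manifest fragment: the Hebrew run is wrapped in a span with
# an explicit direction so that the trailing digit stays outside of it.
doc = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": {
        "@value": '<span dir="rtl" lang="he">פעילות הבינאום</span> 5',
        "@type": RDF_HTML,
    },
}

# ensure_ascii=False keeps the Hebrew readable in the serialized output.
serialized = json.dumps(doc, ensure_ascii=False, indent=2)
print(serialized)
```

Whether a given consumer accepts such a value is a separate question, as the replies about schema.org below show.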
When written as:
the Google structured data tester seems to validate it. The following HTML5 document is valid in https://validator.w3.org/:
<title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
We find the phrase 'פעילות הבינאום' 5 times on the page. |
@laudrain oops:-) That was my mistake. But then... this looks good as a solution for direction. However, you as a potential author: how would you like it? |
I like it. Taking an example from EPUB 3.1 spec[1] with a Japanese name:
Is this correct? Even possible? [1] http://www.idpf.org/epub/31/spec/epub-packages.html#sec-shared-attrs |
@laudrain which checker did you use? I just tested something that is based on what you ask on https://search.google.com/structured-data/testing-tool and I get an error: |
(The previous example is accepted by the JSON-LD playground...) |
It is the same tool, @laudrain. However, try to add a |
https://schema.org/name is only defined as https://schema.org/Text, so it can't contain HTML. Sorry folks. |
End of the game ? |
I could argue that "text", at least in RDF land (though it is called "Literal"), may have a datatype, and this is all that the HTML stuff does, but... me arguing does not make any sense, obviously. Sigh... |
@laudrain rather, back to square one. Putting Unicode directional control characters into the text works; see the examples in the Activity Streams spec. It is just ugly and may create problems with search. |
Why problems with search? The characteristics of this code should prevent them: |
@laudrain I think (and I am a bit on a slippery slope, because I am not an expert on these things) the problem is that search (or a query in a database) is based on comparing Unicode code points, and it is way too easy to make the mistake of giving a search term that does not include those extra characters. That may be the issue. This is certainly the case when doing a database search in a graph database (e.g., using SPARQL). |
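To make the concern concrete, here is a small sketch of a code-point comparison failing once a directional isolate is embedded in the stored value. The `strip_bidi_controls` helper is hypothetical (it is not part of any spec discussed here); the code points listed are the Unicode bidi formatting characters.

```python
# Unicode bidi formatting characters: ALM, LRM, RLM, the embedding and
# override controls (U+202A..U+202E) and the isolate controls (U+2066..U+2069).
BIDI_CONTROLS = {
    0x061C, 0x200E, 0x200F,
    *range(0x202A, 0x202F),
    *range(0x2066, 0x206A),
}


def strip_bidi_controls(text: str) -> str:
    """Return text with all bidi formatting characters removed."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)


phrase = "פעילות הבינאום"
stored = "\u2067" + phrase + "\u2069"  # value as authored: RLI ... PDI
query = phrase                          # what a user would type

print(stored == query)                       # False
print(strip_bidi_controls(stored) == query)  # True
```

A store that compares raw code points would need this kind of stripping on both the indexed value and the query; whether a given SPARQL engine does so is not something this sketch claims.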
I am not a specialist, but my understanding is that "text search" typically operates on multiple layers of abstraction over Unicode code units and/or code points, in ways that are quite domain-specific. Typically, both the query and input strings need to be normalized (language-specific handling of accented characters, punctuation removal, etc.) and are subject to further heuristic interpretation (conjugation, synonyms, logical combinators, etc.) |
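As a rough illustration of the normalization layer mentioned above, the sketch below builds a comparison key with Python's standard `unicodedata` module. The `search_key` name and the choice of NFKD plus case folding are assumptions for the example; real search stacks apply far more language-specific processing.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a crude, accent- and case-insensitive comparison key."""
    # Decompose characters so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lower() meant for caseless matching.
    return without_marks.casefold()


print(search_key("Éditions") == search_key("editions"))  # True
```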
For language direction, this one seems ok:
{
  "@context":"http://schema.org",
  "@type":"Book",
  "author": {
    "@type":"Person",
    "name": "Haruki Murakami",
    "alternateName": "\u2067村上 春樹"
  }
}
but lacks the language tag. |
Yes, that should work, but the missing language tag is a problem (hopefully we can sort that out with schema.org).
The question is: how much of a drag is it for authors to add the \u2067? @TzviyaSiegman may be in a better position than others to answer this: the problem really arises when there is a mixture of left-to-right and right-to-left scripts; otherwise those tags are not necessary. (E.g., in the example above, the 'ltr' flag is not really necessary for the Japanese name of Murakami, because the Japanese characters convey that information by default in Unicode).
|
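One way to reduce the authoring burden would be a tiny helper that adds the isolate characters, sketched below. The `isolate` function and its `direction` argument are hypothetical; the code points are those defined by Unicode, where U+2066 is the left-to-right isolate (LRI), U+2067 the right-to-left isolate (RLI), U+2068 the first-strong isolate (FSI), and U+2069 the closing pop directional isolate (PDI).

```python
import json

LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a directional isolate and close it with PDI."""
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    return opener + text + PDI


# Hypothetical fragment modeled on the example above.
person = {
    "@type": "Person",
    "name": "Haruki Murakami",
    "alternateName": isolate("村上 春樹", "ltr"),
}

print(json.dumps(person, indent=2))
```

The sketch closes the isolate with PDI, which the example above omits; that is a choice made here for tidiness, not something the thread settled.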
I may be offbeat, but I feel that using some alternateName for xlang properties is an issue. Why would one language be the primary one and the others subsidiary, as 'alternate' suggests in practice? Also, for property values in one single language (I am NOT talking about strings using a mix of LTR and RTL), don't you think that the language attribute is enough for what UAs have to do, i.e. filter the proper variant and display the value? |
@llemeurfr we have not addressed this alternateName issue at all so far, this is only to explore the I18N issues...
It depends on what we expect from the UA for alternate names which, again, we have not discussed so far. But I believe you are right on a more general level: having the language information available is a necessity. |
@iherman, yes, my main question here is: what would be the practical use of a direction attribute for property values that are in a single language? I think the answer is none. After this is settled, if we find a way to express values containing a mix of LTR & RTL using Unicode bidi characters or any other markup, fine. |
I believe that is correct.
Again, that is correct. See #219 .
Correct again. We can indeed put forward a resolution whereby we rely on the Unicode bidi characters like the Activity Streams Recommendation does: the advantage is that we can adopt it right away and we do not hit any obstacle with JSON-LD and/or Schema.org. The disadvantage is that it is a bit complex to author the metadata... There are some people in the group who may have some experience with authoring mixed setups; it would be good to hear whether that approach could work... |
Looks good. I think we need to provide some explanation around "und". We could take that from Activity Streams too. |
I'm not a fan of the names proposed for these properties ("defaultTextDirection" and "defaultTextLanguage"), but this is a bikeshedding detail that can be treated later. |
Not to bikeshed, but for a bit of brevity could we just use textDirection and textLanguage? Default-iness can be determined from the description. Otherwise, looks fine to me. |
I'm certainly not bound to those names. |
Unfortunately, I realized that I have fallen into a trap, and the proposed solution for the default direction is not really clean:-( The problem is with the semantics of what JSON-LD/Schema.org really expresses. In general, when we have, in the manifest, something like
What that means, in English, is that
Ie, every statement is something we say about the publication with the identifier (or address). However, when we have a statement like Expressing all this properly, though possible, would involve other notions in JSON-LD (i.e., Datasets) that are (a) probably too complicated for most of our users/readers and (b) probably not understood by the schema.org processors. We should not go down that route, imho. Sigh. I can see two approaches:
Under the adage that usability and authors'/users' interest has a higher priority than theoretical purity, I am mildly in favor of (2) above. But if we do that, we have to realize what is happening, ie, that we are cheating... (My apologies not to have realized this when I made the proposal.) |
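One way to picture the problem: each top-level key of the manifest becomes a statement whose subject is the publication, so a direction property ends up describing the publication rather than the manifest's own text. The sketch below is a deliberate simplification (it is not a JSON-LD processor), and the manifest, its `@id`, and the `textDirection` key are hypothetical.

```python
def statements(manifest: dict):
    """Yield (subject, property, value) for each top-level manifest entry."""
    # The subject of every statement is the publication itself.
    subject = manifest.get("@id", "_:publication")
    for key, value in manifest.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords are not statements
        values = value if isinstance(value, list) else [value]
        for item in values:
            yield (subject, key, item)


manifest = {
    "@context": "https://schema.org",
    "@type": "Book",
    "@id": "https://example.org/book",
    "name": "Moby Dick",
    "textDirection": "ltr",  # hypothetical property name from this thread
}

for statement in statements(manifest):
    print(statement)
```

Read this way, the `textDirection` line says "the book has text direction ltr", which is not quite the same claim as "the strings in this manifest are ltr".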
EPUB does allow the default directionality to be specified through the dir attribute on the package element. You can also override it on each text-carrying element. The problem with minting stuff ourselves is that we'll be stuck supporting it for as long as the format exists. It might be useful to add our own solution and highlight it as an issue we need feedback on in the next working draft. |
@mattgarrish that is fine. Unless there are major objections you should then add a note to the draft (maybe also referring to the problems outlined above) and merge to the main branch... (I’m on vacation for 10 days, I won’t do it now...) |
Reading the https://w3c.github.io/wpub/#language-and-dir section with fresh eyes, I feel that we'll face a huge misunderstanding of what these 2 properties are for, from implementers. So I would rather suppress the whole section and state that the language of the metadata will be inferred from the language of the book itself (i.e. the content), unless specified on the metadata value itself. This is short and pragmatic (the border between content and metadata is thin). And we must acknowledge that there is no perfect solution today on the Web (and in JSON-LD) for expressing the base direction of metadata values in edge cases, therefore we'll stick with https://w3c.github.io/string-meta/ recommendations and JSON-LD specification. |
@llemeurfr, I just want first to have a clear understanding of what you propose. Is it so that:
Provided this is indeed what you propose, my 2 cents:
|
@llemeurfr is it o.k. if I prepare a separate draft (not necessarily a PR yet) that is based on the idea that the language/dir is inherited from the primary entry page, and we can then look at that? Thinking about it further since yesterday this may be a much better option indeed, with the least of the semantic issues... If you are fine, I can try to do this before our call on Monday. Cc @mattgarrish |
@iherman this is not what I have in mind. I'll try to express it in a clearer manner:
nb: I would be against point 1 in your list, the inference is too remote. |
We never did resolve that issue - how epub uses dc:language for the publication and xml:lang for the package metadata values. If we require that the first language code listed be the default language of the publication and manifest values (i.e., the property is either a single value or an array of values), then it probably makes as much sense as any other approach for now. |
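The "first language listed is the default" rule could be sketched as follows. The `default_language` helper is hypothetical, and falling back to "und" when nothing is declared is an assumption borrowed from the earlier remarks about "und" in this thread.

```python
def default_language(in_language) -> str:
    """Pick the default language from a single value or an array of values."""
    if in_language is None:
        return "und"  # assumption: undetermined when nothing is declared
    if isinstance(in_language, str):
        return in_language
    # For an array, the first entry is taken as the default.
    return in_language[0] if in_language else "und"


print(default_language("fr"))          # fr
print(default_language(["fr", "en"]))  # fr
print(default_language(None))          # und
```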
@llemeurfr that is indeed radically different, just as I got to like 'inheriting' the language/dir settings from the HTML level...:-) However... I see a serious problem with what you propose. You give a primary role in setting the language for the manifest. However, that information will be invisible to vanilla (i.e., not WP aware) browsers. This means that the language for the real (HTML) content will be considered as "und" unless the language is set on an HTML element as well. A source of redundancy. And then, of course, we may have an issue if the two are in conflict: English is set in the manifest and French in the content. What happens then? Unfortunately, for me, that is a serious flaw and I would not be in favour of that approach... I would actually argue for what I thought you had convinced me about:-): The case of the embedded manifest is particularly attractive: the language and direction is set on the, say, It is indeed a bit more 'distant' in the case of a separate manifest file but, there again, we could say that the language and dir on the In both cases the advantage is that a vanilla browser understands the language setting from the HTML, i.e., there will be no possible discrepancy in the rendering. That is a major plus. (And is better than the current draft, actually!) |
@iherman I consider it required to set the language on each HTML resource individually, as it is the practice on the Web. Voice engine and other tools will make good use of it. @mattgarrish I agree that it should be the first language value, as Jiminy advocated in its internationalization paper. |
Yes, the language specified in the manifest is not used to set the language of the resources, just as it isn't in EPUB. It's there to provide context. The usual examples are to preload tts languages, offer to download dictionaries, etc. |
... a very good editorial note to add to the spec of this language property. |
But isn't that against what you propose, @llemeurfr? The language specified in the manifest is, in your proposal, considered to be the language of the content, too. I.e., it does (much) more than setting the text in the context... Even if we consider the possible conflict as a negligible issue I think that we would introduce a source of further confusion. And, per @mattgarrish
ie, what you propose would be the contrary of what EPUB does... |
@iherman no, it's "a language of the publication" and by inference also the default language of descriptive metadata if in first position in a list. If there is only one publication language and it's not what the UA finds when getting the language of an HTML resource, there is an editorial discrepancy. But so what? |
I am trying to see what you propose (putting aside how this should be edited into the document).
An alternative to (3) is that we do introduce our own term for Does this reflect your proposal? If so, we do have two fairly distinct proposals to (finally) close this issue: this one, and the one I described in #220 (comment) |
@iherman, items 1,2 and 3 reflect my position, yes (thank you for pointing at inLanguage). Re. the alternative to 3 you're proposing, my issue is that I don't know what a |
It is the same as the |
Indeed, because in EPUB-land, some people assume that you only have to set the one in the manifest and you’re good to go. And resources are then missing |
Actually, @llemeurfr (and others): there may be a discrepancy between the cases when the JSON-LD is embedded via a Indeed, when it is an embedded resource, there are some general questions on what the JSON-LD "inherits" from its surroundings. I raised the issue a while ago in the JSON-LD WG on what the base URL is for embedded JSON-LD (which is a question of relevance for WP manifest, too) and, though it seems logical that the document URL is the one, this is not strictly defined in JSON-LD 1.0 (hopefully it will be for JSON-LD 1.1, taking into account that embedded JSON-LD is the format understood by schema.org processors). "Inheriting" the default language would fall in the same category. In other words, in the case of an embedded URL the "inheritance" from the primary entry page seems to be the natural move. Could there be a small difference between the two? ie,
WDYT? |
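For the embedded case, "inheriting" from the entry page could look roughly like the sketch below, which uses Python's standard `html.parser`. The page content, the `EntryPageScanner` class, and the resolution order (manifest value first, then the HTML `lang`, then "und") are all assumptions made for illustration; the thread does not settle on any of them.

```python
import json
from html.parser import HTMLParser


class EntryPageScanner(HTMLParser):
    """Collect lang/dir from <html> and the text of an embedded manifest."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.dir = None
        self.manifest_text = None
        self._in_manifest = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html":
            self.lang = attributes.get("lang")
            self.dir = attributes.get("dir")
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            self._in_manifest = True
            self.manifest_text = ""

    def handle_data(self, data):
        if self._in_manifest:
            self.manifest_text += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_manifest = False


# Hypothetical primary entry page with an embedded manifest.
page = (
    '<html lang="fr" dir="ltr"><head>'
    '<script type="application/ld+json">'
    '{"@type": "Book", "name": "Le Petit Prince"}'
    "</script></head><body></body></html>"
)

scanner = EntryPageScanner()
scanner.feed(page)
scanner.close()

manifest = json.loads(scanner.manifest_text)
# Assumed resolution order: explicit manifest value, then HTML, then "und".
language = manifest.get("inLanguage") or scanner.lang or "und"
direction = scanner.dir or "auto"
print(language, direction)
```

A detached manifest has no surrounding `<html>` element to scan, which is exactly the embedded/detached asymmetry raised in the next comment.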
There is a danger if the behavior of a "detached" manifest is different from the behavior of an "embedded" manifest. A manifest should be attachable/detachable with no modifications. |
The Working Group just discussed
The full IRC log of that discussion
<dauwhe> Topic: publishing new draft
<dauwhe> ... we have a few open issues
<tzviya> https://github.com//issues/261
<dauwhe> Github: https://github.com//issues/261
<dauwhe> ... this is cover vs cover-image
<dauwhe> ... look at last comment from Matt
<dauwhe> ... we concerned about the infoset
<dkaplan3> q+
<dauwhe> ... Matt says we should be concerned with language
<dauwhe> ... so we're just discussing changing language
<tzviya> ack dkaplan3
<dauwhe> ... should we say cover or cover image or cover page
<tzviya> ack dkaplan3
<dauwhe> dkaplan3: the one thing that has happened in github
<ivan> zakim, who is here?
<Zakim> Present: dauwhe, ivan, tzviya, wolfgang, Juan_Corona, jbuehler, Avneesh, JuanCorona, wendyreid, dkaplan, laudrain, JunGamo, Hadrien, makoto, jpyle, josh, gpellegrino, George,
<Zakim> ... BenWaltersMS, Franco, caitlingebhard, laurentlemeur, duga, marisa
<Zakim> On IRC I see marisa, derekjackson, ReinaldoFerraz, lsullam, rkwright, duga, laurentlemeur, Franco, caitlingebhard, BenWaltersMS, Makoto, josh, cmaden2, Hadrien, JunGamo, EvanOwens,
<Zakim> ... laudrain, wendyreid, JuanCorona, jbuehler, George, Karen, dkaplan3, Avneesh, RRSAgent, Zakim, ivan, wolfgang, dauwhe, tzviya, plinss, Rachel, github-bot, astearns, bigbluehat,
<dauwhe> ... the people who wanted a discrete cover page
<Zakim> ... jyasskin
<dauwhe> ... I think the people in github would be fine with cover image
<dauwhe> ... when I gave the whole "here are some guidelines" thing
<dauwhe> ... I think people bring up stuff that doesn't need to be in the infoset
<harriett> +
<dauwhe> ... it's fine to document these extra things
<tzviya> q?
<tzviya> ack dkaplan
<dauwhe> ... so we should go back to the github issue later
<dauwhe> ... I think my comment addressed everything except for the infoset Q about a cover that is not a cover image
<dauwhe> tzviya: perhaps we can open a new issue
<dauwhe> ... the proposal that you had, can you sum it up?
<dauwhe> dkaplan3: my proposal for infoset purposes
<dauwhe> ... I was going based on the assumption that because
<dauwhe> ... Ivan reminded us that at the F2F there needed to be the idea of a cover, that might not be image
<dauwhe> ... I don't think we need both cover and cover-image
<josh> q+
<dauwhe> ... but if people feel strongly about a cover that is not an image that still needs to be in the infoset
<dauwhe> ... the reason people want cover images in infoset is for shelf view, etc
<dauwhe> ... that reasoning doesn't apply to a cover
<laurentlemeur> q+
<tzviya> ack josh
<dauwhe> ... will anyone go to bat for needing a non-image cover IN THE INFOSET
<dkaplan3> q+
<dauwhe> josh: I would make a strong case for a cover that's not an image because not all content includes imagery
<dauwhe> ... just point to something, and if it's an image then great, if not they could render the html
<dkaplan3> Josh: see https://github.com//issues/261#issuecomment-406696836
<dauwhe> ... for scholarly articles, the cover would be title/author/ journal / issue
<dkaplan3> This comment specs out all of that.
<dauwhe> tzviya: that's already been mentioned in an issue
<dauwhe> josh: but there are 70 comments
<tzviya> ack laurentlemeur
<dauwhe> ... I don't think we should have both a cover and cover image
<dauwhe> laurentlemeur: we should close issue by saying we define cover-image
<dauwhe> ... discuss elsewhere if we need another type of cover
<dauwhe> ... user agents could assemble image from metadata, wouldn't need html
<dauwhe> tzviya: josh proposed just cover
<dauwhe> ... the publisher can include image OR text in html
<dauwhe> ... the user agent would do some magic to display
<tzviya> ack dkaplan
<dauwhe> laurentlemeur: I think the magic to display HTML is more than magic to assemble from metadata
<dauwhe> dkaplan3: I put a link to my github comment
<dauwhe> ... for later, when we are writing recs for what UAs should do
<dauwhe> ... we will need to have guidelines for what to do when you don't have a cover
<dauwhe> ... I'm happy with not having both
<dauwhe> ... the diff between Laurent and Josh
<dauwhe> ... in the absence of an image, do we recommend the UA wants to extract metadata and make cover?
<dauwhe> ... or do we think UAs should tried to define a text cover somehow
<dauwhe> ... I would go with Laurent
<dauwhe> ... it's a standard practice now that you get title/creator in shelf view if there's no cover
<wendyreid> q+
<dauwhe> ... if your business case is that it's important to have specific information on the cover, then you should probably actually create an image
<tzviya> ack wendyreid
<dauwhe> wendyreid: from experience with Kobo, that's what we do
<dauwhe> ... if we have image we use it
<ivan> q+
<dauwhe> ... if there's no image file, then we create cover with metadata
<dauwhe> ... that's very standard
<tzviya> ack ivan
<dauwhe> ivan: I don't understand the controversy
<dauwhe> ... I thought Josh's proposal was fine
<dauwhe> ... there's a cover, if you put image there you get image, if HTML is there you render that
<dauwhe> ... and in the scholarly world, title and author might not be enough
<dauwhe> ... you need standard metadata
<josh> +1 to Ivan expressing my business case better than I did.
<laurentlemeur> q+
<dauwhe> ... I don't understand the problem
<tzviya> ack laurentlemeur
<dauwhe> tzviya: a reminder that we're only talking about the infoset
<dauwhe> laurentlemeur: if we follow josh, it means every UA will have to be able to take an arbitrary HTML file or something else and try to make a cover out of it
<josh> q+
<dauwhe> ... this puts a burden on user agents
<tzviya> ack josh
<dauwhe> josh: I think that UAs should do what they think is best
<dauwhe> ... using this approach, you have something called a cover that points to image or file
<dauwhe> ... if UA doesn't know how to turn html into shelf-view icon, it can still use metadata
<dauwhe> ... we should provide as much guidance as possible to UA, then let UA choose
<George> q+
<tzviya> ack George
<dauwhe> tzviya: we might need to decide to publish without this
<dauwhe> George: just an image is too limiting in terms of looking at the future
<dauwhe> ... I see discussions about VR and innovation in the book space
<dauwhe> tzviya: [1] include cover which could be anything or [2] just a cover image ?
<laurentlemeur> 2
<dkaplan3> 2
<JuanCorona> 1
<Hadrien> 2
<ivan> 1
<tzviya> 1
<caitlingebhard> 1
<josh> 1
<wendyreid> 1
<derekjackson> 1
<wolfgang> 1
<Franco> 1
<rkwright> 2
<gpellegrino> 2
<laudrain> 2
<George> 1 cover
<lsullam> 1
<clapierre> 1
<marisa> 1
<jbuehler> 1
<MustlazMS> 1
<rkwright> 1
<dauwhe> tzviya: I see more 1s than 2s
<ivan> q+
<rkwright> I inadvertently entered a 2.
<tzviya> ack ivan
<dauwhe> ivan: I would propose to put there the more permissive approach 1, publish a draft (which isn't final)
<dauwhe> ... and see what the community has to say
<dauwhe> ... we don't have unanimity
<dauwhe> ... this is just a draft
<dauwhe> ... easier to restrict early
<tzviya> https://github.com//issues/220
<dauwhe> github: end topic
<dauwhe> github: https://github.com//issues/220
<dauwhe> ivan: we had quesiton of direction ltr rtl
<dauwhe> ... trouble expressing in JSON
<dauwhe> ... consensus in discussion that rtl/ltr we don't have other means than fallback on unicode directional markers
<dauwhe> ... and no explicit default direction setting
<dauwhe> ... i think laurentlemeur we have agreement
<dauwhe> laurentlemeur: yes
<dauwhe> ivan: a more general issue came up
<dauwhe> ... related to the language setting
<dauwhe> ... there are 2 different things
<dauwhe> ... 1. if I set language in manifest in one of schema.org terms inLanguage
<dauwhe> ... this means I set language for publication at large + text of manifest
<dauwhe> ... individual resournces may not set their own languages, there may be discrepencies
<dauwhe> ... 2 the other appraoch is more complicated
<dauwhe> ... when the manifest is embedded in html
<tzviya> q+
<dauwhe> ... looking at that case it would be logical that the script element with manifest inherits language and dir of entry page
<wolfgang> s/appraoch/approach/
<dauwhe> ... so if it's part of html then I could and maybe should refer to what HTML does
<dauwhe> ... we still say that if you do it that way you're talking about the publication as a whole
<dauwhe> ... when we have an embedded manifest, do we inherit the HTML settings?
<dauwhe> ... or not?
<dauwhe> ... an additional argument... this is one thing from HTML structure that we will inherit
<dauwhe> ... this is the base URL
<tzviya> ack tzviya
<laurentlemeur> q+
<dauwhe> tzviya: I put myself on the queue
<dauwhe> ... the schema.org group is aware they have language issues, but they're trying to work it out
<dauwhe> ... they know language on particular tags is a problem
<dauwhe> ivan: yes, the setting of a langague for an individual text is already there
<dauwhe> ... we hope schema.org handles it eventually
<wolfgang> s/langague/language/
<dauwhe> ... the json-ld working group, partly on my instigation, is looking at issues of embedded json-ld
<dauwhe> ... it was non-normative in 1.0
<dauwhe> ... for example, there's no resolution on inheriting baseURL
<tzviya> ack laurentlemeur
<dauwhe> ... I think that will be resolved in 1.1
<dauwhe> laurentlemeur: in fact, here we are trying to do 2 things
<dauwhe> ... the language of publication is descriptive metadata
<dauwhe> ... if we want to infer the language of manifest, that's simple
<dauwhe> ... we can do that in json-ld with @language in context
<dauwhe> ... so we are trying to simplify work of authors
<dauwhe> ... by inferring from publication language
<dauwhe> ... and maybe we should inherit from html if manifest is embedded
<dauwhe> ... but that makes processing of detached and embedded manifests different
<dauwhe> ... this is why we should infer language from publication language
<dauwhe> tzviya: what q are we answering?
<dauwhe> ivan: the current text needs to be rewritten
<dauwhe> ... at least for embedded version there are two ways of rewriting
<dauwhe> ... if you want detachable things then use @language
<dauwhe> ... we have two difficulties, i agree with this one
<dauwhe> ... if we ignore surrounding html we will end up defining something which is not aligned with how json-ld is used in html
<dauwhe> ... I don't know which one is a bigger danger
<dauwhe> tzviya: the Q is, whether we use something we are sure will work detached or embedded, but might overwrite default/be in conflict with processors
<dauwhe> ivan: I dont think this is correct
<dauwhe> .... if you use @language it works everywhere
<dauwhe> ... if I don't put anything in JSON_LD or any other thing, what happens then?
<dauwhe> ... in one case a language might be inherited, in the other case not
<laurentlemeur> q+
<tzviya> ack laurentlemeur
<dauwhe> ivan: @language is so far away from the standard syntax for authors; extending a context is very difficult to follow
<dauwhe> laurentlemeur: the context line in json-manifest should be copy/paste; should not be edited
<dauwhe> ... it's not about metadata, not about structure
<dauwhe> tzviya: can we come to consensu?
<dauwhe> ivan: we should do a PR knowing there are issues, and we don't know
<dauwhe> ... I can try to write up the more complex situation and see where it goes
<dauwhe> laurentlemeur: let's try to write it
<dauwhe> ivan: I will come up with a PR, hopefully this week |
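The "`@language` in context" option mentioned in the log can be sketched as below: a default language declared in the context applies to plain string values, while a value object carrying its own `@language` overrides it. The manifest and the helper names are hypothetical, and the sketch ignores remote contexts and term-scoped language settings, so it is not a substitute for a real JSON-LD processor.

```python
def context_language(context):
    """Return the default language declared in a JSON-LD context, if any."""
    entries = context if isinstance(context, list) else [context]
    language = None
    for entry in entries:
        if isinstance(entry, dict) and "@language" in entry:
            language = entry["@language"]  # later entries override earlier ones
    return language


def language_of(value, context):
    """Language of one value: its own @language, else the context default."""
    if isinstance(value, dict) and "@language" in value:
        return value["@language"]
    return context_language(context)


# Hypothetical manifest with a default language set in the context.
manifest = {
    "@context": ["https://schema.org", {"@language": "fr"}],
    "name": "Le Petit Prince",
    "alternateName": {"@value": "The Little Prince", "@language": "en"},
}

ctx = manifest["@context"]
print(language_of(manifest["name"], ctx))           # fr
print(language_of(manifest["alternateName"], ctx))  # en
```

This works the same whether the manifest is embedded or detached, at the cost of asking authors to edit the context line, which is the trade-off debated in the log.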
This is a completely open issue at this moment, both for JSON-LD and Schema.org... The only (incomplete) approach would be to rely on, and base everything on, the UTF-encoding of the text...