Validate content of <dc:language> #702
Comments
While at first glance this looks easy to implement, it gets harder when you look at the full RFC5646 spec and not only at the EPUB example: https://tools.ietf.org/html/rfc5646#appendix-A Possibly allowed language tags:
To be honest: that's a validation nightmare! And I don't see a quick way to build a validation engine for that... In fact, it could also be that your example Removing this from the "Next" milestone for the moment... Note to myself: IANA Language Subtag Registry: http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Does the simple type xsd:language address this problem?
Looking at the examples at http://www.datypic.com/sc/xsd/t-xsd_language.html this does indeed seem like a good way to go! I had only looked at this from a Java perspective, not from the schema validation point of view... However, looking at the specs, EPUB->OPF->DublinCore requires RFC5646, which obsoletes the RFC that the XML Schema definition is based on, right? So the DublinCore metadata may allow more valid language codes than XML Schema can validate, although I don't have an example of that. However, if @mattgarrish as our spec guru agrees, I would give this a go and change the schema datatype to xsd:language.
The schemas already enforce xsd:language constraints:
But that just enforces the lexical constraint without trying to verify the validity of the segments. The request, as I understand it, is to go further and validate the segments. It would be great if that were done, but it seems like no small task and a perpetual moving target.
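To make the lexical-only point concrete, here is a minimal sketch in plain Java rather than in the schemas themselves. It assumes the pattern that XML Schema 1.0 gives for xsd:language; the class and method names are made up for illustration. A tag such as en-US-POSIX passes, because only the shape of the subtags is checked, not their meaning.

```java
import java.util.regex.Pattern;

public class XsdLanguageLexicalCheck {
    // Pattern from the XML Schema 1.0 definition of xsd:language (assumed here):
    // 1-8 letters, then any number of hyphen-separated alphanumeric subtags of 1-8 characters.
    static final Pattern XSD_LANGUAGE = Pattern.compile("[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*");

    // Hypothetical helper: true when the value has the right shape, nothing more.
    static boolean matchesLexically(String value) {
        return value != null && XSD_LANGUAGE.matcher(value).matches();
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"en", "en-US", "en-US-POSIX", "en_US", "toolonglanguage"}) {
            System.out.println(tag + " -> " + matchesLexically(tag));
        }
    }
}
```

Anything that looks like a tag gets through this check, which is why verifying the segments themselves needs a separate source of truth.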
It would be nice if meaningless tags such as en-US-POSIX were detected. But if some programming (as opposed to schema hacking) is required, I am not sure this is important enough.
Update: @kalaspuffar started working on this in PR #807. Review of the PR is welcome.
Unless we check the IANA registry, I don't think there's much we can do here beyond the lexical check performed by the schema?
Yes, checking whether language tags are valid requires access to, or a copy of, the registry. I didn't check EPUB 3.2, but the EPUB 3.0 spec text in the first comment didn't say whether it requires the language tag to be well-formed or valid. The LTLI document from the W3C i18n WG contains some guidance on this.
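As a rough illustration of what a registry-backed check could look like, the sketch below parses records in the registry's `Type:` / `Subtag:` / `%%` format and looks up the language, region, and variant subtags of a tag. Everything here is an assumption for illustration: the class and method names are invented, the embedded excerpt is a hand-picked handful of records rather than the real registry, and scripts, extensions, and prefix rules are ignored.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.IllformedLocaleException;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class RegistryLookupSketch {
    // Hand-picked excerpt in the registry's record format; NOT the complete registry.
    static final String REGISTRY_EXCERPT = String.join("\n",
        "Type: language", "Subtag: en", "Description: English",
        "%%",
        "Type: language", "Subtag: de", "Description: German",
        "%%",
        "Type: region", "Subtag: US", "Description: United States",
        "%%",
        "Type: variant", "Subtag: 1996", "Description: German orthography of 1996");

    // Collects the subtags of each record type (language, region, variant, ...), lowercased.
    static Map<String, Set<String>> parse(String registry) {
        Map<String, Set<String>> byType = new HashMap<>();
        for (String record : registry.split("%%")) {
            String type = null;
            String subtag = null;
            for (String line : record.split("\n")) {
                if (line.startsWith("Type: ")) {
                    type = line.substring("Type: ".length()).trim();
                } else if (line.startsWith("Subtag: ")) {
                    subtag = line.substring("Subtag: ".length()).trim();
                }
            }
            if (type != null && subtag != null) {
                byType.computeIfAbsent(type, k -> new HashSet<>()).add(subtag.toLowerCase(Locale.ROOT));
            }
        }
        return byType;
    }

    // An absent subtag is fine; variants may be several values joined by '_' in Locale.
    static boolean known(Map<String, Set<String>> registry, String type, String value) {
        if (value.isEmpty()) {
            return true;
        }
        Set<String> subtags = registry.getOrDefault(type, new HashSet<>());
        for (String part : value.split("_")) {
            if (!subtags.contains(part.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // Well-formedness first, then a lookup of language, region, and variant only.
    static boolean isValid(String tag, Map<String, Set<String>> registry) {
        Locale locale;
        try {
            locale = new Locale.Builder().setLanguageTag(tag).build();
        } catch (IllformedLocaleException e) {
            return false;
        }
        return known(registry, "language", locale.getLanguage())
            && known(registry, "region", locale.getCountry())
            && known(registry, "variant", locale.getVariant());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> registry = parse(REGISTRY_EXCERPT);
        for (String tag : new String[] {"en-US", "de-1996", "en-US-POSIX"}) {
            System.out.println(tag + " -> " + isValid(tag, registry));
        }
    }
}
```

With this excerpt, en-US-POSIX is rejected only because POSIX is not among the listed variants. A real implementation would have to ship or fetch the full registry and keep it up to date, which is the moving-target concern raised earlier in the thread.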
We had a long discussion about well-formed vs. valid for web publications, and the resulting consensus was that there is little value in enforcing validity. Reading systems will react or not based on whether they recognize the language, so ensuring the general pattern is followed is all that is necessary. This really should be clarified in the EPUB spec.
In the Package Document, the language tags appearing in the elements or attributes below MUST be well-formed according to BCP47:
- `xml:lang` attribute
- `hreflang` attribute
- `dc:language` element

For these values:
- the schemas now only do a basic datatype check (string, non-empty value when relevant)
- well-formedness is checked with Java's Locale.Builder#setLanguageTag() API
- a new check (OPF-092) is reported when an ill-formed value is found

See https://docs.oracle.com/javase/8/docs/api/java/util/Locale.Builder.html#setLanguageTag-java.lang.String-

Fix #1221
Close #702
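For readers unfamiliar with that API, here is a minimal sketch of a well-formedness check built on Locale.Builder#setLanguageTag(). It is not the epubcheck implementation: the class and method names are hypothetical, and the reporting of OPF-092 is left out. The explicit empty-string guard is there because setLanguageTag() treats null and empty input as a reset of the builder rather than as an error.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

public class LanguageTagCheck {
    // Hypothetical helper: true when the tag is well-formed according to Locale.Builder.
    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isEmpty()) {
            // setLanguageTag() would silently reset the builder here, so reject explicitly.
            return false;
        }
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            // Thrown by setLanguageTag() when the tag is ill-formed.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"en-US", "en-US-POSIX", "en_US", "en--US"}) {
            System.out.println(tag + " -> " + isWellFormed(tag));
        }
    }
}
```

Note that en-US-POSIX from the original report is well-formed, since POSIX has the shape of a variant subtag; a check like this flags only structural problems such as an underscore separator or an empty subtag.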
epubcheck doesn't check dc:language value!
According to the specification:
Every metadata section must include at least one language element with a value conforming to [RFC5646].
The content.opf of my EPUB, after export from Adobe InDesign, contains
<dc:language>en-US-POSIX</dc:language>
which doesn't have a valid value, and epubcheck ignores that.
epubcheck output: