language stem should respect langMatches semantics #71

Closed
VladimirAlexiev opened this issue Sep 5, 2017 · 9 comments

Comments

@VladimirAlexiev

VladimirAlexiev commented Sep 5, 2017

The following shape:
:SpanishProduct { schema:label [ @es~ ] }
Declares that products must have a label in Spanish or any variant of it (eg es-ES vs es-AR).

But LanguageStem is defined as simple prefix match (http://shex.io/shex-semantics/#nodeIn):

s is a LanguageStem and n is a language-tagged string with a language tag l
and fn:starts-with(l, st)

It has these defects:

  • it will match language "Carro"@ese where ese is Ese Ejja, and I don't think those people got cars ;-)
  • it won't match "Carro"@ES but lang tags are defined to be case-insensitive.
  • (the quoted definition should refer to s instead of st)

Instead of simple prefix match, it should comply with https://www.w3.org/TR/sparql11-query/#func-langMatches semantics. BCP 47 defines subtags for language, script, region, variant (dialect), etc. (RFC 5646), and RFC4647 defines matching on them as case-insensitive. Assuming s doesn't end in - and assuming . represents concatenation, it can be defined e.g. like:
regex (l, "(^".s."$)|(^".s."-)", "i")
Note: a simpler regex would be "^".s."($|-)" but I don't believe the last part of it is valid.
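
For illustration, a minimal Python sketch of the difference between the two behaviours, assuming the stem is a plain tag with no trailing `-`. The function names `prefix_match` and `stem_match` are hypothetical, and the `re.escape` call is an added precaution; neither comes from the ShEx spec.

```python
import re

def prefix_match(tag: str, stem: str) -> bool:
    # Behaviour of the quoted definition: plain, case-sensitive starts-with.
    return tag.startswith(stem)

def stem_match(tag: str, stem: str) -> bool:
    # Behaviour proposed above: the stem must be followed by "-" or by the
    # end of the tag, and the comparison ignores case.
    escaped = re.escape(stem)  # assumption: treat the stem as literal text
    pattern = "(^" + escaped + "$)|(^" + escaped + "-)"
    return re.search(pattern, tag, re.IGNORECASE) is not None

for tag in ["es", "es-AR", "ES", "ese"]:
    print(tag, prefix_match(tag, "es"), stem_match(tag, "es"))
```

With the stem `es`, the prefix version accepts `ese` and rejects `ES`, while the regex version does the opposite on both.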

Aside: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry is a bit unreadable. The script https://gist.github.com/VladimirAlexiev/8733439 turns it into this more readable google sheet

TEST: @ericprud gave this example URL. For me, it doesn't load the test on first load (or control-shift-R) but loads it on second refresh (control-R):
http://rawgit.com/shexSpec/shex.js/master/doc/shex-simple.html?schema=%3CS%3E%20%7B%20%3Cp%3E%20%5B%40aa~%5D%20%7D&data=%3Cexact%3E%20%3Cp%3E%20%22exact%22%40aa%20.%0A%3Csub%3E%20%3Cp%3E%20%22sub%22%40aa-ES%20.%0A%3CshouldFail%3E%20%3Cp%3E%20%22shouldFail%22%40aaa-ES%20.%0A&shape-map=%7BFOCUS%20%3Cp%3E%20_%7D%40%3CS%3E

@jimkont
Contributor

jimkont commented Sep 25, 2017

Resolved with 20170915 meeting

Resolution: change language tag matching to follow RFC4647 per

voted by: Andra, Kat, ericP, tom

@jimkont jimkont closed this as completed Sep 25, 2017
@ericprud
Contributor

ericprud commented Oct 2, 2017

See ~ LanguageStem follows rfc4647

ericprud pushed a commit to shexSpec/shexTest that referenced this issue Oct 3, 2017
ericprud pushed a commit to shexjs/shex.js that referenced this issue Oct 3, 2017
@ericprud
Contributor

ericprud commented Oct 3, 2017

need feedback from @VladimirAlexiev on spec changes and tests before closing. Note that the issue demo fails on master (<shouldFail> passes because the test doesn't respect rfc4647) but passes on the LanguageStem-rfc4647 branch.

@ericprud ericprud reopened this Oct 3, 2017
@VladimirAlexiev
Author

VladimirAlexiev commented Oct 3, 2017

Spec sounds good, I like the ref to https://tools.ietf.org/html/rfc4647#section-3.3.1. Maybe say that * is not allowed, and what happens if I give an incomplete lang tag eg @e~ (answer: won't match any value).

Tests look correct, but:

  • feel a bit uncomfortable about using unregistered sublang tags like @fr-bel
  • Maybe do some case variation (the matching should be case-insensitive)

Cheers @ericprud !

@ericprud
Contributor

ericprud commented Oct 3, 2017

I was going to do a separate PR to add "*" to the grammar a la

[55] languageRange ::= (LANGTAG | '*') ('~' languageExclusion*)?

I tried to find two region codes where one was a substring of the other. Do you know where I can find the canonical list of regions? I picked a valid three-letter ISO region code ("bel"). I guess I could switch from FR to DE and use the example from RFC4647 basic match.

Re: case variation, true. Early on, I had data files like [email protected] and [email protected] but I think some case-insensitive file system ate them long ago. Will re-add tests for that and for shex files matching @FR, . - ~@FR and @FR~ - ~FR-BE.

@VladimirAlexiev
Author

VladimirAlexiev commented Oct 3, 2017

Regions: https://docs.google.com/spreadsheets/d/1M1yv9aBUmc-NyCJX69vOLUmH2uIglSwmDwgRgByI1AI/edit#gid=2001354273 and filter by type=region.
These are 2-letter country codes and 3-digit continent-like codes, so no region code is a substring of another.

But even if one were, the matching is still the same: after the stem should come a dash or the end of the string. I.e. @en-G~ will match neither @en-GB nor @en-GR.

What do you want with *? Eg @*-GB to match any language spoken in Great Britain?

!!!!! Because Cyrl is the default script for ru, ru is the same as ru-Cyrl. This means that ru-RU~ should match ru-Cyrl-RU. My oh my.

And the star would add more complications
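
For illustration: RFC 4647 section 3.3.2 defines an extended filtering scheme that covers both points raised here, wildcard subtags such as `*-GB` and a range like `ru-RU` matching `ru-Cyrl-RU`. A rough Python sketch of that algorithm follows; the function name `extended_filter` is hypothetical, and this is not the scheme the ShEx text quoted later in the thread refers to (that one is basic filtering, section 3.3.1).

```python
def extended_filter(tag: str, lang_range: str) -> bool:
    # Sketch of RFC 4647 section 3.3.2; comparison is case-insensitive.
    tag_parts = tag.lower().split("-")
    range_parts = lang_range.lower().split("-")

    # The first subtags must match unless the range starts with "*".
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False

    t, r = 1, 1
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1                      # wildcard: move to next range subtag
        elif t >= len(tag_parts):
            return False                # tag ran out of subtags
        elif range_parts[r] == tag_parts[t]:
            r += 1                      # subtags agree: advance both
            t += 1
        elif len(tag_parts[t]) == 1:
            return False                # singleton (e.g. "x") blocks skipping
        else:
            t += 1                      # skip an intervening tag subtag
    return True

print(extended_filter("ru-Cyrl-RU", "ru-RU"))
print(extended_filter("en-GB", "*-GB"))
```

The skipping step is what lets `ru-RU` reach past `Cyrl`; it also shows why the star and the script subtag add complexity compared with a plain "stem, then dash or end" check.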

hsolbrig added a commit to shexSpec/shexTest that referenced this issue Oct 6, 2017
ericprud pushed a commit to shexSpec/shexTest that referenced this issue Oct 7, 2017
ericprud pushed a commit to shexSpec/shexTest that referenced this issue Oct 7, 2017
@ericprud
Contributor

ericprud commented Oct 7, 2017

Re case sensitivity, I varied the case in the data and the schema. The latter raised a round-tripping issue to RDF. I invite you to review those PRs.

@ericprud
Contributor

It is our belief that the semantics in ShEx 2.1 § 5.4.6 Values Constraint address this. Please close this issue if you agree.

@ericprud ericprud added this to the 2.1 milestone Nov 23, 2018
@VladimirAlexiev
Author

I've read the section and I think it addresses this by reference to other standards. In particular I like:
st is a basic language range per Matching of Language Tags [rfc4647] section 2.1 and l matches st per the basic filtering scheme defined in [rfc4647] section 3.3.1.

In other words, one is not supposed to use an incomplete stem like en-G~
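
For illustration, a minimal Python sketch of basic filtering as described in RFC 4647 section 3.3.1: a range matches a tag if the two are equal ignoring case, or if the tag starts with the range followed by `-`. The function name `basic_filter` is hypothetical, and the handling of `*` follows the RFC; whether ShEx accepts `*` in a stem is a separate question discussed earlier in this thread.

```python
def basic_filter(tag: str, lang_range: str) -> bool:
    # Sketch of RFC 4647 section 3.3.1 (basic filtering).
    if lang_range == "*":
        return True  # per the RFC, the special range "*" matches any tag
    tag, lang_range = tag.lower(), lang_range.lower()
    # Exact match, or the range is a prefix that ends on a subtag boundary.
    return tag == lang_range or tag.startswith(lang_range + "-")

for tag, rng in [("en-GB", "en"), ("en-GB", "en-G"), ("EN-gb", "en-GB"),
                 ("aaa-ES", "aa"), ("ru-Cyrl-RU", "ru-RU")]:
    print(tag, rng, basic_filter(tag, rng))
```

Under this scheme an incomplete stem such as `en-G` matches nothing that a reader would expect it to, and `ru-RU` does not match `ru-Cyrl-RU`.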
