language stem should respect langMatches semantics #71

VladimirAlexiev · 2017-09-05T08:02:15Z

The following shape:
:SpanishProduct { schema:label [ @es~ ] }
declares that products must have a label in Spanish or any variant of it (e.g. es-ES or es-AR).

But LanguageStem is defined as simple prefix match (http://shex.io/shex-semantics/#nodeIn):

s is a LanguageStem and n is a language-tagged string with a language tag l
and fn:starts-with(l, st)

It has these defects:

It will match "Carro"@ese, where ese is Ese Ejja, and I don't think those people have cars ;-)
It won't match "Carro"@ES, but language tags are defined to be case-insensitive.
(Also, the definition says st where it should refer to s.)

Instead of simple prefix match, it should comply with https://www.w3.org/TR/sparql11-query/#func-langMatches semantics. RFC4647 matching works subtag by subtag (lang, script, region, variant etc) and is case-insensitive. Assuming s doesn't end in - and assuming . represents concat, it can be defined eg like:
regex (l, "(^".s."$)|(^".s."-)", "i")
Note: a simpler regex would be "^".s."($|-)" but I don't believe the last part of it is valid.
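To make the two defects concrete, here is a minimal Python sketch comparing the current prefix definition with the regex proposed above. The function names prefix_match and stem_match are made up for this illustration and are not taken from any ShEx implementation; re.escape is added only so the stem is treated literally.

```python
import re

def prefix_match(stem: str, tag: str) -> bool:
    # Current definition: plain, case-sensitive fn:starts-with(l, s).
    return tag.startswith(stem)

def stem_match(stem: str, tag: str) -> bool:
    # Proposed definition: regex(l, "(^" . s . "$)|(^" . s . "-)", "i").
    # Assumes the stem does not end in "-".
    s = re.escape(stem)
    pattern = "(^" + s + "$)|(^" + s + "-)"
    return re.search(pattern, tag, re.IGNORECASE) is not None
```

With these definitions, prefix_match("es", "ese") is true and prefix_match("es", "ES") is false (the two defects), while stem_match("es", "ese") is false and stem_match("es", "ES") and stem_match("es", "es-AR") are both true.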

Aside: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry is a bit unreadable. The script https://gist.github.com/VladimirAlexiev/8733439 turns it into this more readable google sheet

TEST: @ericprud gave this example URL. For me, it doesn't load the test on first load (or control-shift-R) but loads it on second refresh (control-R):
http://rawgit.com/shexSpec/shex.js/master/doc/shex-simple.html?schema=%3CS%3E%20%7B%20%3Cp%3E%20%5B%40aa~%5D%20%7D&data=%3Cexact%3E%20%3Cp%3E%20%22exact%22%40aa%20.%0A%3Csub%3E%20%3Cp%3E%20%22sub%22%40aa-ES%20.%0A%3CshouldFail%3E%20%3Cp%3E%20%22shouldFail%22%40aaa-ES%20.%0A&shape-map=%7BFOCUS%20%3Cp%3E%20_%7D%40%3CS%3E


jimkont · 2017-09-25T07:22:17Z

Resolved with 20170915 meeting

Resolution: change language tag matching to follow RFC4647 per

voted by: Andra, Kat, ericP, tom

ericprud · 2017-10-02T22:19:09Z

See ~ LanguageStem follows rfc4647

ericprud · 2017-10-03T11:48:26Z

Need feedback from @VladimirAlexiev on spec changes and tests before closing. Note that the issue demo fails on master (<shouldFail> passes because the test doesn't respect rfc4647) but passes on the LanguageStem-rfc4647 branch.

VladimirAlexiev · 2017-10-03T14:15:15Z

Spec sounds good, I like the ref to https://tools.ietf.org/html/rfc4647#section-3.3.1. Maybe say that * is not allowed, and what happens if I give an incomplete lang tag eg @e~ (answer: won't match any value).

Tests look correct, but:

feel a bit uncomfortable about using unregistered sublang tags like @fr-bel
Maybe do some case variation (the matching should be case-insensitive)

Cheers @ericprud !

ericprud · 2017-10-03T14:48:33Z

I was going to do a separate PR to add "*" to the grammar a la

[55] languageRange ::= (LANGTAG | '*') ('~' languageExclusion*)?

I tried to find two region codes where one was a substring of the other. Do you know where I can find the canonical list of regions? I picked a valid three-letter ISO region code ("bel"). I guess I could switch from FR to DE and use the example from RFC4647 basic match.

Re: case variation, true. Early on, I had data files like [email protected] and [email protected] but I think some case-insensitive file system ate them long ago. Will re-add tests for that and for shex files matching @FR, . - ~@FR and @FR~ - ~FR-BE.

VladimirAlexiev · 2017-10-03T15:10:47Z

Regions: https://docs.google.com/spreadsheets/d/1M1yv9aBUmc-NyCJX69vOLUmH2uIglSwmDwgRgByI1AI/edit#gid=2001354273 and filter by type=region.
These are 2-letter country codes and 3-digit continent-like codes. So there are no "substring of another".

But if there were, the matching is still the same: next should come dash or end of string. I.e. @en-G~ will not match @en-GB and @en-GR.

What do you want with *? Eg @*-GB to match any language spoken in Great Britain?

I think this falls under "extended matching" https://tools.ietf.org/html/rfc4647#section-3.3.2. And you can put the star in any position, so the above [55] is not enough.
Check whether langMatches() supports it, I'm a bit doubtful.

!!!!! Because Cyrl is the default script for ru, ru is the same as ru-Cyrl. This means that ru-RU~ should match ru-Cyrl-RU. My oh my.

And the star would add more complications
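For comparison, the extended filtering steps of RFC 4647 section 3.3.2 can be sketched in Python as below. This is only an illustration of the RFC's algorithm as I read it, not something ShEx or langMatches() is known to implement; extended_filter is a made-up name.

```python
def extended_filter(lang_range: str, tag: str) -> bool:
    # Split both into lowercase subtags (matching is case-insensitive).
    r = lang_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must match, unless the range starts with "*".
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1            # wildcard: move on in the range only
        elif ti >= len(t):
            return False       # range has subtags left, tag has none
        elif r[ri] == t[ti]:
            ri += 1            # match: advance in both lists
            ti += 1
        elif len(t[ti]) == 1:
            return False       # singleton (e.g. "x") blocks skipping
        else:
            ti += 1            # skip a non-matching subtag in the tag
    return True
```

Under this scheme extended_filter("ru-RU", "ru-Cyrl-RU") and extended_filter("*-GB", "en-GB") are both true, because the range may skip over the script subtag. Basic filtering gives no match for the first case, since ru-Cyrl-RU neither equals ru-RU nor starts with ru-RU followed by a dash.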


ericprud · 2017-10-07T11:53:01Z

Re case sensitivity, I varied the case in the data and the schema. The latter raised a round-tripping issue to RDF. I invite you to review those PRs.

ericprud · 2018-11-23T10:19:25Z

It is our belief that the semantics in ShEx 2.1 § 5.4.6 Values Constraint address this. Please close this issue if you agree.

VladimirAlexiev · 2018-11-27T08:59:01Z

I've read the section and I think it addresses this by reference to other standards. In particular I like:
st is a basic language range per Matching of Language Tags [rfc4647] section 2.1 and l matches st per the basic filtering scheme defined in [rfc4647] section 3.3.1.

In other words, one is not supposed to use an incomplete stem like en-G~
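A minimal Python sketch of the basic filtering scheme from rfc4647 section 3.3.1, to show what happens with an incomplete stem. The name basic_filter is hypothetical, and this is a reading of the RFC text rather than the code of any ShEx implementation.

```python
def basic_filter(lang_range: str, tag: str) -> bool:
    # The special range "*" matches any tag.
    if lang_range == "*":
        return True
    r = lang_range.lower()
    t = tag.lower()
    # Match if the tag equals the range, or the range is a prefix
    # of the tag and the next character in the tag is "-".
    return t == r or t.startswith(r + "-")
```

Here basic_filter("en-G", "en-GB") and basic_filter("e", "en") are both false, so an incomplete stem is not an error but simply matches nothing, while basic_filter("fr", "FR-be") is true because the comparison ignores case.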

jimkont closed this as completed Sep 25, 2017

ericprud mentioned this issue Oct 3, 2017

Language stem rfc4647 shexSpec/spec#17

Merged

ericprud pushed a commit to shexSpec/shexTest that referenced this issue Oct 3, 2017

+ tests for [shexSpec/shex#71]

46c7f1a

ericprud pushed a commit to shexjs/shex.js that referenced this issue Oct 3, 2017

+ rfc4647 basic filtering per [shexSpec/shex#71]

9be1728

ericprud reopened this Oct 3, 2017

hsolbrig added a commit to shexSpec/shexTest that referenced this issue Oct 6, 2017

Merge pull request #23 from shexSpec/LanguageStem-rfc4647

fde6a52

+ tests for [shexSpec/shex#71]

ericprud pushed a commit to shexSpec/shexTest that referenced this issue Oct 7, 2017

~ address data LangTag in [shexSpec/shex#71]

aad887e

ericprud pushed a commit to shexSpec/shexTest that referenced this issue Oct 7, 2017

~ address schema LangTag in [shexSpec/shex#71]

af8c8ff

ericprud mentioned this issue Oct 7, 2017

Round tripping language tags case #73

Open

ericprud added the wait-commenter-close label Nov 23, 2018

ericprud added this to the 2.1 milestone Nov 23, 2018

VladimirAlexiev closed this as completed Nov 27, 2018
