Dutch-specific replacement of IJ #437

Closed
Bolpat opened this issue Feb 12, 2021 · 24 comments · Fixed by #460
Labels
enhancement, pr-welcome, script latin

Comments

@Bolpat

Bolpat commented Feb 12, 2021

[Image: Heijn]

Language-specific OTF features are active in Libertinus and work as far as I tested*.
However, in Dutch, there is a digraph/letter/ligature consisting of I and J. There are Unicode code points defined for both the upper- and lowercase forms (U+0132/-3, Ĳ and ĳ) that look (as far as I can tell) identical to the separate letters in the upright form in Libertinus, i.e. it makes no visual difference whether one enters IJ or Ĳ (ij or ĳ) in XeLaTeX.
Problem: The italic Libertinus fonts display U+0132/-3 differently from separate I and J, as can be seen in the image above.
Expected: When the text is marked as Dutch, the italic font should look like the second forms regardless of whether IJ or Ĳ (ij or ĳ) is used (i.e. one should not have to resort to entering U+0132/-3 in one's text to get a pleasing output). As the IJ is common in Dutch, it really should look perfect out of the box.

It might be worth mentioning that there is no combined version (like Ij) since at the beginning of a word, both letters are set upper case: IJssel.
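The capitalization rule described above can be sketched in a few lines of Python. This is only an illustration of the orthographic rule, not anything the font does; the helper name `dutch_capitalize` is made up for this example.

```python
def dutch_capitalize(word: str) -> str:
    """Capitalize a Dutch word, treating a leading 'ij' as one letter.

    Hypothetical helper for illustration only.
    """
    # A word-initial i+j digraph is capitalized as a unit: ijssel -> IJssel.
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    # Everything else, including the single code point U+0133,
    # uses the ordinary uppercase mapping of its first character.
    return word[:1].upper() + word[1:]


if __name__ == "__main__":
    for w in ("ijssel", "heijn", "\u0133ssel"):
        print(w, "->", dutch_capitalize(w))
```

Note that Python's built-in `str.capitalize()` would produce `Ijssel`, which is exactly the mixed form that does not occur in Dutch.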

There are (rare) words that have a coincidental i+j clash such as bijectie (from bi- and ject, Latin for "throw") that one would have to handle manually, but that's a general issue with ligatures, end forms and similar features.

Furthermore, IJ/ij can carry an accent (mostly the acute), and when it does, ideally both letters should carry it.
Upright IJ/ij with any variant of placing an accent on the letters I and J separately looks as expected. Using U+0132/-3, however, looks terrible. (This state is acceptable, since virtually no one uses U+0132/-3 when typing text.)
Italic IJ/ij/U+0132/-3 all look terrible with an accent on them. (The fact that separate I and J with accent look bad is purely due to J/j not working with accents; that italic U+0132/-3 don't work with accents is acceptable.)

*) I tested: Turkish small caps (keep the dot on a small i) and Serbian italic Cyrillic (б г д п т shaped as expected).

@alerque
Owner

alerque commented Feb 12, 2021

Thanks for the report. This looks like something that shouldn't be too hard to fix, but no promises when I might get around to it.

A couple questions for when I do (or somebody else does):

  • Is there any normal form for combining the glyphs in uppercase like there is for the lowercase form?
  • What accent marks could potentially get used here?
  • Would it make any sense to put these in the discretionary ligature set, or should it just be on by default?

@moyogo

moyogo commented Feb 12, 2021

Note that any borrowed word that has the sequence would also get the ligature: Fiji, Tijuana, etc. It's not clear how to best handle these.

Regarding the accents, emphasis on a word where the stressed syllable is written with ij would be written with íj́ or íj for lack of j́.

@georgd

georgd commented Feb 12, 2021

In addition to the above remarks, I’d probably not substitute i j by U+0133 but rather create an additional i_j glyph for that purpose.

@alerque added the pr-welcome label Feb 12, 2021
@Bolpat
Author

Bolpat commented Feb 23, 2021

Note that any borrowed word that has the sequence would also get the ligature: Fiji, Tijuana, etc. It's not clear how to best handle these.

I don't speak Dutch, so take my answer with a grain of salt: There is no one size fits all, but since in Dutch IJ isn't particularly rare but rather common, words in which I is followed by an unrelated J are truly rare. Acute accents on IJ could be considered common and should be supported ideally. Any others can be ignored, I guess.

For ij with acute accent, there are various probable encodings:

  1. U+00ED (í), j, U+0301
  2. i, U+0301, j, U+0301
  3. i, j, U+0301
  4. i, U+0301, j
  5. U+00ED (í), j
  6. U+0133, U+0301

Here, 1 and 2 should both be supported. They're the Correct Way. 3 can probably be dismissed, but I wanted to mention it. 4 and especially 5 are typical substitutes that can be entered on a keyboard with dead keys. 6 uses the Unicode Ĳ/ĳ character, which, while discouraged, should probably also work with an accent.
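To see how the six encodings listed above relate to each other, here is a small Python sketch using the standard `unicodedata` module. The dictionary keys follow the numbering of the list; the names `ENCODINGS`, `codepoints`, and `normalized` are only for this illustration.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# The six candidate encodings of "ij with acute", numbered as in the list.
ENCODINGS = {
    1: "\u00ed" + "j" + ACUTE,
    2: "i" + ACUTE + "j" + ACUTE,
    3: "ij" + ACUTE,
    4: "i" + ACUTE + "j",
    5: "\u00ed" + "j",
    6: "\u0133" + ACUTE,
}


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalized(form: str) -> dict:
    """Return every encoding normalized to the given Unicode form."""
    return {n: unicodedata.normalize(form, s) for n, s in ENCODINGS.items()}


if __name__ == "__main__":
    for n, s in normalized("NFC").items():
        print(n, codepoints(ENCODINGS[n]), "->", codepoints(s))
```

Under NFC, encodings 1 and 2 collapse to the same string, as do 4 and 5, so text that has been normalized leaves four distinct sequences for a font to handle. Encoding 6 only turns into encoding 3 under compatibility normalization (NFKC), because U+0133 has a compatibility decomposition to i + j.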

Capital IJ with acute accent is far rarer than lower case one.

In German, my native language, I need to use the LaTeX commands \textcompwordmark and \- in some cases to get proper results. That's not surprising; they exist for a reason. Knowing that is part of professional typesetting. Casual Dutch writing wouldn't become unreadable if the IJ glyph were used in Fiji or Tijuana (note that the difference is only noticeable in italics).
I'd say that wrong automatic hyphenation of German words like Kreis·chen (in contrast to krei·schen) is worse theoretically, but such problems rarely manifest in practice.

Fixing U+0132/-3 shouldn't be too hard if everything else can be done.

@KrasnayaPloshchad

KrasnayaPloshchad commented Feb 27, 2021

@laszlonemeth has already implemented the IJ ligature in his fonts.
https://numbertext.org/linux/NEWS-20110101.pdf

@KrasnayaPloshchad

KrasnayaPloshchad commented Feb 27, 2021

Note that any borrowed word that has the sequence would also get the ligature: Fiji, Tijuana, etc. It's not clear how to best handle these.

One simple way to eliminate the ligature in such words is to insert a ZWNJ between i and j. OpenType also has 'NLD ' and 'FLM ' language tags for Dutch/Flemish locales, so even if such features are enabled by default, they would have little effect on other languages.
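As a rough sketch of the ZWNJ approach, the following Python helper inserts U+200C between i and j in a list of known exception words. The function name `break_ij` and the word list are assumptions made for this example (the words themselves come from this thread), and whether the ZWNJ actually suppresses the substitution depends on the shaping engine and on how the font's lookup is written.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Words from this thread in which i+j is not the Dutch digraph.
EXCEPTIONS = ("Fiji", "Tijuana", "bijectie")


def break_ij(text: str, exceptions=EXCEPTIONS) -> str:
    """Insert a ZWNJ between i and j inside the listed exception words."""
    words = re.compile(r"\b(" + "|".join(map(re.escape, exceptions)) + r")\b")

    def fix(match):
        # Zero-width match: the position preceded by i/I and followed by j/J.
        return re.sub(r"(?<=[iI])(?=[jJ])", ZWNJ, match.group(0))

    return words.sub(fix, text)


if __name__ == "__main__":
    print(ascii(break_ij("Fiji ligt ver van de IJssel")))
```

Words that are not in the exception list, such as IJssel, are left untouched.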

@ivo-s
Contributor

ivo-s commented Mar 9, 2021

In addition to the above remarks, I’d probably not substitute i j by U+0133 but rather create an additional i_j glyph for that purpose.

Could you please elaborate on this? The U+0133 glyph is already present in all Libertinus fonts, and it has a convenient name ij/IJ. As a layman, it seems to me that making another i_j glyph outside of the Unicode range (and most likely just referencing U+0133 there) is redundant. It would seem that this digraph is often treated as a single letter in the Netherlands, sometimes taking the form of a U with a gap, so I see it as just a standard Latin alphabet extension like letters with diacritics.
https://en.wikipedia.org/wiki/IJ_(digraph)
Maybe this point is also related to the discussion in #456 ?

@Crissov
Contributor

Crissov commented Mar 9, 2021

There can be a vowel letter like e before the i, as in Heijn, but can there also be one after the j? – except in non-applicable words like the mentioned bijectie, of course.

@georgd

georgd commented Mar 10, 2021

@khaledhosny is explaining it here a bit:

#455 (comment)

Substituting encoded glyphs for other encoded glyphs is a potential source for unexpected and buggy behaviour. It might not be as obvious as in other cases but this is a well known principle to avoid bugs which is quite simple to follow so I'd not risk anything here.
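The principle can be illustrated with a toy model in Python. This is not how Libertinus or any shaping engine is implemented; it only sketches the idea that a language-specific rule maps the sequence i + j to a separate, unencoded glyph (here called `i_j` and `I_J`, names assumed for the example) instead of to the glyph encoded at U+0133. The function name `apply_dutch_ij` is likewise hypothetical.

```python
# Toy model: glyphs are plain glyph-name strings, not real font data.
LIGATURES = {("i", "j"): "i_j", ("I", "J"): "I_J"}


def apply_dutch_ij(glyphs, lang):
    """Replace i+j and I+J by unencoded ligature glyphs for Dutch only."""
    if lang != "NLD":
        return list(glyphs)
    out = []
    k = 0
    while k < len(glyphs):
        pair = tuple(glyphs[k:k + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])  # consume both input glyphs
            k += 2
        else:
            out.append(glyphs[k])
            k += 1
    return out


if __name__ == "__main__":
    print(apply_dutch_ij(list("Heijn"), "NLD"))
    print(apply_dutch_ij(list("Heijn"), "DEU"))
```

Because the result is never the encoded `ij` glyph, anything else in the font that targets U+0133 specifically is unaffected by the rule.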

@KrasnayaPloshchad

FontForge has a Copy Reference feature: after you press Ctrl+V to paste, you get a direct clone of the glyph. It's a good way to avoid substituting encoded glyphs for other encoded glyphs, and it makes maintenance easier.

@Crissov
Contributor

Crissov commented Mar 10, 2021

Should there also be an opt-in stylistic character variant (cvXY), wherein uppercase and small-caps IJ is displayed with the same glyph as the letter Y and lowercase ij is displayed the same as the letter ÿ (ydieresis)? (Not sure what it would look like with acute, ý or ӳ.)

@ivo-s
Contributor

ivo-s commented Mar 10, 2021

Substituting encoded glyphs for other encoded glyphs is a potential source for unexpected and buggy behaviour. It might not be as obvious as in other cases but this is a well known principle to avoid bugs which is quite simple to follow so I'd not risk anything here.

I see; Unicode glyphs should be reserved for the intended user input, and a font should not be swapping between them by itself. Any substitutions should be unencoded glyphs. It is much more obvious for the cases in #455, that the font features should not be treated as an autocorrect. Apparently this is a generally recognized good practice, so I will make the appropriate changes right away.

Should there also be an opt-in stylistic character variant (cvXY), wherein uppercase and small-caps IJ is displayed with the same glyph as the letter Y and lowercase ij is displayed the same as the letter ÿ (ydieresis)? (Not sure what it would look like with acute, ý or ӳ.)

According to the wiki, ÿ is actually a separate letter that is different from ij, even though they look the same in handwriting. Afrikaans uses y, but that is a separate language. In any case, as discussed above, it should be up to the user what they write.

@Crissov
Contributor

Crissov commented Mar 10, 2021

I meant that IJ.alt would be an alias to Y (also ij.sc.alt = y.sc) and ij.alt would be an alias to ydieresis, as described by @KrasnayaPloshchad above. This would be made available in a cvXY feature and, if deemed reasonable, there could be a locl rule for AFK (Afrikaans) as well – or is the lowercase a dotless y there?

@ivo-s
Contributor

ivo-s commented Mar 10, 2021

The problem is not substituting Unicode, but the general idea that font features are intended for stylistic options, not substituting typable characters with other typable characters. Simply put, if the user wants to use y, they will type y. Features like "lowercase to uppercase" or "turn ö into oe" should be left to word processors. In theory, a valid alternate style would be the "broken U" glyph, but I assume that would not fit in this typeface, even in Libertinus Sans.

@georgd

georgd commented Mar 11, 2021

The problem is not substituting Unicode, but the general idea that font features are intended for stylistic options, not substituting typable characters with other typable characters.

I don’t fully agree here. When you replace IJ by the encoded IJ digraph you are substituting with a (theoretically) typable character. The real distinction is between typographic variation on the glyph level and orthographic choice. The former is well targeted by font features; handling the latter in font features is doubtful.

Simply put, if the user wants to use y, they will type y. Features like "lowercase to uppercase" or "turn ö into oe" should be left to word processors.

These two could fall into the category of typographic variation. I know that Dutch y and ij are pronounced the same, but I can’t say anything about their orthography, so that question should be answered by somebody else. The case of Ö vs. Oe in German, however, is no question of orthography. They are explicitly considered equivalent. So, I would accept a stylistic font feature replacing Odieresis with a glyph Odieresis.alt which looks like 'Oe' or like 'Œ' (as a ligature of Oe) or like 'O ͤ' (this should be an uppercase O with a combining small e above) or some variation on this.

@ivo-s
Contributor

ivo-s commented Mar 11, 2021

I don’t fully agree here. When you replace IJ by the encoded IJ digraph you are substituting with a (theoretically) typable character. The real distinction is between typographic variation on the glyph level and orthographic choice. The former is well targeted by font features; handling the latter in font features is doubtful.

OK, IJ is theoretically typable, but I see the matter of ij/y as an orthographic choice, as you put it. I draw my information from the Internet, so I'm not completely sure, but I understand ij to be the correct spelling, even though historically it might have evolved from y. Substituting ij with y is a good English transcription, because the pronunciation is clearer. However, there is no y in the Dutch alphabet. A quick Google search reveals:
https://www.dutchgrammar.com/en/?n=SpellingAndPronunciation.03
https://www.dutchgenealogy.nl/there-is-no-letter-y-in-the-dutch-alphabet/

The case of Ö vs. Oe in German, however, is no question of orthography. They are explicitly considered equivalent. So, I would accept a stylistic font feature replacing Odieresis with a glyph Odieresis.alt which looks like 'Oe' or like 'Œ' (as a ligature of Oe) or like 'O ͤ' (this should be an uppercase O with a combining small e above) or some variation on this.

I stand corrected.

@Bolpat
Author

Bolpat commented Mar 16, 2021

The case of Ö vs. Oe in German, however, is no question of orthography. They are explicitly considered equivalent.

I don't think so, especially for names. However, placing a small e on the letter qualifies as an (archaic) stylistic choice for Ä, Ö, and Ü. No clear-cut real-world example comes to mind where using ö versus oe makes a difference, but that's probably because it's not particularly easy to search for something like that. One could make up words like Koerbe from the English word co-heir (syn. parcener), which, in my opinion, isn't too far-fetched. Körbe is the plural of Korb, meaning basket.

@alerque
Owner

alerque commented Apr 18, 2021

Many thanks to @ivo-s for implementing this feature. If there are any concerns about the actual implementation please do speak up before the next release. For now you should be able to download artifacts with this work from here to give this a test run.

@Bolpat
Author

Bolpat commented Apr 21, 2021

In #460, Flemish is not included, which I think is a mistake. What matters isn't what school teachers say, but how the digraph is perceived. I have a custom stylesheet for Wikipedia and, well, have a look:
[Image: Flanders]

Still, thank you for the NLD fix. Hope to see it soon in the release.

@alerque
Owner

alerque commented Apr 21, 2021

Is this just a matter of adding FLM anywhere we have special handling for NLD? Or are there other pitfalls to be aware of?

@moyogo

moyogo commented Apr 22, 2021

FLM is not a language system tag. FLE, named "Dutch (Flemish)", is the OT language system tag corresponding to the ISO 639 code vls which is West Vlaams or West Flemish, spoken in the western part of Flanders.

West Flemish does not use ij as standard Dutch does, it uses y instead, so having the same lookups for ij with the language system FLE may not make sense.

Note that the OT 1.5 language system tags did not specify that "Flemish" FLE corresponded to vls; that started in OT 1.6, and it only started being called "Dutch (Flemish)" in OT 1.7.
The names "Flemish" and "Dutch (Flemish)" are rather ambiguous, and it's not clear whether the original intention was to correspond to vls for West Flemish only. See https://en.wikipedia.org/wiki/Flemish.
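The tag relationships mentioned in this thread can be summarized in a small Python lookup. The table only lists the three tags discussed here, and the set of tags that receive the IJ handling reflects the state described in this thread (NLD only); the names `OT_LANGSYS`, `IJ_LANGSYS`, and `wants_ij_handling` are made up for the sketch.

```python
# OpenType language system tag -> (name, ISO 639 code), limited to the
# tags that come up in this discussion.
OT_LANGSYS = {
    "NLD": ("Dutch", "nld"),
    "FLE": ("Dutch (Flemish)", "vls"),
    "AFK": ("Afrikaans", "afr"),
}

# Tags that get the special IJ treatment, per the discussion above.
IJ_LANGSYS = {"NLD"}


def wants_ij_handling(tag: str) -> bool:
    """Return True if the given language system tag gets IJ handling.

    Raises KeyError for tags that are not in the table above.
    """
    key = tag.strip().upper()  # tags are padded to four characters
    if key not in OT_LANGSYS:
        raise KeyError(f"not a known language system tag here: {tag!r}")
    return key in IJ_LANGSYS


if __name__ == "__main__":
    for tag in ("NLD ", "FLE ", "AFK "):
        print(tag, OT_LANGSYS[tag.strip()], wants_ij_handling(tag))
```

Looking up 'FLM ' raises a `KeyError`, matching the point that FLM is not a language system tag.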

@alerque
Owner

alerque commented Apr 23, 2021

Thanks for the background @moyogo! From that info it sounds like doing nothing special for FLE (i.e. not matching the special handling added for NLD) is the correct thing to do. @Bolpat if you still think this is a mistake and we should do something I'm happy to evaluate further, but if so let's please open a new issue to discuss. Link or quote these latest comments to reply.

@Bolpat
Author

Bolpat commented Apr 29, 2021

@moyogo I didn't really get the FLE vs vls part, but the Wikipedia article about West Flemish reads:

ij - /ɛi/ is realised as [ɨ] (short ie, also written as y) and in some words as [ʉ].

I've looked into the topic some more, and it seems to me that it's not really clear-cut, since West Flemish does not have a standardized spelling. Apparently, speakers use Y, but consider it a variant or replacement of Dutch IJ. The claim that IJ is absolutely nothing special in West Flemish is nonsense. As far as I could find, IJ might be used by a minority of West Flemish speakers where others would use Y. I guess adding IJ treatment is not incorrect for people using Y, but its absence is a gap for those using IJ.

@moyogo

moyogo commented Apr 30, 2021

@Bolpat The paragraph before your sample sentence says "The following differences are listed by their Dutch spelling as some different letters have merged their sounds in Standard Dutch but remained separate sounds in West Flemish."
