Normalizing English language tag #3100

adunning · 2024-12-05T06:18:37Z

Debug log ID

FH3W5CKW-refs-euc/6.7.263-7

What happened?

The CSL spec indicates that the language field should hold an ISO 639-1 language code (which is also a valid IETF tag). pandoc-citeproc follows this to the letter and applies title case only to items that either have no language specified or carry the tag en. Unfortunately, Zotero does not normalize this field on import, and many items end up with non-IETF values in the language field, mostly ISO 639-2 codes, which triggers an undesired sentence-case citation. It would be most helpful if BBT could convert ISO 639-2 to ISO 639-1 language codes, and perhaps also normalize strings such as English to en.
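A minimal sketch of the kind of normalization requested here, written in Python purely for illustration. The lookup tables and the function name `normalize_language` are hypothetical and cover only a handful of codes; this is not BBT's implementation, and a real normalizer would need the full ISO 639 registry.

```python
# Hypothetical lookup tables for illustration only; a real normalizer would
# need the complete ISO 639 registry rather than these few entries.
ISO_639_2_TO_1 = {
    "eng": "en",
    "lat": "la",
    "deu": "de",  # ISO 639-2/T code for German
    "ger": "de",  # ISO 639-2/B code for German
    "fra": "fr",  # ISO 639-2/T code for French
    "fre": "fr",  # ISO 639-2/B code for French
}
NAME_TO_ISO_639_1 = {
    "english": "en",
    "latin": "la",
    "german": "de",
    "french": "fr",
}


def normalize_language(value: str) -> str:
    """Return an ISO 639-1 code when the value is recognized.

    Unrecognized values (including tags that are already valid, such as
    "en-GB") are returned unchanged apart from surrounding whitespace.
    """
    stripped = value.strip()
    key = stripped.lower()
    if key in ISO_639_2_TO_1:
        return ISO_639_2_TO_1[key]
    if key in NAME_TO_ISO_639_1:
        return NAME_TO_ISO_639_1[key]
    return stripped
```

With these assumed tables, both `normalize_language("eng")` and `normalize_language("English")` give `en`, while an already valid tag such as `en-GB` passes through untouched.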

github-actions · 2024-12-06T14:57:17Z

🤖 this is your friendly neighborhood build bot announcing test build 6.7.263.7430 ("fixes #3100")

This update may name other issues, but the build just dropped here is for you; it just means problems already fixed in other issues have been folded into the work we are doing here. Install in Zotero by downloading test build 6.7.263.7430, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

adunning · 2024-12-06T16:30:47Z

This works well; thank you! I would suggest two improvements:

It appears that there is no need to change en-US to en, as both are valid IETF codes. Including a region still results in the correct capitalization in Pandoc:

pandoc --citeproc -t plain << EOT
---
references:
- id: example
  author: "Author"
  title: "Example title"
  language: "en-GB"
  issued:
    year: 2024
---

Citation: [@example].
EOT

It is possible to have multiple language codes stored in the Language field, delimited with a space (e.g. eng lat) or occasionally a semicolon (e.g. eng;lat). For examples, add the ISBNs 978-0-19-815039-8 or 978-1-908590-41-1 to a Zotero library. If these codes are converted to IETF tags, this will give the expected result in Pandoc:

pandoc --citeproc -t plain << EOT
---
references:
- id: example
  author: "Author"
  title: "Example title"
  language: "en la"
  issued:
    year: 2024
---

Citation: [@example].
EOT

Many thanks again!

njbart · 2024-12-07T08:53:09Z

A word of caution, though: In CSL, the language variable is supposed to hold one single language tag only, and the variable’s current unique role is to switch on conversion of titles to title-case when rendering an item’s metadata if the language tag starts with en (and if, in addition, asked by the CSL style to do so, of course).

From https://docs.citationstyles.org/en/stable/specification.html#appendix-iv-variables (note the singular!):

language
The language of the item;
Should be entered as an ISO 639-1 two-letter language code (e.g. “en”, “zh”), optionally with a two-letter locale code (e.g. “de-DE”, “de-AT”)

The reason the language: "en la" example works as expected is merely because the tag starts with en; with language: "la en", it does not.

Unfortunately, there is no CSL variable intended to record the language(s) the content of a work is written in (for this purpose, biblatex has language; confusingly, as CSL’s language equals biblatex’s langid).

retorquere · 2024-12-07T13:11:56Z

But then there's no benefit to adding en-US over just en.

njbart · 2024-12-07T14:09:45Z

I’d still recommend not throwing away information, so I’d always import something like american as en-US rather than just en. (I wouldn’t add information, though, so english should just remain en.)
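The recommendation above (keep a region when the source provides one, but do not invent one) could be sketched as follows. The table `BABEL_TO_TAG` and the function `babel_name_to_tag` are hypothetical names with only three entries, chosen for illustration; they are not taken from BBT or from babel's configuration files.

```python
from typing import Optional

# Hypothetical table: babel-style language names mapped to IETF tags.
# A region is kept where the name implies one ("american", "british")
# and omitted where it does not ("english").
BABEL_TO_TAG = {
    "american": "en-US",
    "british": "en-GB",
    "english": "en",
}


def babel_name_to_tag(name: str) -> Optional[str]:
    """Look up a babel-style name; return None when the name is unknown."""
    return BABEL_TO_TAG.get(name.strip().lower())
```

Under these assumptions, `american` maps to `en-US` and `english` maps to plain `en`, so no region is discarded and none is added.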

In any case, keeping language-plus-locale tags is essential when exporting to biblatex, as biblatex can also modify hyphenation, punctuation, and localised terms, all of which might differ between, say, en-US and en-GB, or de-DE and de-AT.

From the current biblatex manual:

It is highly advisable to always specify american, british, australian, etc. rather than english when loading the babel/polyglossia packages to avoid any possible confusion.

retorquere · 2024-12-07T14:23:02Z

But that doesn't apply to CSL, right?

njbart · 2024-12-07T16:06:30Z

Right. From a CSL (processor) perspective, it currently does not matter if it’s en or en-US.

That being said, en-US is a perfectly valid CSL language tag (see quote from the CSL specs, above), so there’s no reason not to use it.

The OP was about normalising upon import after all, where I would continue to argue that throwing away available information (e.g., by ‘normalising’ from en-US to en) is not a good idea since this very information might be useful, at the very least when exporting to biblatex.

retorquere · 2024-12-07T17:11:40Z

That is not my understanding - I think the OP was talking about items already in Zotero, and that during that import (from whatever source) the language values end up being a hodgepodge (likely so no information is discarded), and how they could be normalized on CSL export. I don't have CSL import, just export.

The reason I'd prefer to leave it as en is that I currently reuse code I already have, and changing it would be kind of involved.

adunning · 2024-12-07T20:26:53Z

Yes, my aim is purely to export items from Zotero into valid CSL JSON, for use in Pandoc. While currently this only changes whether title case is applied, I plan to see whether language tagging can also be applied to citations, if this field can be normalized reliably.

I hadn't realized that it was against the spec to list more than one language tag. In that case, if more than one is recorded in Zotero, perhaps only the first could be kept?
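One way the "keep only the first" idea could look, as a rough Python sketch. It assumes the two delimiters mentioned earlier in this thread (a space or a semicolon); the function name `first_language_tag` is hypothetical and this is not BBT's code.

```python
import re


def first_language_tag(field: str) -> str:
    """Return the first code from a space- or semicolon-delimited field.

    Returns an empty string when the field contains no code at all.
    """
    # Split on any run of whitespace and/or semicolons, dropping empty pieces.
    parts = [part for part in re.split(r"[\s;]+", field.strip()) if part]
    return parts[0] if parts else ""
```

Here `first_language_tag("eng lat")` and `first_language_tag("eng;lat")` both give `eng`, which could then be converted to `en`. Because the field is free-form, a multi-word language name would be cut at its first word by this approach, so a real normalizer would probably want to try matching the whole string before splitting.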

If I can get Pandoc to output language tagging with citations, it could be useful to be able to distinguish between, for example, de-DE and de-AT (as @njbart notes), to control hyphenation. It will make no difference for Pandoc's current functionality. I had mainly assumed that the code could be simplified if it were not concerned with discarding this information – it's probably not worth your time if that's not the case.

retorquere · 2024-12-07T23:59:01Z

Zotero doesn't really have the concept of multiple languages stored per item. It's a single free-form string.

I can take a look later next week at what I can do about locales. It may in the end be simpler, but it's not now. The language normalizer in BBT scripts off of babel's language configs, and I don't recall how much flexibility I kept in that process.

retorquere added a commit that referenced this issue Dec 6, 2024

fixes #3100

c759846
