Figure out whether, or how to support the extended ISO 639-3 list of languages #8578
Comments
These language codes are part of the Citation metadata block, defined as valid controlled vocabulary values for the field "language". So strictly speaking, these values are not in the source code. If this is really urgent, you could fix it in your installation yourself by adding the lines for "fro" and "frm" etc. to the standard citation.tsv and then reloading the metadata block via the API (see the sketch below).
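For illustration, a minimal sketch of that workaround, assuming the usual layout of the #controlledVocabulary section in citation.tsv (tab-separated rows of DatasetField, Value, identifier, displayOrder, with extra columns treated as alternate values) and the documented /api/admin/datasetfield/load endpoint; the display names and order numbers below are placeholders and should be checked against the existing language rows before loading:

    # Rows appended under the #controlledVocabulary section of citation.tsv
    # (columns are tab-separated; the trailing column is an alternate value,
    # which is what lets imports match on the bare code):
    #   language   Old French (842-ca. 1400)       fro   190   fro
    #   language   Middle French (ca. 1400-1600)   frm   191   frm

    # Reload the citation block so the new values become legal controlled vocabulary:
    curl http://localhost:8080/api/admin/datasetfield/load \
         -X POST -H "Content-type: text/tab-separated-values" \
         --upload-file citation.tsv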
Related (thanks for the comment, Leonid):
People requesting extra ISO language codes to be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv); these are NOT duplicates, different things are being requested in the issues below, but it makes sense to get all 3 out of the way at the same time. Added back the label: NIH OTA: 1.4.1. Need to touch base with Leonid on this.
I looked at this a while ago and am not sure I understand it all. However, FWIW: we have < 200 language codes today (ISO 639-2), and for ISO 639-3, "As of 18 February 2021, the standard contains 7,893 entries". If we simply cut/paste the new values, we will be making the list users have to scroll through ~40 times bigger. Further, "ISO 639-3 is not a superset of ISO 639-2", and some languages will have both 639-2 and 639-3 codes (French, for example, has the 639-2 codes "fre"/"fra" and the 639-3 code "fra"). ISO 639-3 also has some hierarchy, with macrolanguages that include sub-languages. We also may have to understand how to handle a mix of 639-2 and 639-3 codes for export (what do you do when a language has both codes?) and import. For import, I think our code will already look for aliases of a term, so I think we could accept imports in either standard without more work (this should be tested, though).
Review with Leonid
I'm not sure what to do with this one.
2023/12/19: Requires additional conversation with @DS-INRA and @tjouneau to determine next steps. Note that this is primarily a metadata issue rather than a harvesting issue.
Thanks for the ping and for reviving the discussion.
This issue (#8578) is sprint ready but before anyone picks it up I think we should:
I'm going to change the title of the issue, since we've been de facto planning to use this issue to figure out if we are going to, or how to, offer support for the full ISO 639-3 list in general, and not just within the context of import, or specifically harvesting. There are apparently real-life instances where users do want to have the full ~8K extended list as an actual controlled vocabulary (so, no, the option added in #10323 - allowing an instance to harvest non-CVV-conforming values from other sources - while useful to some instances, is not going to solve the issue for everybody). Case in point: see the comments from a user in #10481. There are good arguments against adding the full list to the metadata block that we distribute for everyone (see Jim's comment above). An external CV could be a solution. Or perhaps a standard mechanism for an optional CV "expansion pack" that an instance can choose to install.
I'm working on a solution that allows an admin to download the full ISO 639-3 list and load it directly into Dataverse via the same API that loads the tsv files. It merges the languages into the CV. I haven't seen any lag in the UI with the addition of 7,615 languages. I'd still like to test loading both 639-2 and 639-3, but if this solution is not acceptable then I won't waste the time and will start looking at other options.
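For context, a rough sketch of what that kind of loading could look like from the command line, assuming the SIL iso-639-3.tab code table (tab-separated, identifier in the first column, reference name in the seventh; the column positions and the append-then-reload approach are assumptions for illustration, not a description of the actual work in progress), and ignoring deduplication against the 639-2 values already in the block, which is the part that needs real care:

    # Download iso-639-3.tab from the SIL ISO 639-3 site first (URL omitted here).
    # Turn each code into a controlled-vocabulary row for the "language" field;
    # display order is just the input line number in this sketch.
    awk -F'\t' 'NR > 1 { printf "\tlanguage\t%s\t%s\t%d\t%s\n", $7, $1, NR, $1 }' \
        iso-639-3.tab >> citation.tsv

    # Load the updated block through the same endpoint used for the shipped tsv files:
    curl http://localhost:8080/api/admin/datasetfield/load \
         -X POST -H "Content-type: text/tab-separated-values" \
         --upload-file citation.tsv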
I don't have a strong opinion but I think @landreev was concerned about ~8k language entries in the database. If it's performant, maybe it's ok? 🤷 Using an external controlled vocabulary service might be an option as well, assuming it exists.
@stevenwinship @qqmyers @pdurbin However, I don't think this (straightforward and simple) solution should be considered completely off the table. If your experiments with the UI suggest that the performance and the look-and-feel for the user are not atrocious, reconsidering it should be up for debate. It wasn't just the size, though; I would suggest taking a close look at Jim's comment from 1.5 years ago and seeing if all the questions there have been answered. Having taken a quick look, one serious unknown there is the hierarchy (the "macrolanguages" defined in ISO 639-3). But I'm wondering if the solution is... to just not worry about it and handle them all as a flat list?
@stevenwinship But everything I said earlier still stands, I believe. That could be a potential model for distributing the CVV. Or, if we play with it and conclude that the UI works fine with the full list, then we may just shove it into the distributed citation.tsv. But I should also put it on record that this is the kind of issue that may not be super challenging technically, but will need more people involved to finalize any decisions. I can think of Julian, since metadata is his thing. I personally don't have a stake in it or any super strong opinions; I just got to work on #8243 recently and ended up learning a lot about the ISO codes.
I'd suggest not having a one-off mechanism for language. If the UI works with 8K items, adding them to the block seems OK. If we need to deal with the hierarchy in the UI, probably the easiest and most SPA-ready option would be to use the external vocabulary mechanism and JavaScript. It would be nice to allow people who only want to use the -2 list to do so; I'm not sure how to do that with one block, but the external mechanism could be configurable (either a flag in the script or two scripts). If there isn't an online service to ping, the external mechanism can just have the script include the static list. I definitely second having a UX discussion on this, especially if there won't be a choice to stay with the shorter list. I think it is also worth investigating whether the use of aliases is enough to make export and harvesting work as expected, i.e. are -3 values OK in DDI, DataCite, etc. for the people who use those today? Do we need to use the -2 version where possible for some users? As @landreev said, these are more UX/user questions than technical ones.
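For reference, the external-vocabulary route is configured through the :CVocConf database setting, so an installation that wants the full 639-3 list (or only the -2 list) would point the language field at a script via a JSON config. A minimal sketch follows; language-cvoc.json is a hypothetical file name, and its contents would need to follow the examples in gdcc/dataverse-external-vocab-support and the admin guide:

    # Register an external controlled-vocabulary configuration for the language field.
    # language-cvoc.json is a placeholder; its structure (field name, script URL, etc.)
    # must follow the documented external vocabulary config format.
    curl -X PUT --upload-file language-cvoc.json \
         http://localhost:8080/api/admin/settings/:CVocConf

Whether the script then queries an online service or bundles a static list, as suggested above, would be a per-installation choice.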
Since version 5.4, things have improved regarding language mapping problems.
Some codes are still not handled; in the cases we encountered, frm (Middle French) and fro (Old French).
Would it be possible to include all codes in the Dataverse source?
What steps does it take to reproduce the issue?
Try to harvest from https://repository.ortolang.fr/api/oai/?verb=ListRecords&set=producer:atilf&metadataPrefix=oai_dc
6 datasets are not harvested, 4 due to language mapping issues.
What happens?
Mapping errors documented in the harvest log:
Exception processing getRecord(), oaiUrl=https://repository.ortolang.fr/api/oai, identifier=oai:ortolang.fr:0c2017f1-7c3b-473a-b75d-ad97b4e09bd0, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'fro' does not exist in type 'language')
I'm attaching the relevant extract from server.log and the harvest log.
harvest_ortolang3_2022-04-04T15-34-00.log
server.log
Which version of Dataverse are you using?
5.10
Any related open or closed issues to this bug report?