-
Notifications
You must be signed in to change notification settings - Fork 495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8243 improve language controlled vocab #10481
Conversation
…nto a local branch; #8243
…er conversation with requestor. #8243
… - I'm leaving the main name intact (so that the block update will still works), but adding both versions as extra alternative names, so that either is importable. #8243
…he ISO 639-1 and -2 codes for Nuosu. #8243
…the order in which they are listed in the current ISO 639-3 table) #8243
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a change to Malayalam? Please see my comment in this review.
language Malay 100 may msa ms | ||
language Malay 100 msa may ms | ||
language Malayalam 101 mal ml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a conversation going on where Youn Noh at Yale is proposing these changes:
ORIG:
language Malayalam 101 mal ml
NEW:
language Malayalam mal 4016 mal mal ml
Should we get this change in as well?
Please see https://groups.google.com/g/dataverse-community/c/rGkaHuzA2x4/m/GeyOF1kpAgAJ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pdurbin The change above doesn't make much sense, so no, we certainly don't want it as is. I'm going to add a comment in the google group thread reiterating an earlier suggestion for the author to take a closer look at the opening comment in this PR, since it answers some of the questions asked there.
iqss_pr_10481.zip contains a sample script for citation.tsv in its original form and as modified by the script. The ISO 639-3 code tables are available for download from SIL; the input file from them is included in the attachment as well. (Please let me know if there are any issues with the attachment.) I think this clarifies the change we would like to see. I used the ISO 639-3 language code as the value for identifier and adhered to the existing column order in columns following displayOrder for mappings to previous versions of ISO 639. As I understand it, the mappings are needed to maintain compatibility with metadata standards such as OAI DC that use those versions. I had thought about excluding the rows from the ISO table where Scope and Language have a value of "S" but ended up leaving them to remain faithful to the standard. (IMPORTANT: We need to switch to ISO 639-3 to meet the request of a professor of linguistics who has graciously served as one of our beta testers. The languages needed by her research group are not in ISO 639-2.) |
@younnoh Thank you, I'll take a look. |
@younnoh If you haven't yet, please read carefully the opening comment for this PR that explains its scope and limitations. |
|
@younnoh The issue in question is #8578. It's been sitting untouched for a while, but please note that it's been recently tagged as "sprint ready", which means it is being scheduled to be revisited and worked on sometime soon (probably as soon as this PR is merged), and we'll use that issue to figure out what to do with/how to handle the full ISO 639-3 list. I'm in the middle of something now, but I'll try to get back to you with more info/answers to your other questions ASAP. One quick thing: please note that in the context of this issue, or the other issue referenced, we are talking about making changes to the "official" version of citation.tsv that we distribute with the application. But the beauty of keeping our "metadata blocks" configurable, and not hard-coded in the application, is that you don't really have to use the configuration that we supply. Within the core dev. team we've had somewhat of a consensus in that we don't want to add the full 639-3 list to the CV, in part because we don't believe that most users will want that pulldown Language menu on the Edit Dataset page to have 8K entries in it. But, if that's something your users do want to have, nobody stops you from adding these CVVs to the tsv file and adding them on your own instance. |
Thank you for the link to the issue and the update on your future sprint. I would like to raise the points below for your consideration as you decide how to proceed.
|
Short answer: yes, I'll do this.
- I just wasn't aware that we had this identifier, for the very purpose of solving this problem: to able to change the primary CVV when updating a metadata block. In my defense, nobody on our dev. team here corrected me on that. |
@younnoh just a quick question on this point:
It's already possible to type a few characters as in the example below. Are you looking for something else? |
Thanks, Phil. That's exactly what I meant. (I just requested greater permissions here to be able to do more testing before I comment.) |
@younnoh @pdurbin |
@younnoh no problem. I'm not sure what you mean by permissions but you should be able to test the language field over at https://demo.dataverse.org and see the same UI (and get the same UX) as the screenshot above. By the way, I started a thread on autocomplete here: https://dataverse.zulipchat.com/#narrow/stream/434038-ux-wg/topic/autocomplete.20for.20language/near/437500258 |
@pdurbin Thanks for the reminder about the demo Dataverse and the link to the thread on autocomplete. Autocomplete works well from the login page for the demo when you want to select your institution. (Sorry for any confusion caused by my previous comment. I just got around to creating an account for my institution's test Dataverse instance and will use it do further testing.) |
FWIW: The autocomplete in the author affiliation field is from an external vocabulary script using the https://ror.org vocab/api. The autocomplete for the lang field is Dataverse's internal one for CVVs listed in the metadata blocks. |
Is there a date by which it will be determined if (and when) ISO 639-3 will be adopted as citation metadata as part of an official release or update? I need to get back to my team on when we'll know to wait or to load as custom. Thanks! |
Sorry, I got distracted/forced to work on some urgent things instead. But I will finish it shortly (just wanted to populate the Identifier field in the CVV, otherwise leaving the PR as is; I believe it is ready to be merged otherwise). I suggest we take the discussion of whether, or how to support the full extended 639-3 list to #8578. Somebody else from the team is looking into it/some activity is happening there. |
…dates easier. Used the first 3-letter code as the identifier for each of the 185 supported languages. #8243
OK, populated the Identifier field for the supported languages. Moving back to "ready for QA"; will try to coax somebody to merge it soon-ish, maybe even tomorrow. |
QA: Verified that after the update all cvv's except "Not applicable" now have an id, that additional alternates have been added, that deleting from the db and re-loading the block recreates the same number of values in both tables, spot checked for specific values, verified that adding vals works in the UI, that editMetadata api works with the strValue of the cvv. I'll go ahead and merge. Note the one failed test is due to the know confirmEmail issue fixed in a different PR. |
What this PR does / why we need it:
This PR updates the Citation metadata block sanitizing the Controlled Vocabulary for the field "language". It replaces the draft PR #10197 made late last year.
The nature of the cleanup is as follows:
The CVV list is brought up to the following standard:
<dc:language>English</dc:language>
and<dc:language>en</dc:language>
in oai_dc are importable);<dc:language>eng</dc:language>
is also valid for the purposes of import;language Slovene 146 slv sl Slovenian
;Slovene
in the last example). This is to accommodate our model of metadata block updates which relies on matching on that name. If it ever becomes necessary to change any of the main names for whatever practical reason, we will have to do that via a Flyway update. The plan is to avoid that, at least for now; EDIT: We actually had a solution for this problem all along! The CVV entries in the block have a field reserved for the "identifier", that will match the value to its replacement, even with the main value changed. We apparently never used this field for the languages, but as of this PR, it is now populated (using the first 3-letter code available for the language as such permanent identifier). This means that going forward, we'll be able to change "Slovene" to "Slovenian", etc., if it ever becomes necessary.language Navajo, Navaho 109 nav nv
. For such cases both forms have been separately added as alternates, making eitherNavajo
orNavaho
importable on their own, as in:language Navajo, Navaho 109 nav nv Navajo Navaho
Which issue(s) this PR closes:
Closes #8243
Please note the issue is old and much of the information in the opening description already obsolete. The description above should be accurate.
Special notes for your reviewer:
Suggestions on how to test this:
I'm not sure what QA for this should be like, tbh. One could try and actually test that the extra language designations added in this PR are now importable; take a test json file - like the standard
dataset-create-new-all-default-fields.json
and try entries like... but a better test would likely be to look at the database entries for the DatasetFieldType "language" in the
controlledvocabularyvalue
andcontrolledvocabalternate
tables, before and after the metadatablock update, and confirm thata. the number of entries and the values in
controlledvocabularyvalue
are identical before and afterb. more values have been added to
controlledvocabalternate
, and all the values from before are still therec. that the end result (after) is identical to what's in these tables when the new block is installed on a brand new, empty database.
Not sure if it's worth going to such trouble - but figured I'd leave it here.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: