Not all tags normalised to lower case #3342
Comments
Marking this staff only because it will probably require digging into production data a bit to see what's going on with that specific work.
The We could add a A better alternative would be to run 4 steps:
Images are also affected. Would we need to batch update there as well? For what it's worth, we run

In other words: is the correct way to deal with this discrepancy to actually normalise tags as saved, which affects search as well as presentation, or only to normalise them in the index, which only affects search but preserves presentation? My inclination is strongly towards only affecting search and leaving presentation as the creator of the work intended. If we want search to always use lower case, even in the

I want to avoid rushing towards a solution here that will irrevocably change our data unless we're sure that it's the right solution, especially from an ethical/presentational standpoint. Creators (or cataloguers) choose specific casing for aspects of a work; we should seek to preserve that in our presentation if we can, even if we haven't in the past.
The more I think about it, the more complicated it becomes. The tags are often user-created, and have different casing. I agree with you that lower-casing the tags in the catalog will make us lose valuable information. One detail I also realized is that for Turkish, this will cause a problem because Turkish

This makes me think that we should actually remove all of the lower casing from the tag field, both image and audio, in catalog and in the ingestion server. The
I was wondering if there were any languages where casing would be significant in this way.
We can also have multiple tags text fields, like one where we don't run it through the lowercase analysis. That will probably be necessary, given what you've said about Turkish (I'm sure other languages have similar issues) when we do localised search at some point in the future.
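For illustration only, here is a rough sketch of what "multiple tags text fields" could look like as an Elasticsearch multi-field mapping, written as a Python dict. The field path `tags.name`, the sub-field name `keyword`, the `english` analyzer choice, and both helper functions are assumptions made for this example, not the project's actual index configuration.

```python
import json

# Hypothetical mapping fragment. The field and sub-field names are assumptions
# for illustration and are not taken from the real index.
TAGS_MAPPING = {
    "properties": {
        "tags": {
            "properties": {
                "name": {
                    # Analysed field: tokenised, lowercased and stemmed for search.
                    "type": "text",
                    "analyzer": "english",
                    "fields": {
                        # Unanalysed sub-field: stores the tag exactly as written.
                        "keyword": {"type": "keyword"},
                    },
                }
            }
        }
    }
}


def exact_tag_query(tag: str) -> dict:
    """Build a case- and space-sensitive term query against the raw sub-field."""
    return {"term": {"tags.name.keyword": tag}}


def analysed_tag_query(text: str) -> dict:
    """Build a match query that goes through the analysed field."""
    return {"match": {"tags.name": text}}


print(json.dumps(exact_tag_query("İstanbul"), ensure_ascii=False))
print(json.dumps(analysed_tag_query("istanbul")))
```

With a layout like this, regular search would keep the benefits of analysis while an exact-tag view could query the unanalysed sub-field, so nothing stored has to be rewritten.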
I agree that tags should not be case-normalised at all, because normalisation runs into edge cases where it loses information carried by the casing of characters (like when İ normalises incorrectly) and by the casing of words (like for acronyms). If there is a way to have two fields or subfields like @sarayourfriend suggests, that seems better to me.
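To make the edge cases above concrete, here is a small sketch using only Python's built-in string methods. `normalise_tag` is a hypothetical helper written for this example; it is not the ingestion server's cleanup code.

```python
# Turkish dotted capital I (U+0130). Python's locale-independent lowercasing
# maps it to "i" followed by a combining dot above (U+0307), not to a plain "i".
dotted_capital_i = "\u0130"  # İ
lowered = dotted_capital_i.lower()
print([hex(ord(ch)) for ch in lowered])  # ['0x69', '0x307']

# Turkish expects "I" to lowercase to dotless "ı" (U+0131); the default
# mapping produces "i" instead.
print("I".lower() == "\u0131")  # False


def normalise_tag(tag: str) -> str:
    """Hypothetical normalisation: trim surrounding whitespace and lowercase."""
    return tag.strip().lower()


# Distinct tags, including an acronym, collapse into one value once lowercased.
tags = ["US", "us", "Us"]
print({normalise_tag(t) for t in tags})  # {'us'}
```

Once the stored value has been lowercased, the original casing cannot be recovered, which is why normalising only on the search side is the less risky option.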
We discussed this in the Make WP chat today. I think I'm totally on board with this approach: searches will benefit from Elasticsearch's stemming/analysis/normalization while the new tags view will be case specific. That seems ideal to me!
Since no change is necessary for this issue, I'm going to close it. Thank you everyone for your input in this discussion!
Description
Tags should be normalised to lower case during cleanup. However, for some reason, some aren't. Here are an audio and image result with tags that aren't all lower case:
https://api.openverse.engineering/v1/audio/03ea0149-d9c5-47f3-97db-ed7f715b46af/
https://api.openverse.engineering/v1/images/604bc88f-ad55-4225-ae37-51dfce900a9b/
Here's some of the relevant code:
openverse/ingestion_server/ingestion_server/cleanup.py
Lines 129 to 130 in 680122f
Looking at it further, we have several works that do not have lowercased tags. I looked more at the code, and I think the tags cleanup might never run at all, so this may just be a bug in clean_image_data in the ingestion server. However, I also can't see where we clean audio tags. Maybe there's a bigger problem here. @obulat any ideas?
Additional context
Marked as medium because it's a data quality issue, but I could see it being low as well. This only really starts to matter once collection queries are active because those use term queries for tags, which is the first time we'll have case and space sensitive queries against tags. This could be a blocker for that work, though, as it would cause unexpected behaviour ("dog" matching differently than "Dog").
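As a rough illustration of the behaviour described above, the following sketch simulates exact (term-style) matching against stored tags versus matching after lowercasing. It does not call Elasticsearch; `term_match`, `lowercased_match`, and the sample tags are made up for this example.

```python
def term_match(indexed_tags: list[str], query: str) -> bool:
    """Exact comparison, similar in spirit to a term query on an unanalysed field."""
    return query in indexed_tags


def lowercased_match(indexed_tags: list[str], query: str) -> bool:
    """Comparison after lowercasing both sides, as a lowercase filter would do."""
    return query.lower() in {tag.lower() for tag in indexed_tags}


# Hypothetical tags stored with their original casing.
work_tags = ["Dog", "Golden Retriever"]

print(term_match(work_tags, "dog"))                 # False
print(term_match(work_tags, "Dog"))                 # True
print(term_match(work_tags, "Golden  Retriever"))   # False (extra space)
print(lowercased_match(work_tags, "dog"))           # True
```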