Not all tags normalised to lower case #3342

sarayourfriend · 2023-11-12T22:23:55Z

Description

Tags should be normalised to lower case during cleanup. However, for some reason, some aren't. Here are an audio and image result with tags that aren't all lower case:

https://api.openverse.engineering/v1/audio/03ea0149-d9c5-47f3-97db-ed7f715b46af/
https://api.openverse.engineering/v1/images/604bc88f-ad55-4225-ae37-51dfce900a9b/

Here's some of the relevant code:

openverse/ingestion_server/ingestion_server/cleanup.py

Lines 129 to 130 in 680122f

    
           if "name" in tag and isinstance(tag["name"], str): 
        
               lower_tag = tag["name"].lower()

Looking at it further, and we have several works that do not have lowercased tags. I looked more at the code, and I think the tags cleanup might never run at all, so maybe just a bug in clean_image_data in the ingestion server.

However, also looking at it, and I can't see where we clean audio tags. Maybe there's a bigger problem here. @obulat any ideas?

Additional context

Marked as medium because it's a data quality issue, but I could see it being low as well. This only really starts to matter once collection queries are active because those use term queries for tags, which is the first time we'll have case and space sensitive queries against tags. This could be a blocker for that work, though, as it would cause unexpected behaviour ("dog" matching differently than "Dog").

The text was updated successfully, but these errors were encountered:

sarayourfriend · 2023-11-12T22:24:22Z

Marking this staff only because it will probably require digging into production data a bit to see what's going on with that specific work.

obulat · 2023-11-13T03:14:26Z

The cleanup functions in the ingestion server were written as a hot-fix solution for cleaning up data before using it in the API. The main data cleaning should be done in the catalog. Since the audio was added after all of the data shape changes were added, we didn't think that it was necessary to run the clean up in the ingestion server for audio.

We could add a cleanup step for audio doing only the steps necessary for audio (i.e., lower-casing the tags).

A better alternative would be to run 4 steps:

Temporarily add lower-casing to the elasticsearch_models in the ingestion server

openverse/ingestion_server/ingestion_server/elasticsearch_models.py

Line 202 in 37072c9

parsed_tag = {"name": tag["name"]}

Add tag clean up in the catalog's MediaStore:

openverse/catalog/dags/common/storage/media.py

Line 301 in 37072c9

return tag

Run batch-update DAG to lower-case tag.name in the audio table. It should not take too long since audio table is much smaller than image.
Remove the .lower() added in step 1.

sarayourfriend · 2023-11-13T04:49:54Z

Run batch-update DAG to lower-case tag.name in the audio table. It should not take too long since audio table is much smaller than image.

Images are also affected. Would we need to batch update there as well?

For what it's worth, we run tags.name through our custom text analyser, which includes casing to lower case. It's nice that we can keep tags in their original casing from the provider while still querying them in a normalised fashion. It's especially nice to be able to present tags with special capitalisation as they were intended by the creator, for example, abbreviations, place and people names, etc.

In other words: is the correct way to deal with this discrepancy to actually normalise tags as saved, which affects search as well as presentation, or only to normalise them in the index, which only affects search but preserves presentation? My inclination is strongly towards only affecting search and to leave presentation as the creator of the work intended. If we want search to always use lower case, even in the keyword and raw field, then we can move the tags lower-casing into the elasticsearch model creation. That would also preserve presentation and would allow us to treat search for tags as lowercase normalised in all cases, not just the base tags.name text field.

I want to avoid rushing towards a solution here that will irrevocably change our data unless we're sure that it's the right solution, especially from an ethical/presentational standpoint. Creators (or cataloguers) choose specific casing for aspects of a work, we should seek to preserve that in our presentation, if we can, even if we haven't in the past.

obulat · 2023-11-13T05:41:10Z

The more I think about it, the more complicated it becomes.

The tags are often user-created, and have different casing. I agree with you that lower-casing the tags in the catalog will make us lose valuable information.

One detail I also realized is that for Turkish, this will cause a problem because Turkish I is not converted to the correct lowercase letter by Python (it should be ı without a dot, but it's converted to i, which is İ in upper case). And it is not possible to determine whether we should be using Turkish or English for lowercasing a tag since we don't know the language of each tag.

This makes me think that we should actually remove all of the lower casing from the tag field, both image and audio, in catalog and in the ingestion server. The tag.name.keyword will keep all of the capitalization details. And the tag.name text field will be analyzed, as you say, @sarayourfriend, so when we search within tags, we get the necessary analysis, and can get the benefits of stemming. For the tag terms query, though, we would be getting the exact match, with capitalization and encoding details. If the tag is misspelled, then the user will get the results for a misspelled tag only. We could, in the future, extract all individual tags and add a "suggestions" feature to suggest a similar tag if there are not enough results for the selected one.

sarayourfriend · 2023-11-13T21:41:22Z

One detail I also realized is that for Turkish, this will cause a problem because Turkish I is not converted to the correct lowercase letter by Python (it should be ı without a dot, but it's converted to i, which is İ in upper case). And it is not possible to determine whether we should be using Turkish or English for lowercasing a tag since we don't know the language of each tag.

I was wondering if there were any languages where casing would be significant in this way.

And the tag.name text field will be analyzed, as you say, @sarayourfriend, so when we search within tags, we get the necessary analysis, and can get the benefits of stemming

We can also have multiple tags text fields, like one where we don't run it through the lowercase analysis. That will probably be necessary, given what you've said about Turkish (I'm sure other languages have similar issues) when we do localised search at some point in the future.

dhruvkb · 2023-11-14T15:39:45Z

I agree with tags should not be case normalised entirely because it encounters edge cases where normalisation can lose information contained in the casing of the characters (like when İ normalises incorrectly) and in the casing of the words (like for acronyms). If there is a way to have two fields or subfields like @sarayourfriend suggests, that seems better to me.

AetherUnbound · 2023-11-14T15:45:06Z

This makes me think that we should actually remove all of the lower casing from the tag field, both image and audio, in catalog and in the ingestion server. The tag.name.keyword will keep all of the capitalization details. And the tag.name text field will be analyzed, as you say, @sarayourfriend, so when we search within tags, we get the necessary analysis, and can get the benefits of stemming. For the tag terms query, though, we would be getting the exact match, with capitalization and encoding details. If the tag is misspelled, then the user will get the results for a misspelled tag only. We could, in the future, extract all individual tags and add a "suggestions" feature to suggest a similar tag if there are not enough results for the selected one.

We discussed this in the Make WP chat today, I think I'm totally on board with this approach - searches will benefit from Elasticsearch's stemming/analysis/normalization while the new tags view will be case specific. That seems ideal to me!

obulat · 2023-11-20T09:19:44Z

Since no change is necessary for this issue, I'm going to close it. Thank you everyone for your input in this discussion!

openverse-bot added this to Openverse Backlog Nov 12, 2023

github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Nov 12, 2023

sarayourfriend mentioned this issue Nov 12, 2023

Add API routes and controllers for additional search views #2853

Merged

8 tasks

obulat mentioned this issue Nov 13, 2023

Lowercase media tags in MediaStore #3345

Closed

8 tasks

obulat mentioned this issue Nov 18, 2023

Additional search views for the frontend #410

Closed

6 tasks

obulat closed this as completed Nov 20, 2023

github-project-automation bot moved this from 📋 Backlog to ✅ Done in Openverse Backlog Nov 20, 2023

obulat mentioned this issue Dec 3, 2023

Use property-based testing for the API #3446

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not all tags normalised to lower case #3342

Not all tags normalised to lower case #3342

sarayourfriend commented Nov 12, 2023 •

edited

Loading

sarayourfriend commented Nov 12, 2023

obulat commented Nov 13, 2023

sarayourfriend commented Nov 13, 2023 •

edited

Loading

obulat commented Nov 13, 2023

sarayourfriend commented Nov 13, 2023

dhruvkb commented Nov 14, 2023 •

edited

Loading

AetherUnbound commented Nov 14, 2023

obulat commented Nov 20, 2023

Not all tags normalised to lower case #3342

Not all tags normalised to lower case #3342

Comments

sarayourfriend commented Nov 12, 2023 • edited Loading

Description

Additional context

sarayourfriend commented Nov 12, 2023

obulat commented Nov 13, 2023

sarayourfriend commented Nov 13, 2023 • edited Loading

obulat commented Nov 13, 2023

sarayourfriend commented Nov 13, 2023

dhruvkb commented Nov 14, 2023 • edited Loading

AetherUnbound commented Nov 14, 2023

obulat commented Nov 20, 2023

sarayourfriend commented Nov 12, 2023 •

edited

Loading

sarayourfriend commented Nov 13, 2023 •

edited

Loading

dhruvkb commented Nov 14, 2023 •

edited

Loading