Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not all tags normalised to lower case #3342

Closed
sarayourfriend opened this issue Nov 12, 2023 · 8 comments
Closed

Not all tags normalised to lower case #3342

sarayourfriend opened this issue Nov 12, 2023 · 8 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: ingestion server Related to the ingestion/data refresh server 🔒 staff only Restricted to staff members

Comments

@sarayourfriend
Copy link
Collaborator

sarayourfriend commented Nov 12, 2023

Description

Tags should be normalised to lower case during cleanup. However, for some reason, some aren't. Here are an audio and image result with tags that aren't all lower case:

https://api.openverse.engineering/v1/audio/03ea0149-d9c5-47f3-97db-ed7f715b46af/
https://api.openverse.engineering/v1/images/604bc88f-ad55-4225-ae37-51dfce900a9b/

Here's some of the relevant code:

if "name" in tag and isinstance(tag["name"], str):
lower_tag = tag["name"].lower()

Looking at it further, and we have several works that do not have lowercased tags. I looked more at the code, and I think the tags cleanup might never run at all, so maybe just a bug in clean_image_data in the ingestion server.

However, also looking at it, and I can't see where we clean audio tags. Maybe there's a bigger problem here. @obulat any ideas?

Additional context

Marked as medium because it's a data quality issue, but I could see it being low as well. This only really starts to matter once collection queries are active because those use term queries for tags, which is the first time we'll have case and space sensitive queries against tags. This could be a blocker for that work, though, as it would cause unexpected behaviour ("dog" matching differently than "Dog").

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: ingestion server Related to the ingestion/data refresh server 🔒 staff only Restricted to staff members labels Nov 12, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Nov 12, 2023
@sarayourfriend
Copy link
Collaborator Author

Marking this staff only because it will probably require digging into production data a bit to see what's going on with that specific work.

@obulat
Copy link
Contributor

obulat commented Nov 13, 2023

The cleanup functions in the ingestion server were written as a hot-fix solution for cleaning up data before using it in the API. The main data cleaning should be done in the catalog. Since the audio was added after all of the data shape changes were added, we didn't think that it was necessary to run the clean up in the ingestion server for audio.

We could add a cleanup step for audio doing only the steps necessary for audio (i.e., lower-casing the tags).

A better alternative would be to run 4 steps:

  1. Temporarily add lower-casing to the elasticsearch_models in the ingestion server

parsed_tag = {"name": tag["name"]}

  1. Add tag clean up in the catalog's MediaStore:

  1. Run batch-update DAG to lower-case tag.name in the audio table. It should not take too long since audio table is much smaller than image.

  2. Remove the .lower() added in step 1.

@sarayourfriend
Copy link
Collaborator Author

sarayourfriend commented Nov 13, 2023

Run batch-update DAG to lower-case tag.name in the audio table. It should not take too long since audio table is much smaller than image.

Images are also affected. Would we need to batch update there as well?

For what it's worth, we run tags.name through our custom text analyser, which includes casing to lower case. It's nice that we can keep tags in their original casing from the provider while still querying them in a normalised fashion. It's especially nice to be able to present tags with special capitalisation as they were intended by the creator, for example, abbreviations, place and people names, etc.

In other words: is the correct way to deal with this discrepancy to actually normalise tags as saved, which affects search as well as presentation, or only to normalise them in the index, which only affects search but preserves presentation? My inclination is strongly towards only affecting search and to leave presentation as the creator of the work intended. If we want search to always use lower case, even in the keyword and raw field, then we can move the tags lower-casing into the elasticsearch model creation. That would also preserve presentation and would allow us to treat search for tags as lowercase normalised in all cases, not just the base tags.name text field.

I want to avoid rushing towards a solution here that will irrevocably change our data unless we're sure that it's the right solution, especially from an ethical/presentational standpoint. Creators (or cataloguers) choose specific casing for aspects of a work, we should seek to preserve that in our presentation, if we can, even if we haven't in the past.

@obulat
Copy link
Contributor

obulat commented Nov 13, 2023

The more I think about it, the more complicated it becomes.

The tags are often user-created, and have different casing. I agree with you that lower-casing the tags in the catalog will make us lose valuable information.

One detail I also realized is that for Turkish, this will cause a problem because Turkish I is not converted to the correct lowercase letter by Python (it should be ı without a dot, but it's converted to i, which is İ in upper case). And it is not possible to determine whether we should be using Turkish or English for lowercasing a tag since we don't know the language of each tag.

This makes me think that we should actually remove all of the lower casing from the tag field, both image and audio, in catalog and in the ingestion server. The tag.name.keyword will keep all of the capitalization details. And the tag.name text field will be analyzed, as you say, @sarayourfriend, so when we search within tags, we get the necessary analysis, and can get the benefits of stemming. For the tag terms query, though, we would be getting the exact match, with capitalization and encoding details. If the tag is misspelled, then the user will get the results for a misspelled tag only. We could, in the future, extract all individual tags and add a "suggestions" feature to suggest a similar tag if there are not enough results for the selected one.

@sarayourfriend
Copy link
Collaborator Author

One detail I also realized is that for Turkish, this will cause a problem because Turkish I is not converted to the correct lowercase letter by Python (it should be ı without a dot, but it's converted to i, which is İ in upper case). And it is not possible to determine whether we should be using Turkish or English for lowercasing a tag since we don't know the language of each tag.

I was wondering if there were any languages where casing would be significant in this way.

And the tag.name text field will be analyzed, as you say, @sarayourfriend, so when we search within tags, we get the necessary analysis, and can get the benefits of stemming

We can also have multiple tags text fields, like one where we don't run it through the lowercase analysis. That will probably be necessary, given what you've said about Turkish (I'm sure other languages have similar issues) when we do localised search at some point in the future.

@dhruvkb
Copy link
Member

dhruvkb commented Nov 14, 2023

I agree with tags should not be case normalised entirely because it encounters edge cases where normalisation can lose information contained in the casing of the characters (like when İ normalises incorrectly) and in the casing of the words (like for acronyms). If there is a way to have two fields or subfields like @sarayourfriend suggests, that seems better to me.

@AetherUnbound
Copy link
Collaborator

This makes me think that we should actually remove all of the lower casing from the tag field, both image and audio, in catalog and in the ingestion server. The tag.name.keyword will keep all of the capitalization details. And the tag.name text field will be analyzed, as you say, @sarayourfriend, so when we search within tags, we get the necessary analysis, and can get the benefits of stemming. For the tag terms query, though, we would be getting the exact match, with capitalization and encoding details. If the tag is misspelled, then the user will get the results for a misspelled tag only. We could, in the future, extract all individual tags and add a "suggestions" feature to suggest a similar tag if there are not enough results for the selected one.

We discussed this in the Make WP chat today, I think I'm totally on board with this approach - searches will benefit from Elasticsearch's stemming/analysis/normalization while the new tags view will be case specific. That seems ideal to me!

@obulat
Copy link
Contributor

obulat commented Nov 20, 2023

Since no change is necessary for this issue, I'm going to close it. Thank you everyone for your input in this discussion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: ingestion server Related to the ingestion/data refresh server 🔒 staff only Restricted to staff members
Projects
Archived in project
Development

No branches or pull requests

4 participants