-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move "by" tag contains filter to tag exact match filter #4464
Comments
👍 to In line with the discussion of #4417 and into #4456, if we will soon have the ability to filter tags later in our data lifecycle, it'd be great to not do any kind of wholesale exclusion of tags at this point. My understanding of some of the requirements of #4456, in line with the requirements of #4417 not to include certain tags in search but keep them in our data, would lead me to think that excluding these tags at all is against that trend? I suppose I'm making an assumption about how those discussions will play out. If we move this later on, though, we'll have greater flexibility to play with what tags we use for search, without precluding the possibility of tags not directly used in search still being useful. Perhaps the existence of some of these tags, even if they aren't part of the searched text, is a useful indicator of some quality of the work that would be relevant for some subset of searches (if not all searches). If the task for this issue is just to move this check into the other set, fine for now, but perhaps an additional issue would be good to also stop throwing these tags away at ingestion time (if that's how they're being excluded). |
I was about to ask exactly this because, from sync discussions, I had understood that we wanted to preserve tags with low accuracy and move forward with cleaning deny-listed tags in the catalog since we're already excluding them in new records. Hence, I created #4453 (blocked while this is clarified). If the desire is to keep everything, then #4453 would not be necessary. |
We are moving towards that, I assume Madison opened this issue as something we can resolve as a separate problem, rather than also solving it in that issue (or waiting to solve it after that issue). As things are, we are incorrectly excluding the types of tags Madison mentioned in the issue. Why block this on overall changes? |
Perfect, that is all the confirmation I needed. Thanks!
I wasn't referring to blocking this issue (#4464) but #4453, which is unnecessary, so it's closed now. Sorry for the misunderstanding. I agree with moving the |
I'd like to work on this |
Thanks @sarayourfriend and @krysal for piping in - yes, I agree that we should stop filtering tags at ingestion in the future once we move the filtering out of the ingestion server (see #4524). I'll make an issue for follow-up! Edit: issue made - #4542 |
Description
Looking over our transformations, it seems we have two separate tag filtering steps that happen on the
MediaStore
class: full string matching and substring matching. The latter case, however, has the valueby
as a substring match option. Whereas all of the other options feel fairly unique and pointed, the valueby
could be part of so many valid tag possibilities that it seems like it should be part of the full string matching instead. Examples being: "by the sea", "colby jack", "byte array", etc.We should move this value from
TAG_CONTAINS_DENYLIST
toTAG_DENYLIST
in this file:openverse/catalog/dags/common/storage/media.py
Lines 15 to 37 in d500b77
The text was updated successfully, but these errors were encountered: