Create pre-filtered secondary indexes and add ability to automatically filter sensitive terms at query time #1108
API Developer Docs Preview: Ready https://wordpress.github.io/openverse-api/_preview/1108 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. |
This PR will need to include an update to the "Search Algorithm" documentation describing this new feature. |
This is awesome! I had a random thought about how to test it (and potentially other changes like this) against production data: What if we only applied this to searches with a "secret password" in the query? So, it'd work something like this in production:
This would allow us to test in production for some arbitrary period so we could make performance comparisons. Of course we can already do things like this with the API and user permissions, but this way allows us to quickly compare the same query against the same API or frontend instance with different functionality enabled. |
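A rough sketch of how the "secret password" toggle could work, assuming the password is just an extra word in the search query. The token value and the `split_experiment_flag` helper are hypothetical and not part of this PR; a real version would need to decide where in the request the token is read.

```python
from typing import Tuple

# Hypothetical token; anything that would never be a real search term.
SECRET_TOKEN = "opensesame"


def split_experiment_flag(q: str) -> Tuple[str, bool]:
    """Strip the secret token from the query and report whether it was present."""
    words = q.split()
    enabled = SECRET_TOKEN in words
    cleaned = " ".join(word for word in words if word != SECRET_TOKEN)
    return cleaned, enabled


# The cleaned query is what would be sent to Elasticsearch; the flag decides
# whether the experimental behaviour applies to this one request.
cleaned_query, use_experiment = split_experiment_flag("dog opensesame photo")
```

The same query can then be run with and without the token against the same instance to compare results and timings.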
@dhruvkb I was wondering if you could help me take a look at the ingestion server tests? I made some changes to the ordering so that it was easier to put new tests in the middle without having to update every other test's order number. That appears to be working fine, but no matter where I put the tests for creating the filtered index and pointing the alias in the order, it seems to cause subsequent tests that use the promote action to fail. I'm not entirely sure why. I'd appreciate any insight you might have into this issue. |
69a4756 to 00de216
16edff8 to d3789b4
This is so cool! I'm really excited to test this with more data 😮 The 'password' idea from @zackkrida sounds really interesting.
Is this PR meant as an exploration or do you intend to actively keep pushing this one? I have some questions but obviously not urgent if this is on the backburner.
I understand the filtered index in the ingestion server. If I search "dog photo" with the mature filter enabled, I will end up with zero results because everything matching my query is also in the index. I will not receive partial match results (meaning, things that match "photo" but not "dog"). The filtering also happens on every query including ones that don't intentionally query on a sensitive term: so if I search just "photo", this time I'll get lots of results, but I still won't see any dog photos (or water photos for that matter).
I am confused about the API filtering, though. It looks like it detects sensitive terms in the query params, and then excludes results that match only those terms? So, if one of the configured terms is "perched":
Is that interpretation of what it's supposed to do correct? Records matching sensitive terms defined in the API's list are only filtered when those terms actually appear in query params? I think I'm wrong about that, but I'm not sure what I'm missing. When I query |
I thought it could be merged, but I still have not received clarification from others on the team of whether they want that to happen (I asked during our retrospective). I haven't heard anyone say to "stop" working on this or that it shouldn't move forward, so assuming there aren't big problems with it, we could try it. Then again, if it's not something we think we would enable any time soon, then I should close the PR as an unmerged proof of concept to be referred back to later. Either way does not matter to me.
Your summary is incorrect, but the behaviour you're seeing is reproducible. Just to clarify the behaviour first, though: the code always applies the sensitive word filter to all queries, for all sensitive words, regardless of whether they appear in the query. In fact, it applies the filter in precisely the same way as the filtered index is produced, so the behaviour is (essentially) the same.
This is the part of the code that applies the filter: https://github.com/WordPress/openverse-api/pull/1108/files#diff-1f1af6f89cdc3071047abe1d692e5803c38df879f2da130bf178dcb444cb8e28R345-R347 It doesn't check any query params aside from whether the mature filter is disabled. None of its operation or implementation depends on any other query parameters or their values.
There was, however, a bug in the environment variable reading implementation. It wasn't applying any sensitive term filters at runtime because the cast was creating a generator, not a tuple. I've fixed this now and the behaviour you were seeing is no longer reproducible.
If you search "bird", you won't see any results for "perched". If you search "bird perched", you will also not see results for "perched". (Unless you disable the mature filter, to be clear.) When you search, if you look at the logs, you'll be able to see the multimatch queries being sent at all times, regardless of what the other terms of the query are. |
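For anyone revisiting this later, here is a minimal stdlib-only sketch of how a generator-versus-tuple mix-up can silently drop the terms. It is not the PR's actual cast; the environment value and variable names are assumptions for illustration. One plausible mechanism is that a generator is exhausted after its first pass, so any later read sees no terms at all.

```python
import os

# Hypothetical value for illustration; the real list lives in the environment.
os.environ["SENSITIVE_TERMS"] = "spoiled, perched"
raw = os.environ["SENSITIVE_TERMS"]

# Buggy shape: a generator expression can only be consumed once.
terms_gen = (term.strip() for term in raw.split(",") if term.strip())
first_pass = list(terms_gen)   # the terms are all here
second_pass = list(terms_gen)  # empty: the generator is already exhausted

# Fixed shape: a tuple is materialised once and can be reused by every request.
terms = tuple(term.strip() for term in raw.split(",") if term.strip())
```

With the tuple, every request iterates over the full list of terms, which matches the "filter is always applied" behaviour described above.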
d3789b4 to d884011
Closing this PR again as I don't have any idea whether anyone else wants this to move forward, and I do not feel confident about getting it reviewed and merged before the 17th when @dhruvkb will be doing the monorepo migration. |
Noted. Leaving the comment I was working on for posterity when this is revisited. For the record I think this is really exciting.
That makes way more sense 😅 I was trying to make sense of my test behavior and assumed the filters must do different things and work together somehow. The context I was missing/had forgotten was in the comments of one of the linked issues: they do the same thing, but the reason for also having filtering in the API is to allow for adding additional sensitive terms in a hypothetical emergency without a redeploy. Thank you for the explanation! I'm inclined to agree with your comment about the API filtering possibly not being needed, especially to your point on this issue about the deploys being fairly quick and getting even better 😄 That said, you've already done the work and it works great! |
Fixes
Related to WordPress/openverse#721 by @zackkrida
Fixes WordPress/openverse#750 by @obulat (at least potentially, based on loose decisions we made last week during offline chats)
Description
Adds a new settings variable, `SENSITIVE_TERMS`. It should be a comma-separated list of terms to exclude, parsed on application startup. This variable exists for both the ingestion server and the Django API.
In the ingestion server, the terms are used to create a filtered index via the reindex API. I've updated `load_sample_data.sh` to create this filtered index and allow for easy local testing. Note that the filtered index uses the terms "dog" and "water".
In the Django API, the terms are excluded from search via an inverted `MultiMatch` query. The approach is naive and may perform poorly in production. It may be necessary to connect a local box to the production Elasticsearch to try a query with "dog" excluded, for example, to see how it performs (or some other way to test, maybe using the staging Elasticsearch cluster, cc @AetherUnbound @obulat @krysal who may all have better ideas for how to safely test). I also don't know if this will scale if we have, say, 100 or so terms. I'd wager it should be fine, but it's a completely naive guess founded on essentially nothing other than trusting Elasticsearch's ability to aggregate such queries.
I don't know if this is the right approach. The significant alternative I can imagine is to store the list of terms in Postgres and cache their retrieval for 30 minutes to an hour (potentially busting the cache immediately when the model saves). This would allow us to change the terms without needing to redeploy. It makes sharing the list slightly harder because we'd have to give Django Admin access or export it from Django Admin. Leaving it as an environment variable allows the sharer to copy/paste the list out of the private infrastructure repository instead.
Production redeployments currently take ~10 minutes and are tedious, but they will be easier (and faster when staying on the same version) once the ECS migration is completed.
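To make the two uses of the terms concrete, here is a rough sketch using plain dictionaries in the shape of the Elasticsearch query DSL rather than the PR's actual code. The field list, the index names, and the helper names (`parse_terms`, `exclusion_query`, `reindex_body`) are assumptions for illustration only.

```python
from typing import Any, Dict, Sequence, Tuple

# Assumed searchable fields; the real list may differ.
FIELDS = ["title", "description", "tags.name"]


def parse_terms(raw: str) -> Tuple[str, ...]:
    """Parse a comma-separated SENSITIVE_TERMS value into a reusable tuple."""
    return tuple(term.strip() for term in raw.split(",") if term.strip())


def exclusion_query(terms: Sequence[str]) -> Dict[str, Any]:
    """One negated multi_match clause per term (the 'inverted' query)."""
    return {
        "bool": {
            "must_not": [
                {"multi_match": {"query": term, "fields": FIELDS}}
                for term in terms
            ]
        }
    }


def reindex_body(source: str, dest: str, terms: Sequence[str]) -> Dict[str, Any]:
    """Reindex API request body that copies only non-matching documents."""
    return {
        "source": {"index": source, "query": exclusion_query(terms)},
        "dest": {"index": dest},
    }


# Hypothetical index names; "dog" and "water" are the local sample terms.
body = reindex_body("image", "image-filtered", parse_terms("dog, water"))
```

The same `exclusion_query` shape serves both paths in this sketch: once at reindex time to build the filtered index, and once per request in the API, which is why the cost of roughly 100 negated clauses per query is the open performance question.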
Testing Instructions
By default, the local environment is set up with two excluded terms for the filtered index: "dog" and "water". For the Django API, the excluded terms are "spoiled" and "perched". I'd advise searching these terms on `main` first, before running this branch, and noting the result counts and such for related queries. I tested using images, so the following instructions only include specifics for images, but the same principles will apply for audio.
Make a query for "dog" and "water" and pick out a separate word that would hit that document. For example, "running" will include 2 "water" documents (amongst others of people running).
Afterwards, run this branch and make the same queries with `mature=True` and `mature=False` (the latter being the default). You should see different behaviour. When mature results are excluded, for "running", you should get 2 fewer documents, equivalent to searching "running -water". For "dogs" you should get no documents. When mature results are included via the query parameter, you'll receive the same results as on `main`.
Again, those are for the images index, but the same principles will apply for audio.
To test the API terms, follow the same pattern of testing, but use the terms listed above for the query-time feature instead of the filtered index feature.
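A small helper for generating the comparison URLs may save some typing. The base URL is an assumption about the local setup, so adjust it to wherever the API is served; `search_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Assumed local address of the images search endpoint.
BASE_URL = "http://localhost:8000/v1/images/"


def search_url(q: str, mature: bool) -> str:
    """Build a search URL for the given query and mature setting."""
    return f"{BASE_URL}?{urlencode({'q': q, 'mature': str(mature)})}"


# Print each test query with mature results excluded and included.
for query in ("running", "dogs", "spoiled", "perched"):
    print(search_url(query, mature=False))
    print(search_url(query, mature=True))
```

Run the same URLs on `main` and on this branch and compare the result counts.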
Developer Certificate of Origin