Sensitive terms list produces too many clauses for create filtered index call #2328
This is unfortunate. Increasing the limit to even the ES8 default won't help us. As I understand the way the "clauses" calculation works, based on the SO link you shared @AetherUnbound, it's calculated dynamically to avoid memory exhaustion. I can think of at least three options: one of them is horrific but I think would work as a hacky workaround; the other would be a lot of work but probably more stable; and the last one is a big unknown, especially because we don't know exactly how to manage updating the cluster settings (maybe it's time to reach out for help provisioning an updated ES8 cluster).
@zackkrida and @AetherUnbound, what do y'all think? Any ideas on which y'all would prefer?
Considering that Elasticsearch 7 might go EOL by August 2023, I think we should prioritize migrating to ES8 anyways. I might not be very familiar with the context here, but I think reducing the number of terms is necessary. Creating a DAG that ingests sensitive-media information from the providers might improve the detection of sensitive media, even if it does not help with this specific issue.
My initial suggestion would be that we try a test on the staging cluster where we dramatically increase the max clause count setting. I believe this is a per-index setting, based on the docs, so perhaps it can be updated without needing to restart the cluster? Since this will be executed on a single, scheduled query, I suspect it won't be nearly as resource intensive as most use cases for this setting 🤞
Which part of the docs says that it is a per-index setting? We may be able to use this endpoint to update the setting without restarting, though: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-update-settings.html

However, I don't think updating the setting to 5799 (the current clause count of our query: 1933 terms × 3 fields) will be successful. If you read the ES PR I linked, where they changed the calculation to be dynamic, you can see the limit is quite precisely calculated to avoid OOM exceptions. I don't know what happens if we cause an OOM in our staging cluster.
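For illustration, a minimal sketch of what trying that endpoint could look like with the Python ES client (the endpoint is a placeholder, and note the ES7 docs list this as a static node setting, so the call may well be rejected):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder staging endpoint

# indices.query.bool.max_clause_count is documented as a *static* node
# setting in ES7, so a transient update like this is expected to fail;
# attempting it is still a cheap way to confirm before planning restarts.
resp = es.cluster.put_settings(
    body={"transient": {"indices.query.bool.max_clause_count": 5799}}
)
print(resp)
```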
My understanding is that this setting has to do with whether the actual query even fits into memory in a single thread, at least based on that PR I linked. I'd be surprised if being a non-regular query makes a difference to that characteristic. I think it would be good to consult with some folks who understand ES better than we do at this point, rather than potentially trashing our staging cluster with OOMs. At least, I don't know what happens to the cluster if we do that, and I would prefer to avoid needing to rebuild it.
@sarayourfriend I had a thought related to this. These indexes use the same stemmer config as the previous/main/normal indices do, right? I think that would mean that every phrase in our sensitive terms list gets stemmed, so something like "the foxes jumping quickly" would become [ the, fox, jump, quickli ]. We could potentially leverage this to remove redundancy from the sensitive terms list and get the number of terms down: our sensitive terms list would become a list of stems, and we'd probably be able to remove many duplicates. The only place it doesn't work well is with the numeric substitutions (d0g) and spelling variants that omit letters.

Edit: I just tested this locally by stemming our entire terms list. The list of 1933 terms produces 1533 unique stems. Alarmingly, a bunch of these stems are 1-2 letters, like "y", "z", "w", "up". Also innocent words like "love". This makes me wonder: is the sensitive index really matching on these stems?! It seems like this would mean there are many, many false positives. Apologies if the index or the multimatch query doesn't use stemming for some reason and this tangent is a huge waste of time.
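For reference, a rough sketch of the kind of local stemming experiment described above, assuming NLTK's Porter stemmer (the actual stemmer used isn't named in this thread):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical local copy of the sensitive terms list, one term per line.
with open("sensitive_terms.txt") as f:
    terms = [line.strip() for line in f if line.strip()]

# Stem each word of each (possibly multi-word) term, then deduplicate.
stems = {" ".join(stemmer.stem(word) for word in term.split()) for term in terms}

print(f"{len(terms)} terms -> {len(stems)} unique stems")
# Surface the suspiciously short stems mentioned above.
print(sorted(s for s in stems if len(s) <= 2))
```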
We quote the terms specifically so that they do not get stemmed, for this reason: there are too many opportunities for false positives. Multimatch can stem; we just explicitly prevent it from happening by quoting the terms.
If the second number was way lower, it might be useful, if we could identify specific stems that did not overlap with innocuous terms. Right now my preference is for us to try the first approach: identify the top 341 terms that match the most results (and confirm that there aren't any potential big false positives in the list) and start off by using just those. 341 comes from 1024 / 3: the clause limit divided by the number of fields we query. I don't have time to run this experiment right now, but I could start working on it next week. This week I need to finish #2343 and get #2332 merged. I could delay starting on #1969 this week in favour of this; considering this project is already >50% implemented, it probably makes sense to do so. Let me know what you think about this prioritisation @zackkrida, or if you have someone else in mind who could work on this sooner.
@sarayourfriend ah, of course, that's what the quoting is for 😌 I think the approach of significantly reducing the list is a sound one. By nature, this project is about having the biggest impact on safety possible while understanding it isn't going to capture everything, so working within this limitation feels fine. I do think it would make sense to wait on #1969 to focus on this project, which is closer to completion. The terms experiment seems like it'd be pretty straightforward (write a one-off Python script, maybe as a DAG, to iterate through the terms, query ES, and print the record count of each term), but is the amount of time it would take to run feasible? It would have to iterate through all the records 1933 times, right? I suppose if batches of the queries were parallelized it might be faster... Edit: wait, they're just queries, not reindexing, so it would be much faster than I was originally thinking. Anyway, I think that prioritization makes sense, thanks for confirming.
Yup! Individual Multimatch queries should be very fast. I'll just do it in a local Python script (which I'll share in a gist here) and share the results as well. Thanks for the help prioritising this. I'll do a bit of proofreading and then I suspect I should be able to have this in a semi-workable state by my afternoon 🚀
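A sketch of what such a script could look like, assuming the Python ES client; the index and field names below are placeholders, not the real Openverse ones:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
FIELDS = ["title", "description", "tags.name"]  # assumed three fields

with open("sensitive_terms.txt") as f:  # hypothetical local copy of the list
    terms = [line.strip() for line in f if line.strip()]

counts = {}
for term in terms:
    # A phrase-type multi_match approximates the quoted-term matching
    # described above; each request is a cheap count, not a reindex.
    resp = es.count(
        index="image",
        body={
            "query": {
                "multi_match": {
                    "query": term,
                    "fields": FIELDS,
                    "type": "phrase",
                }
            }
        },
    )
    counts[term] = resp["count"]

# Keep the 341 highest-matching terms (1024 clause limit / 3 fields).
top_341 = sorted(counts, key=counts.get, reverse=True)[:341]
print(top_341)
```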
@sarayourfriend I got a little excited and just tried an experiment myself. I wrote up some initial findings in the sensitive terms repo (warning: contains sensitive text, naturally): WordPress/openverse-sensitive-terms#3
Thanks! I left a comment on that issue as well after doing a bit of research based on what you found, and we might be able to solve this clause issue by switching to a terms query against new, non-analysed versions of the three fields. I'll update here later if that's possible. It would also solve the problem in the issue you've linked. |
After further investigation, it does look like Zack and I have found an important change that also fixes this issue. Once a text field is analysed by ES, you can only query the tokenised version; that's to say, there's no way to do "exact matches" on an analysed field. In the issue Zack linked (issue 3 in the sensitive terms repository), we've shown that adding unanalysed versions of the three fields we query for sensitive terms, and then switching to ES terms queries against them, sidesteps the clause limit. The solution for this issue, therefore, is to add unanalysed versions of those three fields to the index mappings, e.g.:
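(The exact Openverse field names aren't given in this thread; `title`, `description`, and `tags.name` below are assumptions, using standard ES multi-fields to add unanalysed `keyword` sub-fields.)

```python
# Hypothetical mapping fragment: keep the analysed text fields for regular
# search, and add "raw" keyword sub-fields for exact, unanalysed matching.
RAW = {"raw": {"type": "keyword"}}

mappings = {
    "properties": {
        "title": {"type": "text", "fields": RAW},
        "description": {"type": "text", "fields": RAW},
        "tags": {"properties": {"name": {"type": "text", "fields": RAW}}},
    }
}
```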
Additionally, we need to update the query the ingestion server uses to create the filtered index, e.g.:
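(Again a sketch with assumed field names; the point is that one `terms` clause per raw field replaces one clause per term per field, taking the clause count from 1933 × 3 = 5799 down to 3.)

```python
sensitive_terms = [...]  # the ~1933-entry list, loaded elsewhere

# Exclude documents with an exact sensitive term in any of the raw fields.
filtered_index_query = {
    "bool": {
        "must_not": [
            {"terms": {"title.raw": sensitive_terms}},
            {"terms": {"description.raw": sensitive_terms}},
            {"terms": {"tags.name.raw": sensitive_terms}},
        ]
    }
}
```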
In a related issue, this would also fix exact queries in the API if we switched to querying the raw fields and using a terms query when the query string starts and ends with a double quote.
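(A purely hypothetical sketch of that API-side change; the helper name and fields are made up for illustration.)

```python
def build_search_query(q: str) -> dict:
    """Route fully double-quoted query strings to the unanalysed raw fields."""
    if len(q) >= 2 and q.startswith('"') and q.endswith('"'):
        exact = q[1:-1]
        # term queries on keyword sub-fields match the stored value verbatim
        return {
            "bool": {
                "should": [
                    {"term": {"title.raw": exact}},
                    {"term": {"description.raw": exact}},
                ]
            }
        }
    return {"multi_match": {"query": q, "fields": ["title", "description"]}}
```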
Airflow log link
https://airflow.openverse.engineering/log?execution_date=2023-06-04T00%3A32%3A02.977074%2B00%3A00&task_id=wait_for_create_and_populate_filtered_index&dag_id=create_filtered_audio_index&map_index=-1
Exception from the ingestion server:
Description
The most recent audio data refresh that ran over the weekend was the first run using the new sensitive terms list (see https://github.com/WordPress/openverse-infrastructure/pull/522). This sensitive terms list (not linked due to its sensitive nature) is almost 2k items long.
The default clause count in ES7 is 1024, but it was apparently raised to 4096 in ES8.
We will need to either find a workaround or increase this configuration value for the cluster in order to proceed.
DAG status
Left enabled since pausing it may affect the data refresh itself.