Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search controller uses should for filtering instead of filter #3243

Closed
obulat opened this issue Oct 24, 2023 · 10 comments · Fixed by #3261
Closed

Search controller uses should for filtering instead of filter #3243

obulat opened this issue Oct 24, 2023 · 10 comments · Fixed by #3261
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🔧 tech: elasticsearch Involves Elasticsearch 🐍 tech: python Involves Python

Comments

@obulat
Copy link
Contributor

obulat commented Oct 24, 2023

Description

If you filter the search by license, the search controller creates the following query:

 "filter": [ { "bool": { "should": [
              { "terms": {
                  "license": ["by", "by-sa", "by-nd", "by-nc", "by-nc-nd"],
                  "boost": 1.0
                }  }
            ] } },
       ...
      ],

There are two problems with it:

  • it uses should, which means that the results with other licenses can be returned (it's not a mandatory filter)
  • it uses a boost parameter which is unnecessary: instead of increasing the score of the item that has the required licenses, we should just filter out all of the items that don't have the required license.

Solution

We should replace should with filter here:

if serializer_field in search_params.data:
arguments = search_params.data.get(serializer_field)
if arguments is None:
return s
arguments = arguments.split(",")
parameter = es_field or serializer_field
query = Q("terms", **{parameter: arguments})
method = getattr(s, behaviour)
return method("bool", should=query)
return s

should be:

method = getattr(s, behaviour) 
return method(query) 

This will use elasticsearch_dsl's filter or exclude methods to create filter or must_not clauses.

Reproduction

  1. Add a line `DEBUG_SCORES=True` to the `api/.env` file
  2. Search for items with a license filter: api.localhost:50280/v1/images/?license=by,by-sa
  3. Check the logs for the `web` docker container. It should show the query similar to the one in this issue.
@obulat obulat added help wanted Open to participation from the community 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🐍 tech: python Involves Python 🧱 stack: api Related to the Django API 🔧 tech: elasticsearch Involves Elasticsearch labels Oct 24, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Oct 24, 2023
@obulat obulat moved this from 📋 Backlog to 📅 To do in Openverse Backlog Oct 24, 2023
@AetherUnbound AetherUnbound added the good first issue New-contributor friendly label Oct 24, 2023
@ngken0995
Copy link
Collaborator

@obulat @AetherUnbound Can I work on this?

@obulat
Copy link
Contributor Author

obulat commented Oct 24, 2023

@obulat @AetherUnbound Can I work on this?

Yes, please! :)
Note that I removed the part about nesting because it was incorrect.

@obulat obulat changed the title Search controller uses should for filtering instead of must Search controller uses should for filtering instead of filter Oct 25, 2023
@obulat
Copy link
Contributor Author

obulat commented Oct 25, 2023

@ngken0995, I updated the issue again, replacing must with filter because must would be incorrect here: it uses a "query context" and calculates a score for the matches instead of simply filtering out the non-matching results.

@ngken0995
Copy link
Collaborator

ngken0995 commented Oct 25, 2023

  1. Search for items with a license filter: api.localhost:50280/v1/images/?license=by,by-sa

@obulat I have add DEBUG_SCORES=True to the api/.env file. Where do I input api.localhost:50280/v1/images/?license=by,by-sa. Should it be in a web browser, postman or docker?

@obulat
Copy link
Contributor Author

obulat commented Oct 26, 2023

@ngken0995, you can use any browser to open this address: http://api.localhost:50280/v1/images/?license=by,by-sa.

@ngken0995
Copy link
Collaborator

ngken0995 commented Oct 26, 2023

@obulat

run just api/up
input url http://api.localhost:50280/v1/images/?license=by,by-sa in firefox url.

I get this response below:
image

@obulat
Copy link
Contributor Author

obulat commented Oct 26, 2023

@ngken0995 , could you run just env to create the .env files in the repository, and then check that your api/.env file contains a line like this:

ALLOWED_HOSTS=localhost,172.17.0.1,host.docker.internal

@obulat obulat moved this from 📅 To do to 🏗 In progress in Openverse Backlog Oct 26, 2023
@obulat
Copy link
Contributor Author

obulat commented Oct 26, 2023

@ngken0995, I know you've started working on this issue, but I think I've created a better and more comprehensive solution to the ES search query building. Would it be okay if I assigned this issue to myself? Sorry again for not telling this to you in advance.

@obulat obulat mentioned this issue Oct 26, 2023
8 tasks
@ngken0995
Copy link
Collaborator

@obulat Go ahead on creating a PR. I was stuck on reproducing the error. I'm trying to understand the system design on openverse. First, I want to create a PR related to catalog to learn about DAGs, airflow, postgres and etc. Second, drive into ingestion server. Then I saw this issue which could help me understand ES. Could I reach out to you if I have any questions about catalog or ingestion server?

@obulat
Copy link
Contributor Author

obulat commented Oct 26, 2023

@obulat Go ahead on creating a PR. I was stuck on reproducing the error. I'm trying to understand the system design on openverse. First, I want to create a PR related to catalog to learn about DAGs, airflow, postgres and etc. Second, drive into ingestion server. Then I saw this issue which could help me understand ES. Could I reach out to you if I have any questions about catalog or ingestion server?

Thank you, @ngken0995!
You are grasping all of the stack really quickly 🚀
Sure, please feel free to reach out to me with questions either on GitHub or on Make Slack, I'd be happy to answer them!

@obulat obulat assigned obulat and unassigned ngken0995 Oct 26, 2023
@obulat obulat removed good first issue New-contributor friendly help wanted Open to participation from the community labels Oct 27, 2023
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Openverse Backlog Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🔧 tech: elasticsearch Involves Elasticsearch 🐍 tech: python Involves Python
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants