Add API routes and controllers for additional search views #2853

Merged
merged 16 commits into from
Nov 16, 2023

Conversation

obulat
Contributor

@obulat obulat commented Aug 21, 2023

Fixes

Fixes #2849 by @obulat
Fixes #2850 by @obulat

Note

This PR was changed a lot after reviews, mainly removing the query parameters, so I wrote a new description in this comment. The old description is left here for the record.

Description

This PR adds the endpoints for additional search views for both audio and image. The endpoints start with /v1/images or /v1/audio:

  • Tag view: /tag/tag_name
  • Source view: /source/source_name
  • Creator view: /source/source_name/creator/creator_name

These endpoints return paginated results because they accept the page query parameter. They also support query parameters such as license, license_type, category, aspect_ratio and size (image only), and length and peaks (audio only). Currently, the frontend plans and designs do not allow for querying the collections using these filter parameters, but we can add this feature later.
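
As a rough illustration of the three route shapes listed above, the sketch below assembles collection paths from their parts. The helper name `collection_path` and its signature are hypothetical and not part of the PR; only the path layout and the `page` parameter come from the description, and the values are inserted as-is without any escaping.

```python
from urllib.parse import urlencode

API_PREFIX = "/v1"  # taken from the description; the host is left out


def collection_path(media_type, *, tag=None, source=None, creator=None, **query):
    """Return the path of a tag, source, or creator collection (hypothetical helper)."""
    if media_type not in ("images", "audio"):
        raise ValueError("media_type must be 'images' or 'audio'")
    if tag is not None:
        path = f"{API_PREFIX}/{media_type}/tag/{tag}/"
    elif source is not None and creator is not None:
        # A creator collection is always scoped to a single source.
        path = f"{API_PREFIX}/{media_type}/source/{source}/creator/{creator}/"
    elif source is not None:
        path = f"{API_PREFIX}/{media_type}/source/{source}/"
    else:
        raise ValueError("a tag or a source is required")
    # Append query parameters such as page only when some were given.
    return f"{path}?{urlencode(query)}" if query else path


print(collection_path("images", source="flickr", page=2))
```

A creator passed without a source raises `ValueError`, which mirrors the fact that there is no creator route outside of a source.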

To enable this feature, I had to refactor the media serializers and the search controller.

Search controller

I extracted functions that are common for both the collection and search requests.
Both the collections and the search endpoints now call the search_controller.query_media function. query_media builds the relevant query (collection or search), sets up other search parameters, executes the search, tallies the results and builds the search context.
The new build_collection_query creates a query from a dictionary of the possible query predicates (filter, must_not, should and must dictionaries). All of these dictionaries are then combined with a Bool query. This makes the query much less nested than the query that we currently create in the search method in the search_controller.
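
The flat structure described above could look roughly like the following sketch, which uses plain dictionaries in the Elasticsearch query DSL shape instead of the project's actual query objects. The function name `build_bool_query`, the use of lists of clauses, and the `tags.name.keyword` and `mature` field names are assumptions made for illustration; this is not the PR's `build_collection_query`.

```python
def build_bool_query(predicates):
    """Combine predicate clauses into a single bool query body (illustrative only)."""
    allowed = ("filter", "must_not", "should", "must")
    unknown = set(predicates) - set(allowed)
    if unknown:
        raise ValueError(f"unknown predicate keys: {sorted(unknown)}")
    # Keep only the non-empty clause lists so that the query stays flat:
    # one bool query with up to four sibling keys, no nested bool queries.
    return {
        "bool": {key: list(predicates[key]) for key in allowed if predicates.get(key)}
    }


# Hypothetical tag collection: exact tag match, excluding sensitive results.
tag_collection = build_bool_query(
    {
        "filter": [{"term": {"tags.name.keyword": "cat"}}],
        "must_not": [{"term": {"mature": True}}],
    }
)
print(tag_collection)
```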

Serializers

This PR adds a MediaCollectionRequestSerializer to allow for filtering the collections using the filters such as license or category. This would allow us to make large collections easier to browse by various criteria such as the license or category.

This new serializer reuses some common functionality from the MediaSearchRequestSerializer. To make it possible, I had to extract the common functionality into MediaListRequestSerializer.

Sorting of the results

The results are sorted by created_on. The tag results also add the boost for unstable__authority (for the authorized requests).

Testing Instructions

Run the app using just up.
Try various collection routes:

  1. http://localhost:50280/v1/images/source/stocksnap (source collection, only the sources that are present in the sample data can be used to test locally)
  2. http://localhost:50280/v1/images/source/flickr/creator/Manzabar/ (creator collection)
  3. http://localhost:50280/v1/images/tag/cat (tag collection, images)
  4. http://localhost:50280/v1/audio/tag/birds (tag collection, audio)
  5. http://localhost:50280/v1/audio/tag/birds/?extension=flac (additional query parameter)
  6. http://localhost:50280/v1/images/source/flickr/?page=2 (second page of results)

Try the routes that should return errors:

  1. http://localhost:50280/v1/images/source/met (We don't have anything from met in the sample data, so the source validator throws an error.)
  2. http://localhost:50280/v1/images/source/flickr,stocksnap (The path parameters are not split by a comma, so only a single parameter is allowed)

If there are no items corresponding to the requested values, the empty collection can be returned:
http://localhost:50280/v1/images/source/flickr/creator/Manzaba/ (the creator name is missing the last r letter)

Since the image and audio search had to be modified, too, check that they work as expected (the CI should check that, too).

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@obulat obulat self-assigned this Aug 21, 2023
@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Aug 21, 2023
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Aug 21, 2023
@obulat obulat force-pushed the additional_search_views/collection_routes branch 3 times, most recently from 8e15d88 to 1804d4c Compare August 29, 2023 11:28
@obulat obulat force-pushed the additional_search_views/collection_routes branch from 1804d4c to 348a941 Compare September 4, 2023 16:17
@obulat obulat force-pushed the additional_search_views/collection_routes branch 2 times, most recently from 0047964 to 2f56256 Compare September 7, 2023 12:05
@obulat obulat marked this pull request as ready for review September 7, 2023 16:23
@obulat obulat requested a review from a team as a code owner September 7, 2023 16:23
@obulat obulat requested review from krysal and stacimc September 7, 2023 16:23
Collaborator

@stacimc stacimc left a comment

I think this generally looks good — all the tests in the PR description passed (although I have some questions below about additional query parameters). A few times while testing this I had an API request take a very long time (> 10 seconds), for example on a creator search and when getting /source/flickr, but I couldn’t reproduce it consistently and it seemed likely that it was a local env problem. Mentioning in case other folks also notice it.

For the creator collections: I noticed that some records have a URL for creator (for example, the sample data record with id 040db469-9f73-4fed-b4d3-66592d554eb0, whose creator is https://www.rosanetur.com). Should those be handled any differently?

I don’t want to take up time rehashing things that were explained in the IP stage, but I was pretty confused about the intended behavior and how it differs from the existing search parameters. For example, how does searching http://localhost:50280/v1/images/source/flickr/creator/Manzabar differ from the existing http://localhost:50280/v1/images/?source=flickr&creator=Manzabar? Is it that the authority/popularity boost isn’t applied in the new route? Likewise, how does the tag collection differ from querying something like http://localhost:50280/v1/audio/?source=jamendo&tags=acoustic ? Is it mostly that the tags aren’t ‘fuzzy’ matched? (I played around and got results for ‘acoustical’, for example, when searching ?tags=acoustic). Or is it a performance difference?

I also wasn’t clear what additional filters can be applied on the collection views. So for example:

I haven’t been involved in this project so I think I’m just missing a lot of context (especially about how our own frontend actually consumes our API). As I said, I don’t want to take up all your time having things re-explained, sorry! But I think if it’s unclear to me from the API documentation, it might be confusing to others as well? Could we update the <media>_source / _source_creator / _tag documentation to include (a) what additional query parameters are available for each of those routes and (b) how they differ from the equivalent search?

@obulat
Contributor Author

obulat commented Sep 12, 2023

Thank you for your review, @stacimc. Your comments about confusions due to lack of context are especially valuable because they help make the code and docs clearer - I can't see what isn't clear about it as I'm so immersed in the context now :)

A few times while testing this I had an API request take a very long time (> 10 seconds), for example on a creator search and when getting /source/flickr, but I couldn’t reproduce it consistently and it seemed likely that it was a local env problem. Mentioning in case other folks also notice it.

I also noticed the requests taking a long time locally, so I guess this is something I should investigate.

For the creator collections: I noticed that some records have a URL for creator (for example, the sample data record with id 040db469-9f73-4fed-b4d3-66592d554eb0, whose creator is https://www.rosanetur.com). Should those be handled any differently?

For the creator collections, we want to show all of the items by the selected creator in Openverse, and also link to their external url, if it's available, with the "Open creator page" button:

Screenshot 2023-09-12 at 8 08 18 AM

For example, how does searching http://localhost:50280/v1/images/source/flickr/creator/Manzabar differ from the existing http://localhost:50280/v1/images/?source=flickr&creator=Manzabar? Is it that the authority/popularity boost isn’t applied in the new route?

The existing filters match fuzzily by a stemmed value of the query. It's not so easy to understand with "Manzabar". A better example would be "photo": if you try searching for "photo" as creator, you will see that you'll get all items for which the creator has the words "photo" or "photos" in the "creator" field: https://api.openverse.engineering/v1/images/?source=flickr&creator=photo
The http://localhost:50280/v1/images/source/flickr/creator/photo endpoint will return only the exact matches for the creator whose name is "photo", and would not return images by "Obtuse Photo" or "JSmith Photo", as the current search using the query parameter does. The IP describes that if the user clicks on the creator button on the single result page, they would expect to see the media by this specific creator, not by the creators that contain the word in their names:
Screenshot 2023-09-12 at 8 17 35 AM

I suppose that there will also be a performance difference, since the boolean filter check (whether creator is equal to "photo") should be faster than a search within the field (whether "photo" is contained in the creator field). That is a side effect of this change, though; the main reason is the exact match.
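
To make the difference between the two behaviours concrete, here is a small pure-Python imitation. It does not reproduce Elasticsearch analysis: the `crude_stem` helper is a deliberately naive stand-in for real stemming, and "Photos by Ann" is a made-up creator name added to show the "photos" case. "Obtuse Photo" and "JSmith Photo" are the examples from the explanation above.

```python
creators = ["photo", "Obtuse Photo", "JSmith Photo", "Photos by Ann", "Manzabar"]


def crude_stem(word):
    """Naive stand-in for stemming: lowercase and strip a plural 's'."""
    return word.lower().rstrip("s")


def fuzzy_creator_match(query, creator):
    """Imitates the query-parameter search: any stemmed token may match."""
    stemmed = crude_stem(query)
    return any(crude_stem(token) == stemmed for token in creator.split())


def exact_creator_match(query, creator):
    """Imitates the collection route: the whole value must be equal."""
    return creator == query


fuzzy = [c for c in creators if fuzzy_creator_match("photo", c)]
exact = [c for c in creators if exact_creator_match("photo", c)]
print(fuzzy)  # four creators whose names contain "photo" or "photos"
print(exact)  # only the creator named exactly "photo"
```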

I also wasn’t clear what additional filters can be applied on the collection views.

The IP did not mention anything about the filters since the current designs do not show filtering, only displaying all of the items by the creator/source/tag. However, when I started making the changes in the API, I realized that we might want to filter the results (especially the Flickr images, for instance :) ). This is why I left filtering in the collections.

At first, I thought that I should remove the creator/source/tag as filters since the views are already filtered by these parameters. I mainly wanted to have the licenses, categories, extensions, lengths, sensitivity as filters. This is why in your examples filtering by tags does not work. The creator filter (http://localhost:50280/v1/audio/source/freesound/?creator=KTManahan) wasn't supposed to work, oops...

After your comment, though, I realized that we might want to, for instance, filter the creator view by a tag, or filter a tag view by a source. I'll look into how I can add this to the code. If it turns out to require too many changes, this should probably be implemented in another PR.
Note that the filtering will not be available on the frontend as the designs (and the IP) don't include implementation of filtering.

I haven’t been involved in this project so I think I’m just missing a lot of context (especially about how our own frontend actually consumes our API). As I said, I don’t want to take up all your time having things re-explained, sorry! But I think if it’s unclear to me from the API documentation, it might be confusing to others as well? Could we update the <media>_source / _source_creator / _tag documentation to include (a) what additional query parameters are available for each of those routes and (b) how they differ from the equivalent search?

As I said, these requests for clarification are very valuable, thank you! I'll add more documentation to make it clearer.

@obulat obulat force-pushed the additional_search_views/collection_routes branch from 93c663c to fbb1cec Compare September 12, 2023 09:23
Member

@krysal krysal left a comment

It's looking good! I'm leaving a +1 to @stacimc questions.

The existing filters match fuzzily by a stemmed value of the query. It's not so easy to understand with "Manzabar". A better example would be "photo": if you try searching for "photo" as creator, you will see that you'll get all items for which the creator has the words "photo" or "photos" in the "creator" field: https://api.openverse.engineering/v1/images/?source=flickr&creator=photo
The http://localhost:50280/v1/images/source/flickr/creator/photo endpoint will return only the exact matches for the creator whose name is "photo", and would not return images by "Obtuse Photo" or "JSmith Photo", as the current search using the query parameter does.

This explanation makes the necessity of the new creator endpoint clear. Thank you! The only thing I'm missing here is tests for the new views, which would help to understand the differences in the code.

The IP describes that if the user clicks on the creator button on the single result page, they would expect to see the media by this specific creator, not by the creators that contain the word in their names:

If there are creators with the same name within a source, then the endpoint will merge their works in the results, right? I'm okay with this being a first approach but then it will take more work to strictly comply with what the plan says.

@stacimc
Collaborator

stacimc commented Sep 13, 2023

Thanks so much for your explanation, @obulat, now I understand why the new endpoints are needed :)

For the creator collections, we want to show all of the items by the selected creator in Openverse, and also link to their external url, if it's available, with the "Open creator page" button:

To clarify what I meant -- so for most records, the creator is a name and then creator_url is the external url (eg "John Smith", "www.provider-site/users/john-smith" or whatever). I had noticed that at least in our sample data we have some records where both creator and creator_url are urls. In the example I gave, https://www.rosanetur.com is the 'creator' for that record (and creator_url is some other url). So my question is, what does that look like for fetching that creator's collection? Since naively, it would be http://localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/. How should that be urlencoded?
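
For reference, this is what standard percent-encoding of that creator value would produce with Python's `urllib.parse.quote`. Whether the API expects the encoded form or the raw value in the path is not settled at this point in the thread, so the resulting path is only an illustration.

```python
from urllib.parse import quote, unquote

creator = "https://www.rosanetur.com"

# safe="" also encodes "/" so that the creator stays a single path segment.
encoded = quote(creator, safe="")
path = f"/v1/images/source/flickr/creator/{encoded}/"

print(encoded)
print(path)
# Decoding restores the original creator value.
print(unquote(encoded) == creator)
```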

@obulat obulat force-pushed the additional_search_views/collection_routes branch 3 times, most recently from 397bbdb to fa656fc Compare September 19, 2023 05:15
@obulat obulat force-pushed the additional_search_views/collection_routes branch from fa656fc to 0e704e5 Compare September 19, 2023 07:13
@obulat obulat changed the base branch from main to fix/pagination_examples September 19, 2023 07:13
@obulat
Contributor Author

obulat commented Sep 19, 2023

I rebased this PR onto #3039 to make the API documentation debugging easier.

@obulat obulat force-pushed the additional_search_views/collection_routes branch from a3ff695 to 99d64b9 Compare September 19, 2023 11:06
@obulat
Contributor Author

obulat commented Sep 19, 2023

If there are creators with the same name within a source, then the endpoint will merge their works in the results, right? I'm okay with this being a first approach but then it will take more work to strictly comply with what the plan says.

I wrote the code with the assumption that a single source does not have any creators with identical names. I think this is a fair assumption: for instance, you probably cannot register with a name that is already taken on Flickr.
There could be creators with the same name on different providers. They may or may not be the same creators. For instance, different people could be using the name "Olga" on Stocksnap and Flickr. At the same time, Rembrandt's works may be published by different sources. But we cannot make assumptions that they are the same (or different) creators, so the only possible way is to show the media by a specific creator on a specific source site.

@krysal, I added some tests for the collection parameters, but I'm not sure what kind of tests I can add for collections. Could you please elaborate on what kind of tests you think are necessary here?

To clarify what I meant -- so for most records, the creator is a name and then creator_url is the external url (eg "John Smith", "www.provider-site/users/john-smith" or whatever). I had noticed that at least in our sample data we have some records where both creator and creator_url are urls. In the example I gave, https://www.rosanetur.com is the 'creator' for that record (and creator_url is some other url). So my question is, what does that look like for fetching that creator's collection? Since naively, it would be http://localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/. How should that be urlencoded?

Thank you so much for explaining this in detail, @stacimc! I wasn't aware that the creator names can be URLs or can have other special symbols. I updated the creator endpoint regex to allow for /. You can now query by such a creator: localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/, or even filter them by some filter parameter: http://localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/?aspect_ratio=tall
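
A route pattern that tolerates slashes inside the creator segment could be sketched as below. The regular expression and the `parse_creator_route` helper are assumptions for illustration and are not the pattern used in the PR; the sketch only shows that a lazy "match anything" group followed by an optional trailing slash can capture a URL-shaped creator.

```python
import re

# Hypothetical pattern: the source is a single segment, the creator may contain "/".
CREATOR_ROUTE = re.compile(
    r"^source/(?P<source>[^/]+)/creator/(?P<creator>.+?)/?$"
)


def parse_creator_route(path):
    """Return the source and creator captured from a route path, or None."""
    match = CREATOR_ROUTE.match(path)
    return match.groupdict() if match else None


print(parse_creator_route("source/flickr/creator/https://www.rosanetur.com/"))
print(parse_creator_route("source/flickr/creator/Manzabar"))
```

The lazy `.+?` together with the optional trailing `/` keeps the final slash out of the captured creator name.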

I updated the DRF spectacular documentation (e.g. http://localhost:50280/v1/#tag/images/operation/images_source_creator)
Each endpoint now documents the path parameters and the query parameters, and shows the sample responses. Please let me know if this is insufficient and how to update it.

@obulat obulat marked this pull request as ready for review November 10, 2023 07:28
Collaborator

@sarayourfriend sarayourfriend left a comment

Looks good to me. I haven't had a chance to test this locally yet, and will do so right after leaving this review. I wanted to make sure I left these comments now though, just in case I get interrupted later today while testing locally.

The only requested changes are related to if we use a terms query:

However, with the last one, I'm noticing we actually have lots of works where the tags are not lowercased, which I believe might be because tag cleanup isn't actually running? I opened an issue to look into here: #3342. It would be good to understand what's going on there if we're going to use a terms query so that we know what to expect and what to document.

api/api/controllers/search_controller.py
@@ -37,6 +44,14 @@
DEFAULT_BOOST = 10000
DEFAULT_SEARCH_FIELDS = ["title", "description", "tags.name"]

FILTER_TYPE = Literal["filter", "exclude"]
if TYPE_CHECKING:
Collaborator

I was not familiar with this constant. Reading the Python docs, it implies that it's useful to avoid expensive imports. That isn't the case here because the serializers are always part of our app, so avoiding importing them in the previous check or avoiding creating this type don't immediately appear to have a benefit.

Can you add a comment explaining why the TYPE_CHECKING check is a good idea here and at the other site where we check it? Also, is it possible to combine them into a single conditional rather than checking twice?

Contributor Author

This is a fix for circular imports: if we import the serializers here directly, the app will fail with the circular import error. Here's more info on how TYPE_CHECKING is used for resolving circular import errors: https://mypy.readthedocs.io/en/stable/runtime_troubles.html#import-cycles
Should I add a comment with this link?
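
The pattern in question looks roughly like this minimal sketch. The module path in the guarded import is an assumption based on the file names in this PR, and `describe_params` is a hypothetical function; the point is only that the import is skipped at runtime while the name stays available to type checkers through a string annotation.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Avoid circular import: this branch is evaluated by type checkers only,
    # never at runtime. The module path below is assumed for illustration.
    from api.serializers.media_serializers import MediaSearchRequestSerializer


def describe_params(search_params: "MediaSearchRequestSerializer") -> str:
    """The string annotation is never resolved at runtime, so no import is needed."""
    return type(search_params).__name__


print(describe_params(object()))
```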

Collaborator

Oh cool! I only found the reference in the Python docs. Just a comment saying "avoid circular import" is fine from me.

api/api/controllers/search_controller.py

search_query = create_search_query(search_params)
s = s.query(search_query)
if strategy == "search":
Collaborator

(Just a gripe/wondering what other approaches exist, no real suggestion here or request for changes)

I wish it was possible to isolate the strategy-specific and generic stuff in this method. The only way that comes to mind is to create the Search object before the strategy check and then pass it for mutation into the build_*_query functions, but maybe that's harder to follow. Or going full OOP strategy pattern to isolate things, but that doesn't seem worth it either, unless the strategies got a lot more complex.

Contributor Author

I've extracted build_query to make this part a bit clearer.

@sarayourfriend
Collaborator

This is testing well for me locally 🚀. Just a couple of small details with the tags route to iron out, but the rest of this is solid 🪨

@obulat obulat force-pushed the additional_search_views/collection_routes branch from 5728670 to 2e7540c Compare November 13, 2023 09:27
@obulat obulat requested a review from stacimc November 13, 2023 09:39
Collaborator

@sarayourfriend sarayourfriend left a comment

LGTM!

@obulat obulat force-pushed the additional_search_views/collection_routes branch from 2e7540c to b48b865 Compare November 14, 2023 04:32
@openverse-bot
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Collaborator

@stacimc stacimc left a comment

Tests well for me! I would prefer that one docstring update, but nothing blocking. Kudos @obulat, this is great 🚀

obulat and others added 15 commits November 16, 2023 19:24
Signed-off-by: Olga Bulat <[email protected]>
 Notes on fuzzy matching for query params and maximum page_size documentation
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
@obulat obulat force-pushed the additional_search_views/collection_routes branch from 48bb14b to 9c0a6dd Compare November 16, 2023 16:24
Co-authored-by: sarayourfriend <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
@obulat obulat force-pushed the additional_search_views/collection_routes branch from 9c0a6dd to 3248289 Compare November 16, 2023 16:39
@obulat obulat merged commit 3986e71 into main Nov 16, 2023
43 checks passed
@obulat obulat deleted the additional_search_views/collection_routes branch November 16, 2023 16:55
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create a controller for Additional search views Add additional search views endpoints to the API
5 participants