Add API routes and controllers for additional search views #2853

obulat · 2023-08-21T15:06:07Z

Fixes

Fixes #2849 by @obulat
Fixes #2850 by @obulat

Note

This PR was changed a lot after reviews, mainly removing the query parameters, so I wrote a new description in this comment. The old description is left here for the record.

Description

This PR adds the endpoints for additional search views for both audio and image. The endpoints start with /v1/images or /v1/audio:

Tag view: /tag/tag_name
Source view: /source/source_name
Creator view: /source/source_name/creator/creator_name

These endpoints return paginated results and accept the page query parameter. They also support filter query parameters such as license, license_type and category, plus aspect_ratio and size (image only) and length and peaks (audio only). Currently, the frontend plans and designs do not allow for querying the collections using these filter parameters, but we can add this feature later.
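For illustration, here is a minimal sketch of how a client could assemble these routes. The helper name `collection_url`, its signature, and the trailing slash are assumptions made for this example and are not part of the PR; the root URL is the local dev server used in the testing instructions.

```python
from urllib.parse import quote, urlencode

# Local dev server from the testing instructions.
API_ROOT = "http://localhost:50280/v1"


def collection_url(media_type, *, tag=None, source=None, creator=None, **params):
    """Build a collection URL (hypothetical helper, not part of the API code)."""
    if media_type not in ("images", "audio"):
        raise ValueError(f"unknown media type: {media_type}")
    if tag is not None:
        path = f"tag/{quote(tag)}"
    elif source is not None and creator is not None:
        path = f"source/{quote(source)}/creator/{quote(creator)}"
    elif source is not None:
        path = f"source/{quote(source)}"
    else:
        raise ValueError("pass tag, source, or source and creator")
    # Extra keyword arguments become query parameters such as page or license.
    query = f"?{urlencode(params)}" if params else ""
    return f"{API_ROOT}/{media_type}/{path}/{query}"
```

For example, `collection_url("images", source="flickr", page=2)` produces the second-page source URL listed in the testing instructions below.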

To enable this feature, I had to refactor the media serializers and the search controller.

Search controller

I extracted functions that are common for both the collection and search requests.
Both the collections and the search endpoints now call the search_controller.query_media function. query_media builds the relevant query (collection or search), sets up other search parameters, executes the search, tallies the results and builds the search context.
The new build_collection_query creates a query from a dictionary of the possible query predicates (filter, must_not, should and must dictionaries). All of these dictionaries are then combined with a Bool query. This makes the query much less nested than the query that we currently create in the search method in the search_controller.
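As a rough sketch of the idea (not the actual build_collection_query code), the predicates can be collected into one flat bool query body. The function name `build_bool_query` and the field names `source`, `creator.keyword` and `mature` are assumptions for this example, and plain dictionaries stand in for whatever query objects the controller really uses.

```python
def build_bool_query(predicates):
    """Combine predicate lists into one Elasticsearch bool query body.

    `predicates` maps "filter", "must_not", "should" and "must" to lists of
    query clauses; empty or missing entries are left out of the result.
    """
    allowed = ("filter", "must_not", "should", "must")
    unknown = set(predicates) - set(allowed)
    if unknown:
        raise ValueError(f"unknown predicates: {sorted(unknown)}")
    clauses = {key: list(predicates[key]) for key in allowed if predicates.get(key)}
    return {"bool": clauses}


# Hypothetical creator collection: exact source and creator, no sensitive items.
creator_collection = build_bool_query(
    {
        "filter": [
            {"term": {"source": "flickr"}},
            {"term": {"creator.keyword": "Manzabar"}},
        ],
        "must_not": [{"term": {"mature": True}}],
        "should": [],
    }
)
```

Every clause sits one level below the single bool wrapper, which is what keeps the query flat.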

Serializers

This PR adds a MediaCollectionRequestSerializer so that collections can be filtered by parameters such as license or category. This should make large collections easier to browse.

This new serializer reuses some common functionality from the MediaSearchRequestSerializer. To make it possible, I had to extract the common functionality into MediaListRequestSerializer.

Sorting of the results

The results are sorted by created_on. The tag results also add the boost for unstable__authority (for the authorized requests).

Testing Instructions

Run the app using just up.
Try various collection routes:

http://localhost:50280/v1/images/source/stocksnap (source collection, only the sources that are present in the sample data can be used to test locally)
http://localhost:50280/v1/images/source/flickr/creator/Manzabar/ (creator collection)
http://localhost:50280/v1/images/tag/cat (tag collection, images)
http://localhost:50280/v1/audio/tag/birds (tag collection, audio)
http://localhost:50280/v1/audio/tag/birds/?extension=flac (additional query parameter)
http://localhost:50280/v1/images/source/flickr/?page=2 (second page of results)

Try the routes that should return errors:

http://localhost:50280/v1/images/source/met (We don't have anything from met in the sample data, so the source validator throws an error)
http://localhost:50280/v1/images/source/flickr,stocksnap (The path parameters are not split by a comma, so only a single parameter is allowed)

If there are no items corresponding to the requested values, an empty collection is returned:
http://localhost:50280/v1/images/source/flickr/creator/Manzaba/ (the creator name is missing the last r letter)

Since the image and audio search had to be modified, too, check that they work as expected (the CI should check that, too).

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.
I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

stacimc

I think this generally looks good — all the tests in the PR description passed (although I have some questions below about additional query parameters). A few times while testing this I had an API request take a very long time (> 10 seconds), for example on a creator search and when getting /source/flickr, but I couldn’t reproduce it consistently and it seemed likely that it was a local env problem. Mentioning in case other folks also notice it.

For the creator collections: I noticed that some records have a URL as the creator (example sample data with id 040db469-9f73-4fed-b4d3-66592d554eb0, creator is https://www.rosanetur.com). Should those be handled any differently?

I don’t want to take up time rehashing things that were explained in the IP stage, but I was pretty confused about the intended behavior and how it differs from the existing search parameters. For example, how does searching http://localhost:50280/v1/images/source/flickr/creator/Manzabar differ from the existing http://localhost:50280/v1/images/?source=flickr&creator=Manzabar? Is it that the authority/popularity boost isn’t applied in the new route? Likewise, how does the tag collection differ from querying something like http://localhost:50280/v1/audio/?source=jamendo&tags=acoustic ? Is it mostly that the tags aren’t ‘fuzzy’ matched? (I played around and got results for ‘acoustical’, for example, when searching ?tags=acoustic). Or is it a performance difference?

I also wasn’t clear what additional filters can be applied on the collection views. So for example:

http://localhost:50280/v1/audio/source/freesound/?tags=spring appears to silently ignore the tags filter and just returns all Freesound results. (In contrast, http://localhost:50280/v1/audio/?source=freesound&tags=spring appears to work and filters by both)
q is also ignored, so http://localhost:50280/v1/audio/source/freesound/?q=spring returns all Freesound results
http://localhost:50280/v1/audio/source/freesound/?creator=KTManahan does work and returns the same results as http://localhost:50280/v1/audio/source/freesound/creator/KTManahan
Other filters like extension and mature work

I haven’t been involved in this project so I think I’m just missing a lot of context (especially about how our own frontend actually consumes our API). As I said, I don’t want to take up all your time having things re-explained, sorry! But I think if it’s unclear to me from the API documentation, it might be confusing to others as well? Could we update the <media>_source / _source_creator / _tag documentation to include (a) what additional query parameters are available for each of those routes and (b) how they differ from the equivalent search?

api/api/serializers/audio_serializers.py

api/api/serializers/media_serializers.py

api/api/views/media_views.py

obulat · 2023-09-12T05:31:09Z

Thank you for your review, @stacimc. Your comments about confusions due to lack of context are especially valuable because they help make the code and docs clearer - I can't see what isn't clear about it as I'm so immersed in the context now :)

A few times while testing this I had an API request take a very long time (> 10 seconds), for example on a creator search and when getting /source/flickr, but I couldn’t reproduce it consistently and it seemed likely that it was a local env problem. Mentioning in case other folks also notice it.

I also noticed the requests taking a long time locally, so I guess this is something I should investigate.

For the creator collections: I noticed that some records have a URL as the creator (example sample data with id 040db469-9f73-4fed-b4d3-66592d554eb0, creator is https://www.rosanetur.com). Should those be handled any differently?

For the creator collections, we want to show all of the items by the selected creator in Openverse, and also link to their external url, if it's available, with the "Open creator page" button:

For example, how does searching http://localhost:50280/v1/images/source/flickr/creator/Manzabar differ from the existing http://localhost:50280/v1/images/?source=flickr&creator=Manzabar? Is it that the authority/popularity boost isn’t applied in the new route?

The existing filters match fuzzily by a stemmed value of the query. It's not so easy to understand with "Manzabar". A better example would be "photo": if you try searching for "photo" as creator, you will see that you'll get all items for which the creator has the words "photo" or "photos" in the "creator" field: https://api.openverse.engineering/v1/images/?source=flickr&creator=photo
The http://localhost:50280/v1/images/source/flickr/creator/photo endpoint will return only the exact matches for the creator whose name is "photo", and would not return images by "Obtuse Photo" or "JSmith Photo", as the current search using the query parameter does. The IP describes that if the user clicks on the creator button on the single result page, they would expect to see the media by this specific creator, not by the creators that contain the word in their names:

I suppose that there will also be a performance difference, since the boolean filter check (whether creator is equal to "photo") should be faster than a search within the field (whether "photo" is contained in the creator field). That is a side effect of this change, though; the main reason is the exact match.
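To make the difference concrete, here is a toy in-memory sketch. It does not use Elasticsearch, and `naive_stem` is a deliberately crude stand-in for real stemming, so treat it only as an illustration of "any stemmed word matches" versus "the whole value must be equal".

```python
def naive_stem(token):
    """Very rough stand-in for a stemmer: lowercases and drops a plural "s"."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def stemmed_match(creator, query):
    """Mimic the query-parameter behaviour: any stemmed word may match."""
    wanted = naive_stem(query)
    return any(naive_stem(word) == wanted for word in creator.split())


def exact_match(creator, query):
    """Mimic the collection behaviour: the whole creator value must be equal."""
    return creator == query


creators = ["photo", "Obtuse Photo", "JSmith Photos", "Manzabar"]
fuzzy_hits = [c for c in creators if stemmed_match(c, "photo")]
exact_hits = [c for c in creators if exact_match(c, "photo")]
```

Here `fuzzy_hits` contains the first three creators, while `exact_hits` contains only "photo".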

I also wasn’t clear what additional filters can be applied on the collection views.

The IP did not mention anything about the filters since the current designs do not show filtering, only displaying all of the items by the creator/source/tag. However, when I started making the changes in the API, I realized that we might want to filter the results (especially the Flickr images, for instance :) ). This is why I left filtering in the collections.

At first, I thought that I should remove the creator/source/tag as filters since the views are already filtered by these parameters. I mainly wanted to have the licenses, categories, extensions, lengths, sensitivity as filters. This is why in your examples filtering by tags does not work. The creator filter (http://localhost:50280/v1/audio/source/freesound/?creator=KTManahan) wasn't supposed to work, oops...

After your comment, though, I realized that we might want to, for instance, filter the creator view by a tag, or filter a tag view by a source. I'll look into how I can add this to the code. If it turns out to require too many changes, this should probably be implemented in another PR.
Note that the filtering will not be available on the frontend as the designs (and the IP) don't include implementation of filtering.

I haven’t been involved in this project so I think I’m just missing a lot of context (especially about how our own frontend actually consumes our API). As I said, I don’t want to take up all your time having things re-explained, sorry! But I think if it’s unclear to me from the API documentation, it might be confusing to others as well? Could we update the <media>_source / _source_creator / _tag documentation to include (a) what additional query parameters are available for each of those routes and (b) how they differ from the equivalent search?

As I said, these requests for clarification are very valuable, thank you! I'll add more documentation to make it clearer.

krysal

It's looking good! I'm leaving a +1 to @stacimc questions.

The existing filters match fuzzily by a stemmed value of the query. It's not so easy to understand with "Manzabar". A better example would be "photo": if you try searching for "photo" as creator, you will see that you'll get all items for which the creator has the words "photo" or "photos" in the "creator" field: https://api.openverse.engineering/v1/images/?source=flickr&creator=photo
The http://localhost:50280/v1/images/source/flickr/creator/photo endpoint will return only the exact matches for the creator whose name is "photo", and would not return images by "Obtuse Photo" or "JSmith Photo", as the current search using the query parameter does.

This explanation makes the necessity of the new creator endpoint clear. Thank you! The only thing I'm missing in this section is tests for the new views. They would help with understanding the differences in the code.

The IP describes that if the user clicks on the creator button on the single result page, they would expect to see the media by this specific creator, not by the creators that contain the word in their names:

If there are creators with the same name within a source, then the endpoint will merge their works in the results, right? I'm okay with this being a first approach but then it will take more work to strictly comply with what the plan says.

api/api/serializers/audio_serializers.py

stacimc · 2023-09-13T23:07:10Z

Thanks so much for your explanation, @obulat, now I understand why the new endpoints are needed :)

For the creator collections, we want to show all of the items by the selected creator in Openverse, and also link to their external url, if it's available, with the "Open creator page" button:

To clarify what I meant -- so for most records, the creator is a name and then creator_url is the external url (eg "John Smith", "www.provider-site/users/john-smith" or whatever). I had noticed that at least in our sample data we have some records where both creator and creator_url are urls. In the example I gave, https://www.rosanetur.com is the 'creator' for that record (and creator_url is some other url). So my question is, what does that look like for fetching that creator's collection? Since naively, it would be http://localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/. How should that be urlencoded?

obulat · 2023-09-19T07:13:51Z

I rebased this PR onto #3039 to make the API documentation debugging easier.

obulat · 2023-09-19T11:20:12Z

If there are creators with the same name within a source, then the endpoint will merge their works in the results, right? I'm okay with this being a first approach but then it will take more work to strictly comply with what the plan says.

I wrote the code with the assumption that a single source does not have any creators with identical names. I think this is a fair assumption: for instance, you probably cannot register with a name that is already taken on Flickr.
There could be creators with the same name on different providers. They may or may not be the same creators. For instance, different people could be using the name "Olga" on Stocksnap and Flickr. At the same time, Rembrandt's works may be published by different sources. But we cannot make assumptions that they are the same (or different) creators, so the only possible way is to show the media by a specific creator on a specific source site.

@krysal, I added some tests for the collection parameters, but I'm not sure what kind of tests I can add for collections. Could you please elaborate on what kind of tests you think are necessary here?

To clarify what I meant -- so for most records, the creator is a name and then creator_url is the external url (eg "John Smith", "www.provider-site/users/john-smith" or whatever). I had noticed that at least in our sample data we have some records where both creator and creator_url are urls. In the example I gave, https://www.rosanetur.com is the 'creator' for that record (and creator_url is some other url). So my question is, what does that look like for fetching that creator's collection? Since naively, it would be http://localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/. How should that be urlencoded?

Thank you so much for explaining this in detail, @stacimc! I wasn't aware that the creator names can be URLs or can have other special symbols. I updated the creator endpoint regex to allow for /. You can now query by such a creator: localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/, or even filter them by some filter parameter: http://localhost:50280/v1/images/source/flickr/creator/https://www.rosanetur.com/?aspect_ratio=tall
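As an illustration of what "allow for /" can look like, here is a small sketch using Python's `re` module. The pattern below is hypothetical and is not the regex used in the PR: it lets the creator group contain slashes and leaves an optional trailing slash out of the captured value.

```python
import re

# Hypothetical pattern: the creator group accepts any characters, including "/",
# and an optional trailing slash is not part of the captured creator.
CREATOR_ROUTE = re.compile(r"^source/(?P<source>[^/]+)/creator/(?P<creator>.+?)/?$")


def parse_creator_path(path):
    """Return (source, creator) for a creator collection path, or None."""
    match = CREATOR_ROUTE.match(path)
    return (match["source"], match["creator"]) if match else None
```

With this pattern, `parse_creator_path("source/flickr/creator/https://www.rosanetur.com/")` yields `("flickr", "https://www.rosanetur.com")`. One consequence of stripping the trailing slash this way is that a creator value that itself ends in "/" cannot be told apart from one that does not.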

I updated the drf-spectacular documentation (e.g. http://localhost:50280/v1/#tag/images/operation/images_source_creator)
Each endpoint now documents the path parameters and the query parameters, and shows the sample responses. Please let me know if this is insufficient and how to update it.

sarayourfriend

Looks good to me. I haven't had a chance to test this locally yet, and will do so right after leaving this review. I wanted to make sure I left these comments now though, just in case I get interrupted later today while testing locally.

The only requested changes apply if we use a terms query:

Use tags.name.keyword to avoid using terms on a text field
~~Cast tag inputs to lowercase~~ (Add API routes and controllers for additional search views #2853 (comment))

However, with the last one, I'm noticing we actually have lots of works where the tags are not lowercased, which I believe might be because tag cleanup isn't actually running? I opened an issue to look into this: #3342. It would be good to understand what's going on there if we're going to use a terms query so that we know what to expect and what to document.
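For reference, a minimal sketch of the first requested change, written as a plain query dictionary. The function name `tag_collection_filter` is hypothetical. Term-level queries do not analyze their input, so running them against an analyzed text field compares the raw input with indexed tokens; pointing at the keyword sub-field compares it with the stored tag value as-is.

```python
def tag_collection_filter(tag):
    """Exact-match filter on the keyword sub-field; the input is not lowercased."""
    return {"terms": {"tags.name.keyword": [tag]}}
```

Because the input is passed through unchanged, `tag_collection_filter("Cat")` and `tag_collection_filter("cat")` target different stored values, which is why the casing of existing tags matters here.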