-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add API routes and controllers for additional search views #2853
Conversation
8e15d88
to
1804d4c
Compare
1804d4c
to
348a941
Compare
0047964
to
2f56256
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this generally looks good — all the tests in the PR description passed (although I have some questions below about additional query parameters). A few times while testing this I had an API request take a very long time (> 10 seconds), for example on a creator search and when getting /source/flickr, but I couldn’t reproduce it consistently and it seemed likely that it was a local env problem. Mentioning in case other folks also notice it.
For the creator collections: I noticed that some records have a _url _for creator
(example sample data with id 040db469-9f73-4fed-b4d3-66592d554eb0, creator is https://www.rosanetur.com
). Should those be handled any differently?
__
I don’t want to take up time rehashing things that were explained in the IP stage, but I was pretty confused about the intended behavior and how it differs from the existing search parameters. For example, how does searching http://localhost:50280/v1/images/source/flickr/creator/Manzabar differ from the existing http://localhost:50280/v1/images/?source=flickr&creator=Manzabar? Is it that the authority/popularity boost isn’t applied in the new route? Likewise, how does the tag collection differ from querying something like http://localhost:50280/v1/audio/?source=jamendo&tags=acoustic ? Is it mostly that the tags aren’t ‘fuzzy’ matched? (I played around and got results for ‘acoustical’, for example, when searching ?tags=acoustic
). Or is it a performance difference?
I also wasn’t clear what additional filters can be applied on the collection views. So for example:
- http://localhost:50280/v1/audio/source/freesound/?tags=spring appears to silently ignore the
tags
filter and just returns all Freesound results. (In contrast, http://localhost:50280/v1/audio/?source=freesound&tags=spring appears to work and filters by both) - q is also ignored, so http://localhost:50280/v1/audio/source/freesound/?q=spring returns all Freesound results
- http://localhost:50280/v1/audio/source/freesound/?creator=KTManahan does work and returns the same results as http://localhost:50280/v1/audio/source/freesound/creator/KTManahan
- Other filters like
extension
andmature
work
I haven’t been involved in this project so I think I’m just missing a lot of context (especially, about how our own frontend actually consumes our API) — as I said, I don’t want to take up all your time time having things re-explained, sorry! But I think if it’s unclear to me from the API documentation, it might be confusing to others as well? Could we update the <media>_source / _source_creator / _tag
documentation to include (a) what additional query parameters are available for each of those routes and (b) how they differ from the equivalent search
?
Thank you for your review, @stacimc. Your comments about confusions due to lack of context are especially valuable because they help make the code and docs clearer - I can't see what isn't clear about it as I'm so immersed in the context now :)
I *also noticed the requests taking a long time locally, so I guess this is something I should investigate.
For the creator collections, we want to show all of the items by the selected creator in Openverse, and also link to their external
The existing filters match fuzzily by a stemmed value of the query. It's not so easy to understand with "Manzabar". A better example would be "photo": if you try searching for "photo" as creator, you will see that you'll get all items for which the creator has the words "photo" or "photos" in the "creator" field: https://api.openverse.engineering/v1/images/?source=flickr&creator=photo I suppose that there will also be the performance difference, since the boolean filter check (whether
The IP did not mention anything about the filters since the current designs do not show filtering, only displaying all of the items by the creator/source/tag. However, when I started making the changes in the API, I realized that we might want to filter the results (especially the Flickr images, for instance :) ). This is why I left filtering in the collections. At first, I thought that I should remove the creator/source/tag as filters since the views are already filtered by these parameters. I mainly wanted to have the licenses, categories, extensions, lengths, sensitivity as filters. This is why in your examples filtering by After your comment, though, I realized that we might want to, for instance, filter the creator view by a tag, or filter a tag view by a source. I'll look into how I can add this to the code. If it turns out to require too many changes, this should probably be implemented in another PR.
As I said, these requests for clarification are very valuable, thank you! I'll add more documentation to make it clearer. |
93c663c
to
fbb1cec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's looking good! I'm leaving a +1 to @stacimc questions.
The existing filters match fuzzily by a stemmed value of the query. It's not so easy to understand with "Manzabar". A better example would be "photo": if you try searching for "photo" as creator, you will see that you'll get all items for which the creator has the words "photo" or "photos" in the "creator" field: https://api.openverse.engineering/v1/images/?source=flickr&creator=photo
The http://localhost:50280/v1/images/source/flickr/creator/photo endpoint will return only the exact matches for the creator whose name is "photo", and would not return images by "Obtuse Photo" or "JSmith Photo", as the current search using the query parameter does.
This explanation makes it clear the necessity of the new creator endpoint. Thank you! In this section, I'm only missing the tests of the new views. That would help understand the differences in the code.
The IP describes that if the user clicks on the creator button on the single result page, they would expect to see the media by this specific creator, not by the creators that contain the word in their names:
If there are creators with the same name within a source, then the endpoint will merge their works in the results, right? I'm okay with this being a first approach but then it will take more work to strictly comply with what the plan says.
Thanks so much for your explanation, @obulat, now I understand why the new endpoints are needed :)
To clarify what I meant -- so for most records, the |
397bbdb
to
fa656fc
Compare
fa656fc
to
0e704e5
Compare
I rebased this PR on to #3039 to make the API documentation debugging easier. |
a3ff695
to
99d64b9
Compare
I wrote the code with the assumption that a single source does not have any creators with identical names. I think this is a fair assumption: for instance, you probably cannot register with a name that is already taken on Flickr. @krysal, I added some tests for the collection parameters, but I'm not sure what kind of tests I can add for collections. Could you please elaborate on what kind of tests you think are necessary here?
Thank you so much for explaining this in detail, @stacimc! I wasn't aware that the creator names can be URLs or can have other special symbols. I updated the I updated the DRF spectacular documentation (e.g. http://localhost:50280/v1/#tag/images/operation/images_source_creator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I haven't had a chance to test this locally yet, and will do so right after leaving this review. I wanted to make sure I left these comments now though, just in case I get interrupted later today while testing locally.
The only requested changes are related to if we use a terms query:
- Use
tags.name.keyword
to avoid using terms on a text field Cast tag inputs to lowercase(Add API routes and controllers for additional search views #2853 (comment))
However, with the last one, I'm noticing we actually have lots of works where the tags are not lowercased, which I believe might be because tag cleanup isn't actually running? I opened an issue to look into here: #3342. It would be good to understand what's going on there if we're going to use a terms query so that we know what to expect and what to document.
@@ -37,6 +44,14 @@ | |||
DEFAULT_BOOST = 10000 | |||
DEFAULT_SEARCH_FIELDS = ["title", "description", "tags.name"] | |||
|
|||
FILTER_TYPE = Literal["filter", "exclude"] | |||
if TYPE_CHECKING: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not familiar with this constant. Reading the Python docs, it implies that it's useful to avoid expensive imports. That isn't the case here because the serializers are always part of our app, so avoiding importing them in the previous check or avoiding creating this type don't immediately appear to have a benefit.
Can you add a comment explaining why the TYPE_CHECKING check is a good idea for this and other site we check it? Also, is it possible to combine them into a single conditional rather than checking twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fix for circular imports: if we import the serializers here directly, the app will fail with the circular import
error. Here's more info on how TYPE_CHECKING
is used for resolving circular import errors: https://mypy.readthedocs.io/en/stable/runtime_troubles.html#import-cycles
Should I add a comment with this link?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh cool! I only found the reference in the Python docs. Just a comment saying "avoid circular import" is fine from me.
|
||
search_query = create_search_query(search_params) | ||
s = s.query(search_query) | ||
if strategy == "search": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just a gripe/wondering what other approaches exist, no real suggestion here or request for changes)
I wish it was possible to isolate the strategy-specific and generic stuff in this method. The only way that comes to mind is to create the Search
object before the strategy check and then pass it for mutation into the build_*_query
functions, but maybe that's harder to follow. Or going full OOP strategy pattern to isolate things, but that doesn't seem worth it either, unless the strategies got a lot more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've extracted build_query
to make this part a bit clearer.
This is testing well for me locally 🚀. Just a couple of small details with the tags route to iron out, but the rest of this is solid 🪨 |
5728670
to
2e7540c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
2e7540c
to
b48b865
Compare
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal Excluding weekend1 days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2. @obulat, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests well for me! I would prefer that one docstring update, but nothing blocking. Kudos @obulat, this is great 🚀
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Notes on fuzzy matching for query params and maximum page_size documentation
Co-authored-by: Staci Mullins <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Co-authored-by: sarayourfriend <[email protected]>
Co-authored-by: sarayourfriend <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
48bb14b
to
9c0a6dd
Compare
Co-authored-by: sarayourfriend <[email protected]> Signed-off-by: Olga Bulat <[email protected]>
9c0a6dd
to
3248289
Compare
Fixes
Fixes #2849 by @obulat
Fixes #2850 by @obulat
Note
This PR was changed a lot after reviews, mainly removing the query parameters, so I wrote a new description in this comment. The old description is left here for the record.
Description
This PR adds the endpoints for additional search views for both audio and image. The endpoints start with
/v1/images
or/v1/audio
:/tag/tag_name
/source/source_name
/source/source_name/creator/creator_name
.These endpoints return paginated results because they accept
page
query parameter. They also support query parameters such aslicense
,license_type
,category
, [image-only]aspect_ratio
,size
, [audio-only]length
,peaks
. Currently, the frontend plans and designs do not allow for querying the collections using these filter parameters, but we can add this feature later.To enable this feature, I had to refactor the media serializers and the search controller.
Search controller
I extracted functions that are common for both the collection and search requests.
Both the
collection
s and thesearch
endpoints now call thesearch_controller.query_media
function.query_media
builds the relevant query (collection or search), sets up other search parameters, executes the search, tallies the results and builds the search context.The new
build_collection_query
creates a query from a dictionary of the possible query predicates (filter
,must_not
,should
andmust
dictionaries). All of these dictionaries are then combined with aBool
query. This makes the query much less nested than the query that we currently create in thesearch
method in thesearch_controller
.Serializers
This PR adds a
MediaCollectionRequestSerializer
to allow for filtering the collections using the filters such aslicense
orcategory
. This would allow us to make large collections easier to browse by various criteria such as the license or category.This new serializer reuses some common functionality from the
MediaSearchRequestSerializer
. To make it possible, I had to extract the common functionality intoMediaListRequestSerializer
.Sorting of the results
The results are sorted by
created_on
. The tag results also add the boost forunstable__authority
(for the authorized requests).Testing Instructions
Run the app using
just up
.Try various collection routes:
Try the routes that should return errors:
met
in the sample data, so the source validator throws an errorIf there are no items corresponding to the requested values, the empty collection can be returned:
http://localhost:50280/v1/images/source/flickr/creator/Manzaba/ (the creator name is missing the last
r
letter)Since the image and audio search had to be modified, too, check that they work as expected (the CI should check that, too).
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin