Skip to content
This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Add option to sort search results by created_on #916

Merged
merged 18 commits into from
Feb 15, 2023

Conversation

zackkrida
Copy link
Member

@zackkrida zackkrida commented Sep 7, 2022

Updated description by @dhruvkb

Description

This PR adds two query parameters unstable__sort_by and unstable__sort_dir to sort media items on the basis of the created_on field.

  • unstable__sort_by can take relevance (default) or indexed_on (sort by created_on field)
  • unstable__sort_dir can take desc (newest first, default) or asc (oldest first).

Testing instructions

  1. Check out the PR and bring up the services: just up.
  2. Search for media without sorting, https://localhost:50280/v1/images/ and note that the results have no specific sort order.
  3. Search for media with sorting, https://localhost:50280/v1/images/?unstable__sort_by=indexed_on and note that the results still have no specific sort order.
  4. Authenticate yourself.
  5. Search for media with sorting, https://localhost:50280/v1/images/?unstable__sort_by=indexed_on and note that the results are now sorted by created_on.
  6. See some result on the first page and then some on the last to get a feel for the sorting order and direction. You can also narrow down the query like Staci did.
  7. Try the above steps with unstable__sort_dir=asc as well.

Also see update below.


Original description by @zackkrida

This draft PR should not be merged. It's a simple proof of concept to showcase:

  • Adding created_on to the API responses
  • Sorting all searches by created_on

It's meant to show how ElasticSearch's built-in sorting functionality works. I would love to see someone expand on this (my python abilities are really limited 😅) in the following ways:

  • Make this sorting conditional on a sort_by=new query param.
  • Optional: Add a sort_dir=asc/desc option to configure the direction of sort
  • Make sure these query params are undocumented
  • Make this feature only accessible to authenticated users
    • Bonus: make access to this feature accessible only by a special user permission with a checkbox in the Django admin
  • Write tests

Once those items are completed we could test this against production data! We should also research the integrity of the created_on field, and make sure it is a true representation of when a creative work is added to the Openverse catalog and that we never accidentally rewrite that value in production.

@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Sep 7, 2022
@github-actions
Copy link

github-actions bot commented Sep 7, 2022

API Developer Docs Preview: Ready

https://wordpress.github.io/openverse-api/_preview/916

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@zackkrida zackkrida changed the title Init Proof of concept: Sort all search results by created_on date Sep 7, 2022
@zackkrida
Copy link
Member Author

When @panchovm returns we can revisit turning this into a formal project with some frontend UI and additional testing.

@krysal krysal added 🟩 priority: low Low priority and doesn't need to be rushed 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository ⛔ status: blocked Blocked & therefore, not ready for work and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Oct 27, 2022
@krysal
Copy link
Member

krysal commented Oct 27, 2022

Marking as blocked until @panchovm is back and is able to work on the UI for this.

@sarayourfriend sarayourfriend added the 🖼️ aspect: design Concerns related to design label Nov 10, 2022
@sarayourfriend
Copy link
Contributor

@panchovm Just wanted to confirm whether you've seen this PR. I've marked it as "design" to hopefully surface it better for you. I suspect this would need a frontend issue for the design of the feature? @zackkrida do you intend for this PR to get merged eventually or would this be picked up as part of a larger project down the road?

@fcoveram fcoveram removed the 🖼️ aspect: design Concerns related to design label Nov 10, 2022
@fcoveram
Copy link

I tried to replace the aspect: design label with needs design but could not find it. I think it's only available on the front-end repo.

I suspect this would need a frontend issue for the design of the feature?

Yes. And I envision a few design iterations before starting its implementation. I am thinking of different solutions that might force the current layout to tweak other areas.

Since this is marked with low priority, I was planning to work on the homepage redesign before arriving at this task. But now that it is blocked, should we revisit the priority?

@fcoveram fcoveram self-assigned this Nov 10, 2022
@krysal
Copy link
Member

krysal commented Nov 10, 2022

It's still low priority @panchovm, no need to rush on this. I think the API feature can be completed before the design, but I see it as part of a larger project.

@AetherUnbound AetherUnbound added the needs design Needs a designer's touch before implementation can begin label Nov 17, 2022
@sarayourfriend
Copy link
Contributor

@zackkrida Should this be closed in favour of a more comprehensive planning? This change would be blocked until we have analytics, is my understanding at least, and it comes around for the MSR to review weekly to check this PR until then.

@zackkrida
Copy link
Member Author

@sarayourfriend, there was a recent sync session where folks discussed this, and there was actually consensus on having someone open and finish this PR on the sooner side. The idea being that it would be useful for two things:

  • As a default sort order for provider landing pages on the frontend
  • The Openverse Gutenberg integration has some existing code scaffolded for allowing for sort by new which could leverage this as well

The only actual "work" this PR needs is to wrap this in a conditional which checks for a query string:

# Sort by new
s = s.sort({"created_on": {"order": "desc"}})

Sure we'll probably want test(s) as well, and to decide whether to leave this feature documented or undocumented for now.

Folks felt we could defer adding new frontend UI to allow users to sort by new. What do you think about this?

@sarayourfriend
Copy link
Contributor

Maybe we can follow the Gutenberg convention of adding unstable_ to the front of query params we aren't yet committed to. If we did that, then I don't have an opinion about whether this PR is finished, I just don't see a strong reason to do it now, especially if the frontend implementation wouldn't be worked on until some future unknown date (presumably when the change could be measured, though maybe people's sentiments on that too has changed).

@dhruvkb
Copy link
Member

dhruvkb commented Feb 9, 2023

I verified that in both the catalog and the API, the created_on field is not the date of creation of a particular media. However, the catalog can contain date_taken inside the meta_data column for some providers as seen for some Flickr images. This field isn't always present and being JSON can be named other things.

SELECT meta_data FROM image WHERE identifier = '9dd74c70-6069-4658-a843-01c1b072c6fc';
{"views": "21", "pub_date": "1284481786", "date_taken": "2010-09-11 08:25:09", "license_url": "https://creativecommons.org/licenses/by-nd/2.0/", "raw_license_url": null}

This makes me hesitant to proceed with this PR because created_on (in its current form) is not a useful basis for sorting.

@AetherUnbound
Copy link
Contributor

Maybe the framing of "new in Openverse" would be most appropriate for this.

@dhruvkb dhruvkb marked this pull request as ready for review February 10, 2023 08:30
@dhruvkb dhruvkb requested a review from a team as a code owner February 10, 2023 08:30
@dhruvkb dhruvkb requested review from krysal and stacimc February 10, 2023 08:30
@dhruvkb dhruvkb changed the title Proof of concept: Sort all search results by created_on date Add option to sort search results by created_on Feb 10, 2023
@dhruvkb dhruvkb removed ⛔ status: blocked Blocked & therefore, not ready for work needs design Needs a designer's touch before implementation can begin labels Feb 10, 2023
@dhruvkb
Copy link
Member

dhruvkb commented Feb 10, 2023

Update:

This PR is undrafted and open for review. The additional features have have been implemented:

  • This sorting behaviour is conditional, controlled by the unstable__sort_by param.
  • unstable__sort_dir (can be asc/desc) controls the direction of the sort.
  • These params are automatically hidden from the docs as long as they contain the unstable__ prefix.
  • This feature requires authentication. If an anonymous user sets these flags, it has no effect.
    • Adding user-level permission would require a migration, which is overkill at the PoC and testing stage.
  • Tests have been added.

This PR also adds documentation and fixes an ordering bug that occurs on main.

@zackkrida
Copy link
Member Author

Just noting here: I always interpreted this to mean "new to Openverse". If a site adds hundreds of cool new photos of a collection of Helenistic pottery from 30 BCE, I would want that to appear in this filter. "Recently added" is probably the most common name for this type of sorting.

That said, as long as we mark this "unstable" as you've done, @dhruvkb, I'm totally cool with merging this and playing with the results. We should formally make a decision about what this filter means and what value it offers users before we use it anywhere in production.

@dhruvkb
Copy link
Member

dhruvkb commented Feb 13, 2023

If 'newly indexed in Openverse' is what this sort was supposed to mean, then this works 100% according to your expectation. Personally I thought new meant 'newly created' or 'newly uploaded at the source' so I was surprised that the sorted results were totally different.

Again, 'recently added' is not the same as 'recently indexed'. If a site added photos of "Helenistic pottery from 30 BCE" in early 1990s and we just added it as a new provider, we'll still get those photos in this category (which is neither "recently created", nor "recently added", just "recently indexed").

Considering they're not already present, creation_date and uploaded_date fields might be something to consider? (cc: @WordPress/openverse-catalog)

@AetherUnbound
Copy link
Contributor

Considering they're not already present, creation_date and uploaded_date fields might be something to consider? (cc: @WordPress/openverse-catalog)

Definitely, though it seems like it might not be something that's present in all providers and so I wonder if a full-on column makes sense in those cases. Perhaps a way to add it in the MediaStore class that goes into meta_data 🤔

@stacimc
Copy link
Contributor

stacimc commented Feb 13, 2023

@dhruvkb Do you mind updating the description and testing instructions?

@dhruvkb
Copy link
Member

dhruvkb commented Feb 13, 2023

@stacimc done!

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested locally while anonymous and while authenticated and all looks well. I used the query q='waterfall' to reduce the results so I could easily see the full sorting order. I also verified that the created_on field is listed in the documentation under the response schema, but the unstable sort options are not documented. Sorting behavior worked just as described 🎉

For what it's worth, I also was under the impression that the filter was for "new to Openverse", for the pages mentioned in this comment (I also think it's still an interesting filter to have for search, although probably worth having analytics in place so we can verify that). But the concerns raised in this thread are all excellent points and make it very clear that getting the messaging right will be important to avoid confusion when frontend work is done.

As far as this PR, my only concern is naming the field 'created_on' while using 'indexed_on' for the sort param. I agree that indexed_on is the clearer of the two for describing what we actually mean, and would prefer to use it in both places unless it's critical that the field name match the field in the catalog. Would that be possible?

api/catalog/api/constants/field_order.py Outdated Show resolved Hide resolved
api/catalog/api/serializers/media_serializers.py Outdated Show resolved Hide resolved
api/catalog/api/serializers/media_serializers.py Outdated Show resolved Hide resolved
Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 🚀

@zackkrida
Copy link
Member Author

Since I'm technically the author, I can't officially approve this, but it looks excellent, @dhruvkb.

@zackkrida zackkrida requested a review from dhruvkb February 15, 2023 18:48
Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as proxy for @zackkrida.

@dhruvkb dhruvkb merged commit 2ea8704 into main Feb 15, 2023
@dhruvkb dhruvkb deleted the sort-by-new-proof-of-concept branch February 15, 2023 18:49
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants