Use alternate method for getting fast subset of rows #736

Closed
1 task
AetherUnbound opened this issue Jan 21, 2022 · 3 comments
Labels
  • 💻 aspect: code (Concerns the software code in the repository)
  • ✨ goal: improvement (Improvement to an existing user-facing feature)
  • 🟩 priority: low (Low priority and doesn't need to be rushed)
  • 🧱 stack: api (Related to the Django API)

Comments

@AetherUnbound
Collaborator

Problem

PR WordPress/openverse-api#474 introduced an approach to creating a pseudo-random subset by ordering the primary query on identifier. Unfortunately, while I thought the index on identifier would help out, it appears that the query still takes an incredibly long time to return results.

Description

We don't really care about true randomness or even an exact number of records selected, so we could potentially use an approach like this involving TABLESAMPLE SYSTEM to get a fast subset: https://stackoverflow.com/a/8675160/3277713. One thing to consider here is ensuring this is robust during integration testing and copies sufficient data in that case as well. It may be necessary to base the sample size on an estimate of the table's row count and enforce a bare minimum number of rows.
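As a rough illustration of the idea above, here is a minimal Python sketch that turns an estimated row count and a desired subset size into a TABLESAMPLE SYSTEM percentage, with a floor on the number of rows. The function names, the min_rows and oversample parameters, and their default values are all hypothetical and not taken from the ingestion server; the row estimate is assumed to come from something like pg_class.reltuples, which can be zero or negative for a table that has never been analyzed.

```python
import re

# Assumption: the caller gets a cheap row estimate with a query such as
#   SELECT reltuples::bigint FROM pg_class WHERE oid = 'image'::regclass;
# rather than a full COUNT(*).


def sample_percentage(estimated_rows, target_rows, min_rows=1000, oversample=1.5):
    """Return a percentage in (0, 100] for TABLESAMPLE SYSTEM.

    min_rows and oversample are hypothetical knobs: min_rows is the floor
    on the subset size, and oversample pads the percentage because SYSTEM
    sampling picks whole blocks and the row count it returns varies.
    """
    wanted = max(target_rows, min_rows)
    # Unknown or tiny tables (e.g. integration-test fixtures): take everything.
    if estimated_rows <= 0 or wanted >= estimated_rows:
        return 100.0
    return min(100.0, 100.0 * wanted * oversample / estimated_rows)


def build_subset_query(table, estimated_rows, target_rows, min_rows=1000, oversample=1.5):
    """Build a SELECT that samples roughly the wanted number of rows."""
    # Only allow simple identifiers, since the name is interpolated into SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    pct = sample_percentage(estimated_rows, target_rows, min_rows, oversample)
    wanted = max(target_rows, min_rows)
    # LIMIT caps the result when the oversampled block sample returns extra rows.
    return f"SELECT * FROM {table} TABLESAMPLE SYSTEM ({pct:.4f}) LIMIT {wanted}"


if __name__ == "__main__":
    print(build_subset_query("image", 1_000_000, 10_000))
```

Falling back to 100% when the estimate is missing or smaller than the requested size is one way to keep small integration-test tables from yielding an empty or near-empty subset. Because SYSTEM sampling is block-based, the result can still come back under the limit, so callers should not rely on an exact count.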

Alternatives

Additional context

Implementation

  • 🙋 I would be interested in implementing this feature.
@AetherUnbound AetherUnbound added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Jan 21, 2022
@AetherUnbound AetherUnbound changed the title Use alternate methos for getting fast subset of rows Use alternate method for getting fast subset of rows Jan 24, 2022
@AetherUnbound
Collaborator Author

This relates to milestone 1.2.0 in the catalog repo: https://github.com/WordPress/openverse-catalog/milestone/2

@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Feb 23, 2023
@obulat obulat added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels Mar 20, 2023
@sarayourfriend
Collaborator

@AetherUnbound do we still need this? It seems like it might be missing some context, or is detached from the issues/documents that would explain what's going on here. If we still need to solve this, can you add an explanation of why we need this and where it should go? I'm wondering if it's actually meant to be part of the ingestion server, rather than the API? And if so, maybe the ingestion server removal work obviates the need for this or drastically changes how we would implement it?

@AetherUnbound
Collaborator Author

Yes, this had to do with the ingestion server's select limit when running a data refresh. This is actually relevant for @krysal's recent work on setting the limit in https://github.com/WordPress/openverse-infrastructure/pull/908. That said, this is, if anything, an extremely low priority and probably does not need to be addressed in the near future. We'll get data refreshes for staging up soon with #3925, I imagine, and after we start running and testing that, we can see if this sort of thing is still necessary. CC @stacimc in case you have other thoughts, but for now we can close this.

@AetherUnbound AetherUnbound closed this as not planned May 23, 2024
@openverse-bot openverse-bot moved this from 📋 Backlog to 🗑 Discarded in Openverse Backlog May 23, 2024