Use alternate method for getting fast subset of rows #736
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟩 priority: low
Low priority and doesn't need to be rushed
🧱 stack: api
Related to the Django API
Problem
PR WordPress/openverse-api#474 introduced an approach to creating a pseudo-random subset by ordering the primary query on `identifier`. Unfortunately, while I thought the index on `identifier` would help out, it appears that the query still takes an incredibly long time to return results.
Description
We don't really care about true randomness or even an exact number of records selected, so we could potentially use an approach like this involving `TABLESAMPLE SYSTEM` to get a fast subset: https://stackoverflow.com/a/8675160/3277713. One thing to consider here is ensuring this is robust during integration testing, where the tables are small, and that it still copies sufficient data in that case. It may be necessary to base the sample size on an estimate of the table's row count and to enforce a bare minimum number of rows.
Alternatives
Additional context
Implementation
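A rough sketch of how the idea in the description could look, not a tested implementation. It derives a `TABLESAMPLE SYSTEM` percentage from an estimated row count and falls back to the whole table when the table is small, as it would be in integration tests. The function names, the `min_rows` floor, and the `oversample` factor are all hypothetical, and the estimate is assumed to come from something like `pg_class.reltuples`.

```python
import re

# Hypothetical helper names; nothing here exists in the Openverse codebase.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def sample_percentage(estimated_rows, target_rows, min_rows=1000, oversample=2.0):
    """Percentage to pass to TABLESAMPLE SYSTEM for roughly `target_rows` rows.

    `estimated_rows` is assumed to be a cheap estimate (for example
    pg_class.reltuples), which can be 0 or negative for tables that have
    never been analyzed. SYSTEM sampling picks whole blocks, so the result
    size varies; `oversample` pads the request to make a shortfall less likely.
    """
    wanted = max(target_rows, min_rows) * oversample
    if estimated_rows <= wanted:
        # Small or unknown-size table: scan everything.
        return 100.0
    return 100.0 * wanted / estimated_rows


def build_subset_query(table, estimated_rows, target_rows, min_rows=1000):
    """Build a SELECT that reads an approximate subset of `table`."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    pct = sample_percentage(estimated_rows, target_rows, min_rows)
    limit = max(target_rows, min_rows)
    return f"SELECT * FROM {table} TABLESAMPLE SYSTEM ({pct:.4f}) LIMIT {limit}"


if __name__ == "__main__":
    # Large table: sample a small percentage of blocks.
    print(build_subset_query("image", 1_000_000, 10_000))
    # Tiny integration-test table: falls back to 100 percent.
    print(build_subset_query("image", 500, 10_000))
```

The `LIMIT` only caps the result; because block sampling is approximate, the query can still return fewer rows than requested, which should be acceptable given that an exact count is not required.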