create_proportional_by_source_staging_index DAG does not base proportions off source index
#3761
Labels
💻 aspect: code
Concerns the software code in the repository
🛠 goal: fix
Bug fix
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Problem
As described in the IP, the create_proportional_by_source_staging_index DAG uses the stats endpoint (e.g. https://api.openverse.engineering/v1/images/stats/) to get the record counts by source in production. It then creates a new index that is a fraction of the size (as determined by the percentage_of_prod param), but with the same proportion of records per source.
The problem is that the stats endpoint only provides stats for the main, unfiltered media index, while the proportional index DAG allows you to create indices based off of any production index (and in fact defaults to the filtered index). So if you try to make a new index that is 50% of the filtered index, for example, your new index will match the filtered configuration and draw records from that index, but it will be 50% of the size of the unfiltered index and match its source proportions.
This is unintuitive. A few potential problems:
- source_index is a small subset of the main media index
- source_index has very different proportions than the main media index (for example, if a particular source is heavily filtered); this will not be apparent in the new index.
Description
We should be careful about making changes to the public API, for example by adding stats for other indices. However, we could replace this logic in the create_proportional_by_source_staging_index DAG by querying Elasticsearch directly.
This query can be used:
In fact this is what the search controller is doing for the stats endpoint.
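As a rough sketch of the approach described above, the DAG could request per-source document counts for the chosen source_index with a terms aggregation and then scale them by percentage_of_prod. The field name source, the aggregation name unique_sources, the bucket limit, the helper names, and the reading of percentage_of_prod as a value between 0 and 100 are all assumptions for illustration, not the DAG's or the search controller's actual code.

```python
from typing import Any


def build_source_count_query(max_sources: int = 100) -> dict[str, Any]:
    """Build a search body that returns only per-source document counts.

    Assumes documents carry a keyword field named "source" (hypothetical).
    """
    return {
        "size": 0,  # no hits are needed, only the aggregation buckets
        "aggs": {
            "unique_sources": {
                "terms": {"field": "source", "size": max_sources},
            }
        },
    }


def get_source_counts(response: dict[str, Any]) -> dict[str, int]:
    """Map each source name to its document count from a search response."""
    buckets = response["aggregations"]["unique_sources"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def get_proportional_counts(
    source_counts: dict[str, int], percentage_of_prod: float
) -> dict[str, int]:
    """Scale each source's count, keeping the source_index proportions.

    Assumes percentage_of_prod is a percentage between 0 and 100.
    """
    return {
        source: int(count * percentage_of_prod / 100)
        for source, count in source_counts.items()
    }
```

The body from build_source_count_query would be sent to the Elasticsearch search API for source_index rather than the main media index, so the resulting counts reflect whichever index the DAG was asked to copy from.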