create_proportional_by_source_staging_index DAG does not base proportions off source index #3761

Closed
stacimc opened this issue Feb 7, 2024 · 2 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

stacimc commented Feb 7, 2024

Problem

As described in the IP, the create_proportional_by_source_staging_index DAG uses the stats endpoint (e.g. https://api.openverse.engineering/v1/images/stats/) to get the record counts by source in production. It then creates a new index that is a fraction of the size (as determined by the percentage_of_prod param), but with the same proportion of records per source.

The problem is that the stats endpoint only provides stats for the main, unfiltered media index, while the proportional index DAG allows you to create indices based off of any production index (and in fact defaults to the filtered index). So if you try to make a new index that is 50% of the filtered index for example, your new index will match the filtered configuration and draw records from that index -- but it will be 50% of the size of the unfiltered index and match its source proportions.

This is unintuitive. A few potential problems:

  • The size of the index may be much bigger than expected if the source_index is a small subset of the main media index
  • If the source_index has very different proportions than the main media index -- for example, if a particular source is heavily filtered -- this will not be apparent in the new index.
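
To make the second point concrete, here is a minimal sketch of the mismatch. The counts are hypothetical and chosen only for illustration, and `proportional_counts` is an assumed helper name, not the DAG's actual code.

```python
def proportional_counts(source_counts: dict[str, int], percentage: float) -> dict[str, int]:
    """Scale each source's record count by a percentage between 0 and 100."""
    return {
        source: round(count * percentage / 100)
        for source, count in source_counts.items()
    }


# Hypothetical record counts, not real production numbers.
unfiltered = {"flickr": 1000, "wikimedia": 1000}
filtered = {"flickr": 200, "wikimedia": 1000}  # flickr is heavily filtered

# What we would expect from a 50% index based on the filtered index.
expected = proportional_counts(filtered, 50)

# What we get when the counts come from the unfiltered index instead.
actual = proportional_counts(unfiltered, 50)

print("expected:", expected)
print("actual:  ", actual)
```

In this toy case the unfiltered stats request more flickr records than the filtered index contains, and the requested total is larger than half of the filtered index.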

Description

We should be careful about making changes to the public API, for example by adding stats for other indices. However, we could replace this logic in the create_proportional_by_source_staging_index DAG by querying Elasticsearch directly.

This query can be used:

{
    "size": 0,
    "aggs": {
        "unique_sources": {
            "terms": {
                "field": "source",
                "size": 100,
                "order": {"_key": "desc"}
            }
        }
    }
}

In fact this is what the search controller is doing for the stats endpoint.
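
A rough sketch of how the DAG might turn the result of this query into per-source counts is below. It assumes the standard shape of an Elasticsearch terms aggregation response (`aggregations.<name>.buckets`, each bucket with `key` and `doc_count`). The function names and the sample response are hypothetical, and the call that actually sends the query to Elasticsearch is left out.

```python
def build_source_count_query(max_sources: int = 100) -> dict:
    """Build the aggregation query above; max_sources caps the number of buckets."""
    return {
        "size": 0,  # return no documents, only the aggregation
        "aggs": {
            "unique_sources": {
                "terms": {
                    "field": "source",
                    "size": max_sources,
                    "order": {"_key": "desc"},
                }
            }
        },
    }


def parse_source_counts(response: dict) -> dict[str, int]:
    """Map each source name to its document count from the aggregation buckets."""
    buckets = response["aggregations"]["unique_sources"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hypothetical response, trimmed to the fields used here.
sample_response = {
    "aggregations": {
        "unique_sources": {
            "buckets": [
                {"key": "wikimedia", "doc_count": 300},
                {"key": "flickr", "doc_count": 700},
            ]
        }
    }
}

source_counts = parse_source_counts(sample_response)
print(source_counts)
```

Running the query against the source_index itself, rather than reading the stats endpoint, is what would let the proportions follow whichever index was chosen.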

stacimc self-assigned this Feb 7, 2024

stacimc commented Feb 7, 2024

Setting the priority relatively low because for the use cases we can anticipate, the difference won't be severe -- although even then, I think this is very counter-intuitive and needs to be resolved.

I checked the record counts for the filtered and unfiltered image indices in production. Unsurprisingly, the proportion for Flickr is the one with the biggest difference. Somewhat surprisingly, Flickr is a bigger proportion of the filtered index.

| | image | image-filtered | Delta |
| --- | --- | --- | --- |
| Total image count | 749,056,559 | 743,672,348 | 5,384,211 |
| Flickr image count | 507,925,257 | 504,631,702 | 3,293,555 |
| Flickr percentage of total | 67.80866556700159 | 67.85672525772062 | 0.04805969071904315 |

Consequently, imagine we are using the DAG to make a new index based off the filtered image index, with percentage_of_prod set to 25%. If we used the stats of the image-filtered index, we would end up with:

  • 185,918,087 total records
  • 126,157,926 Flickr records

But using the stats of the unfiltered index we actually end up with:

  • 187,264,140 total records
  • 126,981,314 Flickr records

The biggest consequence is that the index is larger than we wanted (by about 1.3 million records here), but the proportions of the filtered vs unfiltered indices are (currently, at least) very close.
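
The 25% sizes can be recomputed directly from the production counts in the table. This is only a sketch of the arithmetic: the helper names are hypothetical, and the rounding the DAG applies may differ.

```python
from fractions import Fraction

# Counts taken from the table above. The image-filtered Flickr count is read
# as 504,631,702, which is the value consistent with the Delta column.
COUNTS = {
    "image": {"total": 749_056_559, "flickr": 507_925_257},
    "image-filtered": {"total": 743_672_348, "flickr": 504_631_702},
}


def scaled(count: int, percentage_of_prod: int) -> int:
    """Scale a record count by a whole-number percentage, using exact arithmetic."""
    return round(Fraction(count * percentage_of_prod, 100))


def staging_sizes(index: str, percentage_of_prod: int) -> dict[str, int]:
    """Return the scaled total and Flickr counts for the given index."""
    return {
        name: scaled(count, percentage_of_prod)
        for name, count in COUNTS[index].items()
    }


for index in COUNTS:
    print(index, staging_sizes(index, 25))
```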


All that being said, I think it actually ended up being more complicated to implement the version using the stats endpoint than it would be to do this query, so I think this is very doable within the milestone still.

stacimc commented Mar 5, 2024

Fixed in #3763!

stacimc closed this as completed Mar 5, 2024