Remove popularity constants view #2883
Conversation
):
    if media_type == AUDIO:
        popularity_metrics_table = AUDIO_POPULARITY_METRICS_TABLE_NAME
        popularity_percentile = AUDIO_POPULARITY_PERCENTILE_FUNCTION
This is an example of following the existing conventions in this file, which will be cleaned up in #2678. It's not easy to refactor it in this PR without also touching a lot of other methods, and I felt that would make this harder to review.
The branch was updated from 4cc2983 to a8ae7b0.
I've gotten through testing the local stuff and it's working well so far, aside from one caveat that might not necessarily be related to this PR (I haven't tried it on
I'm going to take a break now, and when I get back I will test the production instructions and then review the code.
Okay, I've tested the production instructions and things work! I still get the mapped task deferral behaviour with working popularity updates following the production instructions 🤷. It does work; I'm just confused about what the deal is with the deferred tasks. Reviewing the code now.
LGTM. My comments are just fiddly suggestions, nothing blocking. As I said in my other comments, it tests perfectly fine for me locally with both sets of instructions. I don't know this process well enough to comment on whether there are additional ways we should test.
One other comment, though. With the production SQL you've shared: would it make sense (and document things a bit) if, rather than updating the pseudo-migrations in docker/upstream_db to reflect the changes, we added a new file in there to run after all the old schema instructions? I know it doesn't get used in production, but it would at least do the following, in my view:
- Document those changes in the codebase, making it easier to see what has run in production and helping us understand the evolution of the schema over time, even if those changes are run manually in production (and even if changes have been made directly in the past, without following a historical/migrations approach).
- Test the schema changes "automatically"; at first I thought this would be problematic for local testing and then remembered that actually the entrypoint does not re-run those SQL files when the container goes up, so you have to explicitly tear down the upstream db service anyway.
That's also a fiddly comment and deals more with the general approach to the catalogue's schema management and evolution. It probably fits in better with the wider discussion of #1836 and how to approach things in the meantime.
@@ -27,6 +27,14 @@
# https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/_api/airflow/providers/postgres/hooks/postgres/index.html#airflow.providers.postgres.hooks.postgres.PostgresHook.copy_expert # noqa


def _single_value(cursor):
Now that this is in a shared location and used in other modules, we can remove the underscore prefix, right?
Suggested change:
- def _single_value(cursor):
+ def single_value(cursor):
(Requires updates in import locations too, of course)
Agreed! I'm going to make a note to update this in the follow-up in #2678, because even as simple as this is, I'd still want to fully retest.
catalog/dags/popularity/refresh_popularity_metrics_task_factory.py (outdated comment, resolved)
The popularity refresh batched updates should go into a deferred state, but they should poll to check on the status of the update every 30 minutes for image and every 1 minute for audio. Do you have the
@sarayourfriend I love this idea and your further explanation. It would make testing these sorts of changes much less involved in the future 😮 It's only because this is blocking so much in the catalog that I think it's worth waiting until after this work is complete to decide on the process, maybe as part of #1836 as you say.
The changes have been applied in production; this can now be merged.
Fixes
Fast fix for urgent production issue.
Description
Note: throughout the PR description I sometimes refer only to the tables and views for image for simplicity, but the same is true for audio. This PR's code changes affect both media types, and both should be tested.
Background and problem
Currently we have an `image_popularity_metrics` table (which contains information like the name of the `metric` field in the `meta_data` column to use for calculating pop scores), and an `image_popularity_constants` view which contains all the information from the metrics table, plus two calculated fields: a percentile value and a popularity constant. The constant is based on the percentile value, and is what ultimately gets used during ingestion/popularity refreshes to calculate standardized popularity.

Currently, when we want to update our popularity constants we refresh that view. This process is taking an extremely long time. We cannot drop and recreate the view instead, because the constants need to be available at all times for ingestion (i.e. we can't drop the old constants until the new ones are done being calculated). At any rate, recreating the view is also taking a long time.
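To make the shape of this concrete, here is a rough sketch of the two objects as described above. The table, column, and function names are illustrative assumptions, not the actual production schema:

```sql
-- Illustrative sketch only; names are assumptions, not the real schema.

-- Metrics table: one row per provider, recording which meta_data field holds
-- the raw popularity metric and which percentile to target.
CREATE TABLE image_popularity_metrics (
    provider   text PRIMARY KEY,
    metric     text NOT NULL,
    percentile float NOT NULL
);

-- Constants view: everything from the metrics table plus the two expensive
-- calculated fields (the percentile value and the constant derived from it).
-- Refreshing this view is the step that has become prohibitively slow.
CREATE MATERIALIZED VIEW image_popularity_constants AS
SELECT
    m.provider,
    m.metric,
    m.percentile,
    p.val,
    image_popularity_constant(m.percentile, p.val) AS constant
FROM image_popularity_metrics m
CROSS JOIN LATERAL (
    -- scans the image table to find the metric value at the target percentile
    SELECT image_popularity_percentile(m.provider, m.metric, m.percentile) AS val
) p;
```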
Approach in this PR
This PR addresses the issue by removing the `image_popularity_constants` view, and instead adding the calculated columns directly to the `image_popularity_metrics` table. In production, we will manually hardcode these columns to their last known good values (taken from a snapshot).

Then in the popularity refresh DAG, instead of refreshing the constants view I've added some new tasks that first calculate the percentile value, and then update the metrics table with the new percentile and constant once done. We use the exact same existing SQL functions to calculate the values, but SELECT them individually instead of refreshing the view.
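Conceptually, per provider the new tasks boil down to something like the following two statements. This is a sketch using the assumed names from the illustration above; the provider, metric, and values shown are placeholders, not the actual SQL in the DAG:

```sql
-- Step 1 (its own task, since this is the slow part): compute the new
-- percentile value with the same SQL function the old view used.
SELECT image_popularity_percentile('flickr', 'views', 0.85) AS val;

-- Step 2 (runs once step 1 finishes): write the new percentile value and the
-- constant derived from it onto the (newly added) columns of the metrics table.
-- The old values stay in place until this UPDATE commits, so ingestion always
-- has constants available.
UPDATE image_popularity_metrics
SET val      = 250.0,  -- placeholder for the value returned by step 1
    constant = image_popularity_constant(percentile, 250.0)
WHERE provider = 'flickr';
```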
Some benefits of this approach:
Testing Instructions
A note for code review: the code changes are in an area that will be cleaned up in #2678. Some things follow the existing conventions for now, because I wanted to keep this change as small as possible.
We need to test the local environment and the process for updating production.
Local env
Run `just recreate` on this branch to make sure you have the schema changes and our sample data. In `just catalog/pgcli`:

Now run the image popularity refresh DAG and make sure it works.
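If you want a quick sanity check in pgcli that the schema changes landed, something like the following works (an illustrative query; the exact column names are assumptions):

```sql
-- The calculated columns should now live on the metrics table itself,
-- rather than on a separate constants view.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'image_popularity_metrics';
```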
Now run the Flickr DAG until it ingests a few hundred records and mark it as a success.
To be very thorough, you can also run the `image_popularity_refresh` a second time and then inspect the metrics table to see that the constant and val for Flickr were updated (now that we have new records).
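For example (an illustrative query, assuming the column names from the sketches above):

```sql
-- After the second refresh, Flickr's percentile value and constant should have
-- changed, since the newly ingested records shift the percentile calculation.
SELECT provider, metric, percentile, val, constant
FROM public.image_popularity_metrics
WHERE provider = 'flickr';
```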
Repeat the process for `audio`. I recommend using Jamendo when you get to the step about running a provider DAG. These were the values I got after running a popularity refresh with audio on just the sample data:

This is not necessary for testing this PR, but to be completely safe I personally tested the data refreshes as well. This branch includes all the popularity refresh changes, including #4580 :)
Production approach
Merging this PR will not automatically update the tables/views/functions in production. We have to do that manually. To test the process, first check out `main` and run `just recreate` to go back to plain sample data and the old schema.

Now run the image and audio popularity refreshes. (There is a bug on `main`, fixed in this PR, where the batched updates error when there are 0 rows to update -- as in nappy, rawpixel, and wikimedia in our sample data. You can ignore these or mark them as successful manually, it doesn't matter.) Then in pgcli run `SELECT * FROM public.audio_popularity_constants` and `SELECT * FROM public.image_popularity_constants` and double check that the results look identical to the percentiles/constants calculated for the sample data on this test branch (this helps test that the constants are being calculated the exact same as before).

We are going to simulate exactly what we will do on production, by first running these queries on our local copy of `main`. Then we'll check out this branch without doing a `recreate`, simulating merging the PR. Bear with me!

Let's start: while we're still on `main`, run the SQL queries that we'll run on production. I give the queries for both audio and image, with commentary:

At this point, we have not merged the PR, but provider DAGs and data refreshes can already be turned back on in production. DAGs at ingestion are using the constants from the metrics table, and data refreshes have been updated (in another PR) to not do popularity steps, so they're good to go.
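As a rough illustration only of the kind of change described above (adding the calculated columns to the metrics tables and hardcoding them to the last known good values), and not the actual queries shared or run in production, the image half might look something like this; the audio statements would follow the same pattern:

```sql
-- Illustrative sketch; not the production queries. Column names are assumptions.

-- 1. Add the calculated columns directly to the metrics table.
ALTER TABLE public.image_popularity_metrics
    ADD COLUMN val float,
    ADD COLUMN constant float;

-- 2. Hardcode them to the last known good values. Locally we can copy them from
--    the still-existing constants view; in production they come from a snapshot.
UPDATE public.image_popularity_metrics AS m
SET val      = c.val,
    constant = c.constant
FROM public.image_popularity_constants AS c
WHERE m.provider = c.provider;
```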
The final step is to update the popularity refresh DAGs to update constants in the metrics table, rather than refreshing the view. That change makes it to production by merging this PR. Simulate that by checking out this branch (but do NOT `just recreate`, so you still have the manually edited schema). Run `just up` to pick up changes.

Now in pgcli, we can finally drop the constants views:
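For example (a sketch, assuming the constants views are the materialized views that the "refresh" language above implies):

```sql
-- Nothing reads from the constants views any more, so they can be removed.
DROP MATERIALIZED VIEW public.image_popularity_constants;
DROP MATERIALIZED VIEW public.audio_popularity_constants;
```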
And now try running the image and audio popularity refreshes and ensure they work!
Checklist
Developer Certificate of Origin