Remove popularity constants view #2883

Merged on Aug 30, 2023 (15 commits)

@stacimc
Collaborator

stacimc commented Aug 25, 2023

Fixes

Fast fix for urgent production issue.

⚠️ This is a very long PR description, but the goal was for the testing steps to be extremely thorough so it is easy for others to test. The changes are not particularly complicated, but we do want to be exhaustive in testing. Please ping me if you need any assistance at all. ⚠️

Description

Note: throughout the PR description I sometimes refer only to the tables and views for image for simplicity, but the same is true for audio. This PR's code changes affect both media types, and both should be tested.

Background and problem

Currently we have an image_popularity_metrics table (that contains information like the name of the metric field in the meta_data column to use for calculating pop scores), and an image_popularity_constants view which contains all the information from the metrics table, plus two calculated fields: a percentile value and popularity constant. The constant is based on the percentile value, and is what ultimately gets used during ingestion/popularity refreshes to calculate standardized popularity.
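
The sample values shown later in this description are consistent with the constant being derived as ((1 - percentile) / percentile) * val. That relationship is inferred from those numbers rather than quoted from the SQL functions, so the sketch below is an assumption, and `popularity_constant` is a hypothetical name:

```python
import math


def popularity_constant(percentile: float, val: float) -> float:
    """Assumed relationship: constant = ((1 - percentile) / percentile) * val."""
    return ((1 - percentile) / percentile) * val


# Sample rows from this description: (provider, percentile, val, constant)
samples = [
    ("flickr", 0.85, 35.0, 6.176470588235295),
    ("stocksnap", 0.85, 120.0, 21.176470588235297),
    ("jamendo", 0.85, 2245028.0, 396181.41176470596),
]

for provider, percentile, val, expected in samples:
    computed = popularity_constant(percentile, val)
    # Compare with a tolerance rather than exact float equality.
    print(provider, computed, math.isclose(computed, expected, rel_tol=1e-9))
```

Under this assumption, a record whose raw metric equals the percentile value gets a standardized score equal to the percentile itself, since val / (val + constant) simplifies to percentile.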

Currently, when we want to update our popularity constants we refresh that view. This process is taking an extremely long time. We cannot drop and recreate the view instead, because the constants need to be available at all times for ingestion (i.e. we can't drop the old constants until the new ones are done being calculated). At any rate, recreating the view is also taking a long time.

Approach in this PR

This PR addresses the issue by removing the image_popularity_constants view, and instead adding the calculated columns directly to the image_popularity_metrics table. In production, we will manually hardcode these columns to their last known good values (taken from a snapshot).

Then in the popularity refresh DAG, instead of refreshing the constants view I've added some new tasks that first calculate the percentile value, and then update the metrics table with the new percentile and constant once done. We use the exact same existing SQL functions to calculate the values, but SELECT them individually instead of refreshing the view.
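
A minimal sketch of that two-step flow (calculate the percentile value, then write val and constant back to the metrics table). It uses an in-memory SQLite table in place of Postgres, a hypothetical nearest-rank `percentile_value` helper in place of the existing SQL percentile function, and an assumed formula for the constant. None of these names come from the PR's code:

```python
import math
import sqlite3


def percentile_value(values, percentile):
    """Nearest-rank percentile (a stand-in for the real SQL percentile function)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]


def refresh_constant(conn, provider, metric_values):
    """Step 1: compute the percentile value. Step 2: update the metrics table."""
    (percentile,) = conn.execute(
        "SELECT percentile FROM image_popularity_metrics WHERE provider = ?",
        (provider,),
    ).fetchone()
    val = float(percentile_value(metric_values, percentile))
    # Assumed formula, inferred from the sample values in this description.
    constant = ((1 - percentile) / percentile) * val
    conn.execute(
        "UPDATE image_popularity_metrics SET val = ?, constant = ? WHERE provider = ?",
        (val, constant, provider),
    )
    return val, constant


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_popularity_metrics "
    "(provider TEXT, metric TEXT, percentile REAL, val REAL, constant REAL)"
)
conn.execute(
    "INSERT INTO image_popularity_metrics VALUES ('flickr', 'views', 0.85, NULL, NULL)"
)

# Made-up view counts, for illustration only.
refresh_constant(conn, "flickr", [0, 0, 3, 4, 10, 24, 35, 38, 40, 100])
print(conn.execute("SELECT * FROM image_popularity_metrics").fetchall())
```

Because the old val and constant stay in the table until the UPDATE runs, ingestion can keep reading the previous constant while the new percentile is being computed.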

[Screenshot 2023-08-25 at 2:15:54 PM]

Some benefits of this approach:

  • I'm hopeful that this will entirely circumvent whatever performance issues we're having with the materialized view
  • If we do continue to see performance issues, they will be much easier to debug because the scope of the investigation is reduced to just the SQL functions.
  • If we get the constants into a strange state again, we can resolve things much more quickly because we can always manually update the constants to their last known values (for example, right now if the view is dropped there is absolutely no way to run ingestion or refreshes until the entire view can be fully recreated. You cannot simply insert into a view.)
  • Possibly even a performance improvement because the updates for each provider are done in parallel (although it's definitely possible that the optimizer was already parallelizing this in the matview refresh)

Testing Instructions

A note for code review: the code changes are in an area that will be cleaned up in #2678. Some things follow the existing conventions for now, because I wanted to keep this change as small as possible.

We need to test local env and the process for updating on production.

Local env

Run just recreate on this branch to make sure you have the schema changes and our sample data. In just catalog/pgcli:

-- Verify the constants view is gone (this should fail)
> describe image_popularity_constants;

-- Observe that by default you have no calculated constants (values for val/constant will all be null)
> select * from image_popularity_metrics;

Now run the image popularity refresh DAG and make sure it works.

-- See that you now have calculated popularity constants. The nulls are for providers we don't
-- have sample data for and are expected
> select * from image_popularity_metrics;
+-----------+--------------------+------------+--------+--------------------+
| provider  | metric             | percentile | val    | constant           |
|-----------+--------------------+------------+--------+--------------------|
| nappy     | downloads          | 0.85       | <null> | <null>             |
| rawpixel  | download_count     | 0.85       | <null> | <null>             |
| wikimedia | global_usage_count | 0.85       | <null> | <null>             |
| flickr    | views              | 0.85       | 35.0   | 6.176470588235295  |
| stocksnap | downloads_raw      | 0.85       | 120.0  | 21.176470588235297 |
+-----------+--------------------+------------+--------+--------------------+

Now run the Flickr DAG until it ingests a few hundred records and mark it as a success.

-- See that the newly ingested records have calculated standardized popularity. Note that a value of 0 is normal, but there
-- should be some records with non-0 scores
openledger> select foreign_identifier, standardized_popularity, updated_on from image where provider = 'flickr' order by updated_on desc limit 10;
+--------------------+-------------------------+-------------------------------+
| foreign_identifier | standardized_popularity | updated_on                    |
|--------------------+-------------------------+-------------------------------|
| 53136893332        | 0.0                     | 2023-08-25 00:26:10.817251+00 |
| 53136895092        | 0.0                     | 2023-08-25 00:26:10.817251+00 |
| 53136897297        | 0.39306358381502887     | 2023-08-25 00:26:10.817251+00 |
| 53136898092        | 0.32692307692307687     | 2023-08-25 00:26:10.817251+00 |
| 53136900057        | 0.0                     | 2023-08-25 00:26:10.817251+00 |
| 53136901057        | 0.0                     | 2023-08-25 00:26:10.817251+00 |
| 53136901902        | 0.0                     | 2023-08-25 00:26:10.817251+00 |
| 53136908212        | 0.8601864181091877      | 2023-08-25 00:26:10.817251+00 |
| 53136910592        | 0.7953216374269005      | 2023-08-25 00:26:10.817251+00 |
| 53136879167        | 0.32692307692307687     | 2023-08-25 00:26:10.817251+00 |
+--------------------+-------------------------+-------------------------------+

To be very thorough, you can also run the image_popularity_refresh a second time and then inspect the metrics table to see that the constant and val for flickr were updated (now that we have new records).

Repeat the process for audio. I recommend using Jamendo when you get to the step about running a provider DAG. These were the values I got after running a popularity refresh with audio on just the sample data:

+-----------------+--------------------+------------+-----------+---------------------+
| provider        | metric             | percentile | val       | constant            |
|-----------------+--------------------+------------+-----------+---------------------|
| wikimedia_audio | global_usage_count | 0.85       | 1.0       | 0.17647058823529416 |
| jamendo         | listens            | 0.85       | 2245028.0 | 396181.41176470596  |
| freesound       | num_downloads      | 0.85       | 34.0      | 6.000000000000002   |
+-----------------+--------------------+------------+-----------+---------------------+

This is not necessary for testing this PR, but to be completely safe I personally tested the data refreshes as well. This branch includes all the popularity refresh changes, including #4580 :)

Production approach

Merging this PR will not automatically update the tables/views/functions in production. We have to do that manually. To test the process, first check out main and run just recreate to go back to plain sample data and the old schema.

Now run the image and audio popularity refreshes. (There is a bug on main, fixed in this PR, where the batched updates error when there are 0 rows to update, as with nappy, rawpixel, and wikimedia in our sample data. You can ignore these or mark them as successful manually; it doesn't matter.) Then in pgcli run SELECT * FROM public.audio_popularity_constants and SELECT * FROM public.image_popularity_constants and double-check that the results are identical to the percentiles/constants calculated for the sample data on this test branch (this helps verify that the constants are being calculated exactly as before).

We are going to simulate exactly what we will do on production, by first running these queries on our local copy of main. Then we'll check out this branch without doing a recreate, simulating merging the PR. Bear with me!

Let's start: while we're still on main, run the SQL queries that we'll run on production. I give the queries for both audio and image, with commentary:

-- Audio

-- Update the metrics table to have the new columns
ALTER TABLE public.audio_popularity_metrics
ADD COLUMN val float,
ADD COLUMN constant float;

-- Insert vals into the metrics table. Here I'm using the vals we got from our sample data, but
-- on production we'll grab the actual percentile vals and constants from a snapshot.
UPDATE public.audio_popularity_metrics AS audio_metrics
SET val = new_vals.val, constant = new_vals.constant
FROM (values
	('wikimedia_audio', 1.0, 0.17647058823529416),
	('jamendo', 2245028.0, 396181.41176470596),
	('freesound', 34.0, 6.000000000000002)
) AS new_vals(provider, val, constant)
WHERE new_vals.provider = audio_metrics.provider;

-- Drop the standardized popularity function so we can recreate it to not rely on the constants
-- view. Note that dropping this function necessarily drops the matview (e.g. audio_view).
-- That's okay because this view is already no longer being used anywhere, and will soon be
-- dropped in an upcoming PR. When applying in prod, though, we should make sure no provider
-- DAGs are currently running since provider DAGs use this function.
DROP FUNCTION IF EXISTS public.standardized_audio_popularity CASCADE;

-- Recreate the standardized popularity function. The only change is that it is grabbing the
-- constant from the metrics table instead of the constants view. 
CREATE OR REPLACE FUNCTION public.standardized_audio_popularity(
 	provider text, meta_data jsonb
) RETURNS FLOAT AS $$
	SELECT ($2->>metric)::float / (($2->>metric)::float + constant)
	FROM public.audio_popularity_metrics WHERE provider=$1;
$$
LANGUAGE SQL
STABLE
RETURNS NULL ON NULL INPUT;

-- Note: we have not dropped the constants view yet!


-- Image. All queries are the same, just with 'image' swapped for 'audio'

ALTER TABLE public.image_popularity_metrics
ADD COLUMN val float,
ADD COLUMN constant float;

UPDATE public.image_popularity_metrics AS image_metrics
 SET val = new_vals.val, constant = new_vals.constant
 FROM (values
     ('flickr', 35.0, 6.176470588235295),
     ('stocksnap', 120.0, 21.176470588235297)
 ) AS new_vals(provider, val, constant)
 WHERE new_vals.provider = image_metrics.provider;

DROP FUNCTION IF EXISTS public.standardized_image_popularity CASCADE;

CREATE OR REPLACE FUNCTION public.standardized_image_popularity(
 	provider text, meta_data jsonb
) RETURNS FLOAT AS $$
	SELECT ($2->>metric)::float / (($2->>metric)::float + constant)
	FROM public.image_popularity_metrics WHERE provider=$1;
$$
LANGUAGE SQL
STABLE
RETURNS NULL ON NULL INPUT;
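
For reference, the recreated functions compute metric / (metric + constant). Below is a minimal Python sketch of the same arithmetic, where None stands in for SQL NULL. The raw view counts are hypothetical values chosen for illustration, and `standardized_popularity` here is a plain Python stand-in, not the SQL function itself:

```python
from typing import Optional


def standardized_popularity(
    metric_value: Optional[float], constant: Optional[float]
) -> Optional[float]:
    """Mirror of the SQL expression value / (value + constant)."""
    # A missing metric or a missing constant yields NULL in SQL; None here.
    if metric_value is None or constant is None:
        return None
    return metric_value / (metric_value + constant)


FLICKR_CONSTANT = 6.176470588235295  # sample-data constant from this description

for views in (0, 3, 4, 24, 38):  # hypothetical raw view counts
    print(views, standardized_popularity(float(views), FLICKR_CONSTANT))

# A provider whose constant has not been calculated yet gets no score.
print(standardized_popularity(10.0, None))
```

A record with 0 views scores 0.0, which matches the note above that a value of 0 is normal. Scores approach 1 as the raw metric grows relative to the constant.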

At this point, we have not merged the PR but provider DAGs and data refreshes can already be turned back on in production. DAGs at ingestion are using the constants from the metrics table, and data refreshes have been updated (in another PR) to not do popularity steps, so they're good to go.

The final step is to update the popularity refresh DAGs to update constants in the metrics table, rather than refreshing the view. That change makes it to production by merging this PR. Simulate that by checking out this branch (but do NOT just recreate, so you still have the manually edited schema). Run just up to pick up changes.

Now in pgcli, we can finally drop the constants views:

DROP MATERIALIZED VIEW public.audio_popularity_constants;
DROP MATERIALIZED VIEW public.image_popularity_constants;

And now try running the image and audio popularity refreshes and ensure they work!

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc added the 🟧 priority: high, ✨ goal: improvement, 💻 aspect: code, and 🧱 stack: catalog labels Aug 25, 2023
@stacimc self-assigned this Aug 25, 2023
@github-actions bot added the 🧱 stack: ingestion server label Aug 25, 2023
@stacimc added the 🟥 priority: critical label and removed the 🟧 priority: high label Aug 25, 2023
):
    if media_type == AUDIO:
        popularity_metrics_table = AUDIO_POPULARITY_METRICS_TABLE_NAME
        popularity_percentile = AUDIO_POPULARITY_PERCENTILE_FUNCTION
Collaborator Author

This is an example of following the existing conventions in this file, that will be cleaned up in #2678. It's not easy to refactor it in this PR without also touching a lot of other methods, and I felt that would make this harder to review.

@stacimc force-pushed the update/remove-popularity-constants-view branch from 4cc2983 to a8ae7b0 August 25, 2023 22:35
@stacimc marked this pull request as ready for review August 25, 2023 23:01
@stacimc requested review from a team as code owners August 25, 2023 23:01
@sarayourfriend
Collaborator

I've gotten through testing the local stuff and it's working well so far, aside from one caveat that might not necessarily be related to this PR (I haven't tried it on main). Running the popularity refresh DAGs "worked" but for some reason all five of the refresh_popularity tasks got deferred both times I ran it locally. Regardless, I did get the updated values for flickr and stocksnap (despite the DAG never finishing successfully). Same behaviour for the audio one as well.

openledger> select * from image_popularity_metrics
+-----------+--------------------+------------+--------+--------------------+
| provider  | metric             | percentile | val    | constant           |
|-----------+--------------------+------------+--------+--------------------|
| nappy     | downloads          | 0.85       | <null> | <null>             |
| rawpixel  | download_count     | 0.85       | <null> | <null>             |
| wikimedia | global_usage_count | 0.85       | <null> | <null>             |
| stocksnap | downloads_raw      | 0.85       | 120.0  | 21.176470588235297 |
| flickr    | views              | 0.85       | 35.0   | 6.176470588235295  |
+-----------+--------------------+------------+--------+--------------------+

I'm going to take a break now and when I get back I will test the production instructions and then review the code.

@sarayourfriend
Collaborator

Okay, I've tested the production instructions and things work! I still get the mapped task deferral behaviour with working popularity updates following the production instructions 🤷. It does work, just confused what the deal is with the deferred tasks.

Reviewing the code now.

@sarayourfriend
Collaborator

sarayourfriend left a comment

LGTM. My comments are just fiddly suggestions, nothing blocking. As I said in my other comments, it tests perfectly fine for me locally with both sets of instructions. I don't know this process well enough to comment on whether there are additional ways we should test.

One other comment, though. With the production SQL you've shared: would it make sense/document things a bit if rather than updating the pseudo-migrations in docker/upstream_db to reflect the changes, we added a new file in there to run after all the old schema instructions? I know it doesn't get used in production, but it would at least do the following, in my view:

  1. Document those changes in the codebase, making it easier to see what has run in production and help to understand the evolution of the schema over time, even if those changes are run manually in production (and if changes have been made directly in the past, not following a historical/migrations approach).
  2. Test the schema changes "automatically"; at first I thought this would be problematic for local testing and then remembered that actually the entrypoint does not re-run those SQL files when the container goes up, so you have to explicitly tear down the upstream db service anyway.

That's also a fiddly comment and deals more with a general approach in the catalogue's schema management and evolution. Probably fits in more with the wider discussion of #1836 and how to approach things in the meantime.

@@ -27,6 +27,14 @@
# https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/_api/airflow/providers/postgres/hooks/postgres/index.html#airflow.providers.postgres.hooks.postgres.PostgresHook.copy_expert # noqa


def _single_value(cursor):
Collaborator

Now that this is in a shared location and used in other modules, we can remove the underscore prefix, right?

Suggested change
- def _single_value(cursor):
+ def single_value(cursor):

(Requires updates in import locations too, of course)

Collaborator Author

Agreed! I'm going to make a note to update this in the followup in #2678, because even as simple as this is I'd still want to fully retest.

@obulat
Contributor

obulat left a comment

I ran the testing steps locally, and everything worked well except for the refresh_popularity step: all 5 mapped tasks show their status as deferred:
[Screenshot 2023-08-28 at 6:17:47 PM]
Yet, when I check the database popularity values using pgcli, they are updated.

@stacimc
Collaborator Author

stacimc commented Aug 28, 2023

The popularity refresh batched updates should go into a deferred state, but they should poll to check on the status of the update every 30 minutes for image and every 1 minute for audio. I have DATA_REFRESH_POKE_INTERVAL=5 in my catalog/.env to modify this so that both media types poll every 5 seconds locally, but even if you don't have that set audio should complete pretty quickly.
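
A rough sketch of how such an override could be resolved, assuming the defaults described above (30 minutes for image, 1 minute for audio). The `poke_interval` function and the defaults dictionary are hypothetical and are not taken from the catalog's code; only the DATA_REFRESH_POKE_INTERVAL variable name comes from this comment:

```python
import os

# Defaults described above: 30 minutes for image, 1 minute for audio.
DEFAULT_POKE_INTERVAL_SECONDS = {"image": 30 * 60, "audio": 60}


def poke_interval(media_type: str, env=None) -> int:
    """Return the polling interval in seconds, preferring the env override if set."""
    env = os.environ if env is None else env
    override = env.get("DATA_REFRESH_POKE_INTERVAL")
    if override:
        return int(override)
    return DEFAULT_POKE_INTERVAL_SECONDS[media_type]


# No override: fall back to the per-media-type default.
print(poke_interval("image", env={}))
# With DATA_REFRESH_POKE_INTERVAL=5 both media types poll every 5 seconds.
print(poke_interval("audio", env={"DATA_REFRESH_POKE_INTERVAL": "5"}))
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.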

Do you have the DATA_REFRESH_POKE_INTERVAL set? It's also possible something is up with the triggerer in your environments. I'm not particularly concerned since @sarayourfriend confirmed it happened when testing on main in the production instructions, and it does work for me.

@stacimc
Collaborator Author

stacimc commented Aug 28, 2023

> With the production SQL you've shared: would it make sense/document things a bit if rather than updating the pseudo-migrations in docker/upstream_db to reflect the changes, we added a new file in there to run after all the old schema instructions?

@sarayourfriend I love this idea and your further explanation. It would make testing these sorts of changes much less involved in the future 😮 Only because this is blocking so much in the catalog, I think it's worth waiting until after this work is complete to decide on the process, maybe as part of #1836 as you say.

@stacimc
Collaborator Author

stacimc commented Aug 30, 2023

The changes have been applied in production; this can now be merged.

@stacimc merged commit ce14b26 into main Aug 30, 2023
@stacimc deleted the update/remove-popularity-constants-view branch August 30, 2023 16:52
Labels: 💻 aspect: code, ✨ goal: improvement, 🟥 priority: critical, 🧱 stack: catalog, 🧱 stack: ingestion server
3 participants