Refactor popularity SQL #2964

Merged
merged 20 commits into from
Sep 12, 2023

Conversation

stacimc
Collaborator

@stacimc stacimc commented Sep 2, 2023

Fixes

Fixes #2678 by @stacimc

Description

This PR is entirely clean-up/refactor and should have no functional changes to the catalog.

Several large files were moved, and some appear in the diff as entirely new files rather than as renames, which makes the line count misleadingly large. Many changes only update method signatures in tests, which also adds a lot of lines. The review guide below goes through each file and explains the changes to make this easier to spot.

At a high level, the actual changes made are:

  • I created decorators that are used to fill in media-type-specific kwargs in SQL tasks. This replaces old, duplicated code in every one of these functions that would check the media type and set these params.
  • I updated the loader and popularity SQL to use these decorators
  • I removed most of the functionality of the recreate_<media_type>_popularity_calculation DAGs. These now just drop the popularity functions and recreate them. This allows the DAG to be used for updating any code changes made to the percentile or standardized popularity functions, but it won't affect the metrics table or constants.
  • I dropped all references to the audio and image matviews, which are no longer used
  • Very minor cleanup like adding types and converting tasks to use TaskFlow

Testing Instructions

Check out this branch and run just recreate to thoroughly test building the schema and loading sample data.

Now try running:

  • At least one audio and one image provider DAG. Suggestions: jamendo, SMK. You can mark the pull_data task as successful as soon as a few hundred lines are written; we mostly want to test the loader steps.
  • Both recreate_<media>_popularity_calculation DAGs.
  • Both data refreshes. Remember to enable the create_<media_type>_filtered_index DAGs before starting the data refreshes.
  • Both popularity refreshes. Remember to enable the batched_update DAG before starting these.

Code review guide

To review the code, here's an explanation of what changed file by file.

The only real new code is in creating the decorators. Recommended reviewing order:

  • dags/common/constants.py: creates a new SQLInfo dataclass for storing media-type-specific SQL information (table names, function names). Sets these up for audio and image.
  • dags/common/utils.py: Creates a decorator factory for creating decorators that will supply media-type-specific params to a function. Example usage is given in docs, and the factory is also used in this file to create the main decorator for supplying media-type-specific SQLInfo
  • dags/common/storage/db_columns.py: Create a decorator for supplying media-type-specific DB columns to functions
  • dags/common/storage/tsv_columns.py: Create a decorator for supplying media-type-specific TSV columns to functions
  • dags/popularity/sql.py: This is not new code, it was just moved from dags/common/popularity/sql.py. The tasks were all updated to use TaskFlow decorators, and to get media-type specific SQLInfo from the decorators. I also dropped some now unused code (related to creating the matview which no longer exists).
  • dags/common/loader/sql.py: Updates functions to get media-type-specific params from decorators
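
As a rough illustration of the decorator pattern described in this list, the sketch below shows one way such a factory could work. Only the names setup_kwargs_for_media_type, setup_sql_info_for_media_type, SQLInfo, and SQL_INFO_BY_MEDIA_TYPE come from this PR. The wrapper body, the requirement that media_type be passed as a keyword argument, the ValueError for unknown media types, and the example function describe_popularity_tables are all assumptions made for illustration and may differ from the merged code.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SQLInfo:
    media_table: str
    metrics_table: str
    standardized_popularity_fn: str
    popularity_percentile_fn: str


# Values follow the mapping shown in this PR; the AUDIO and IMAGE constants
# are written out as plain strings here, which is an assumption.
SQL_INFO_BY_MEDIA_TYPE = {
    "audio": SQLInfo(
        media_table="audio",
        metrics_table="audio_popularity_metrics",
        standardized_popularity_fn="standardized_audio_popularity",
        popularity_percentile_fn="audio_popularity_percentile",
    ),
    "image": SQLInfo(
        media_table="image",
        metrics_table="image_popularity_metrics",
        standardized_popularity_fn="standardized_image_popularity",
        popularity_percentile_fn="image_popularity_percentile",
    ),
}


def setup_kwargs_for_media_type(
    values_by_media_type: dict[str, Any], kwarg_name: str
) -> Callable:
    """Build a decorator that injects the value matching the call's media_type."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            # Assumption: callers always pass media_type as a keyword argument.
            media_type = kwargs.get("media_type")
            if media_type not in values_by_media_type:
                raise ValueError(f"No values for media type: {media_type}")
            # The looked-up value replaces anything the caller supplied.
            kwargs[kwarg_name] = values_by_media_type[media_type]
            return func(*args, **kwargs)

        return wrapped

    return decorator


def setup_sql_info_for_media_type(func: Callable) -> Callable:
    """Provide media-type-specific SQLInfo as a kwarg to the decorated function."""
    return setup_kwargs_for_media_type(SQL_INFO_BY_MEDIA_TYPE, "sql_info")(func)


# Hypothetical function, used only to show the decorator in action.
@setup_sql_info_for_media_type
def describe_popularity_tables(*, media_type: str, sql_info: SQLInfo = None) -> str:
    return f"{sql_info.media_table} -> {sql_info.metrics_table}"


print(describe_popularity_tables(media_type="audio"))
# prints "audio -> audio_popularity_metrics"
```

With this shape, a function body never branches on the media type itself; it only reads the injected object.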

Other, minor updates (mostly moving stuff around):

  • DAGS.md: updates the documentation for the recreate_popularity_calculation DAGs to note that they no longer drop any views or refresh standardized scores.
  • dags/common/popularity/*: Files were deleted or moved to dags/popularity, as they are no longer shared by any other DAGs
  • dags/common/sql.py: Just removing an _ prefix from a utility method
  • dags/data_refresh/dag_factory.py: Just updates use of renamed utility
  • dags/data_refresh/recreate_popularity_calculation_dag_factory.py: This got moved to dags/popularity
  • dags/popularity/popularity_refresh_dag_factory.py: Some code got pulled out into sql.py. The popularity refresh steps were also moved into this file for simplicity, since they are no longer shared.
  • dags/popularity/popularity_refresh_types.py: Previously, the popularity metrics dicts were hard-coded in popularity/sql.py. I moved these to be configuration that’s part of the PopularityRefresh, so all popularity configuration is in one (hopefully intuitive) place.
  • dags/popularity/recreate_popularity_calculation_dag_factory.py: This file was moved from dags/data_refresh. In addition, I took out all the tasks related to dropping and recreating the constants view (which no longer exists) and the metric table (which we never want to drop). So this DAG now just drops and recreates the SQL functions.
  • dags/popularity/refresh_popularity_metrics_task_factory.py: This was previously shared code. It was deleted and the popularity tasks are now part of the popularity dag factory.
  • docker/upstream_db/*: Remove materialized views
  • load_sample_data.sh: Do not try to refresh the now dropped view.

Test stuff:

  • tests/dags/common/conftest.py: All these deleted fixtures were simply moved up a level to dags/conftest.py so they are accessible in more tests
  • tests/dags/common/loader/test_sql.py: Just updating method signatures in tests
  • tests/dags/common/popularity/*: Moved to tests/dags/popularity
  • tests/dags/conftest.py: Test fixtures were moved to this file
  • tests/dags/popularity/test_popularity_refresh_types.py: An existing test was pulled out into this file
  • tests/dags/popularity/test_sql.py: Mostly, updating the test data setups and calling the decorated functions correctly. Tests related to the matview were dropped.
  • tests/dags/test_dag_parsing: Add init tests for popularity dags
  • tests/test_utils/sql.py: Update constants

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Sep 2, 2023
@stacimc stacimc self-assigned this Sep 2, 2023
@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Sep 2, 2023
@stacimc stacimc marked this pull request as ready for review September 5, 2023 23:36
@stacimc stacimc requested review from a team as code owners September 5, 2023 23:36
@stacimc stacimc requested a review from a team as a code owner September 5, 2023 23:36
@sarayourfriend
Collaborator

@stacimc Just letting you know that I've seen this PR and will review it tomorrow. My brain is fried at the moment and I won't be able to comprehend this right now, but thanks in advance for what looks like an excellent description and test instructions. Your PR descriptions are second to none and I deeply appreciate them 🙏

Collaborator

@sarayourfriend sarayourfriend left a comment

That new decorator is a nifty abstraction! It's clear how much it cleans up the code that relies on it. Nice work identifying that.

Just leaving comments before testing locally, but doing that local testing right now.

Comment on lines +42 to +77
@dataclass
class SQLInfo:
    """
    Configuration object for a media type's popularity SQL info.

    Required Constructor Arguments:

    media_table: name of the main media table
    metrics_table: name of the popularity metrics table
    standardized_popularity_fn: name of the standardized_popularity sql
                                function
    popularity_percentile_fn: name of the popularity percentile sql
                              function

    """

    media_table: str
    metrics_table: str
    standardized_popularity_fn: str
    popularity_percentile_fn: str


SQL_INFO_BY_MEDIA_TYPE = {
    AUDIO: SQLInfo(
        media_table=AUDIO,
        metrics_table="audio_popularity_metrics",
        standardized_popularity_fn="standardized_audio_popularity",
        popularity_percentile_fn="audio_popularity_percentile",
    ),
    IMAGE: SQLInfo(
        media_table=IMAGE,
        metrics_table="image_popularity_metrics",
        standardized_popularity_fn="standardized_image_popularity",
        popularity_percentile_fn="image_popularity_percentile",
    ),
}
Collaborator

Awesome. There's a similar dataclass in the API's testconf: https://github.com/WordPress/openverse/blob/HEAD/api/test/unit/conftest.py#L59-L92

catalog/dags/common/utils.py
Comment on lines +7 to +9
def setup_kwargs_for_media_type(
    values_by_media_type: dict[str, Any], kwarg_name: str
) -> callable:
Collaborator

Question regarding the implementation: is the customisable kwarg_name a byproduct of existing implementations? I'm wondering specifically why we don't always use the media_info kwarg name. Is it because existing implementations use a different name and you wanted to avoid making changes to those names?

Edit: Never mind. I misunderstood how this function was used, but after reading setup_sql_info_for_media_type at the end, I get now why kwarg_name is configurable. It's dependent on what the "values" are in values_by_media_type 👍 Disregard this comment. Leaving it intact just in case someone else runs into the same misconception about how this works.
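
To make the point about the configurable kwarg_name concrete, here is a minimal sketch in which one factory produces two decorators that inject differently named kwargs into the same function. The factory body is an assumed implementation, and TABLE_BY_MEDIA_TYPE, the column lists, and build_select are hypothetical placeholders rather than code from the PR.

```python
import functools
from typing import Any, Callable


def setup_kwargs_for_media_type(
    values_by_media_type: dict[str, Any], kwarg_name: str
) -> Callable:
    # Assumed implementation: look up the value by the media_type kwarg and
    # pass it to the wrapped function under the configured kwarg name.
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            kwargs[kwarg_name] = values_by_media_type[kwargs["media_type"]]
            return func(*args, **kwargs)

        return wrapped

    return decorator


# Placeholder values; the real column definitions are richer objects.
DB_COLUMNS_BY_MEDIA_TYPE = {
    "audio": ["identifier", "duration"],
    "image": ["identifier", "width"],
}
TABLE_BY_MEDIA_TYPE = {"audio": "audio", "image": "image"}

# Each mapping holds a different kind of value, so each gets its own kwarg name.
setup_db_columns = setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns")
setup_table = setup_kwargs_for_media_type(TABLE_BY_MEDIA_TYPE, "table")


@setup_db_columns
@setup_table
def build_select(*, media_type: str, db_columns: list = None, table: str = None) -> str:
    return f"SELECT {', '.join(db_columns)} FROM {table}"


print(build_select(media_type="image"))
# prints "SELECT identifier, width FROM image"
```

If the kwarg name were fixed, the two decorators could not be stacked on one function without overwriting each other's value.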

Comment on lines +88 to +90
def setup_db_columns_for_media_type(func: callable) -> callable:
    """Provide media-type-specific DB columns as a kwarg to the decorated function."""
    return setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns")(func)
Collaborator

FWIW, these implementations can be simplified a bit to:

Suggested change
- def setup_db_columns_for_media_type(func: callable) -> callable:
-     """Provide media-type-specific DB columns as a kwarg to the decorated function."""
-     return setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns")(func)
+ setup_db_columns_for_media_type = setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns")
+ """Provide media-type-specific DB columns as a kwarg to the decorated function."""

I'm not sure if that results in a worse docstring experience in the editor, though.

Collaborator Author

Totally! I didn't like the way the docstrings looked, but I actually have no idea what the "ideal" solution is here, that's obviously personal preference. I do think these merit the docstring either way, though.
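
The docstring trade-off discussed in this thread can be checked at runtime. The sketch below compares the two spellings, using an assumed factory body and placeholder column lists. In plain Python, a string literal placed after an assignment is not attached to the object, so the assigned form has no __doc__; some documentation tools and editors do recognize such attribute docstrings, so the editor experience may still vary.

```python
import functools
from typing import Any, Callable


def setup_kwargs_for_media_type(
    values_by_media_type: dict[str, Any], kwarg_name: str
) -> Callable:
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            kwargs[kwarg_name] = values_by_media_type[kwargs["media_type"]]
            return func(*args, **kwargs)

        return wrapped

    return decorator


# Placeholder values for illustration only.
DB_COLUMNS_BY_MEDIA_TYPE = {"audio": ["duration"], "image": ["width"]}


# Spelling 1: an explicit function, which carries its own docstring.
def setup_db_columns_for_media_type(func: Callable) -> Callable:
    """Provide media-type-specific DB columns as a kwarg to the decorated function."""
    return setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns")(func)


# Spelling 2: direct assignment. The name is hypothetical, chosen so both
# spellings can coexist in one file.
setup_db_columns_assigned = setup_kwargs_for_media_type(
    DB_COLUMNS_BY_MEDIA_TYPE, "db_columns"
)
"""Provide media-type-specific DB columns as a kwarg to the decorated function."""

print(setup_db_columns_for_media_type.__doc__ is not None)  # prints True
print(setup_db_columns_assigned.__doc__ is None)  # prints True
```

Both spellings behave identically as decorators; only the introspectable documentation differs.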

    raw_percentile_value: float,
    media_type: str,
    popularity_metrics: dict,
    sql_info: SQLInfo = None,
Collaborator

Is the None default required here? Won't it always be supplied by @setup_sql_info_for_media_type? Is the default to satisfy a typecheck for the caller, I guess so editors don't think the callsite needs to pass that kwarg?

I guess if the decorator implemented ParamSpec on the callable type of the wrapped function and itself set the kwarg to optional it could handle the call site documentation. Complex though, so not worth it. It does add a bit of confusion in the function implementation though, as sql_info would never actually be None... the decorator itself overrides the value, even if the caller explicitly set it to None. I'd be tempted to check that sql_info is not None if I was working in this function and didn't have a nuanced understanding of how setup_sql_info_for_media_type worked.

To summarise: the None default is confusing when making changes to the function, if the decorator's implementation details and implications aren't well understood. The risk of error is low (checking sql_info is not None won't hurt, it's just technically useless) so it isn't a blocking issue, just a nitpick. Considering the complexity that it could take to solve this, if editors complain about not passing sql_info, it might not be worth it to make any changes here. On the other hand, if the default parameter isn't necessary to satisfy basic editor introspection, it would improve the clarity of these functions to remove the default None value.

Collaborator Author

I agree on all counts -- unfortunately, the None default does appear to be required when the function is also using the @task decorator for TaskFlow, or we get errors at DAG parsing :(
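
The override behaviour described in this thread can be sketched without Airflow. The example assumes the wrapper assigns the kwarg unconditionally, as the review comment describes. METRICS_TABLE_BY_MEDIA_TYPE, setup_metrics_table, and count_rows_sql are hypothetical stand-ins for the real SQLInfo plumbing, and the interaction with the @task decorator is not reproduced here.

```python
import functools
from typing import Any, Callable


def setup_kwargs_for_media_type(
    values_by_media_type: dict[str, Any], kwarg_name: str
) -> Callable:
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            # Unconditional assignment: any caller-supplied value is replaced.
            kwargs[kwarg_name] = values_by_media_type[kwargs["media_type"]]
            return func(*args, **kwargs)

        return wrapped

    return decorator


# Hypothetical mapping of plain strings, standing in for SQLInfo objects.
METRICS_TABLE_BY_MEDIA_TYPE = {
    "audio": "audio_popularity_metrics",
    "image": "image_popularity_metrics",
}
setup_metrics_table = setup_kwargs_for_media_type(
    METRICS_TABLE_BY_MEDIA_TYPE, "metrics_table"
)


@setup_metrics_table
def count_rows_sql(*, media_type: str, metrics_table: str = None) -> str:
    # With the decorator applied, metrics_table is never None here.
    return f"SELECT COUNT(*) FROM {metrics_table}"


omitted = count_rows_sql(media_type="audio")
explicit_none = count_rows_sql(media_type="audio", metrics_table=None)
print(omitted == explicit_none)  # prints True

# The None default is only observable when the decorator is bypassed, for
# example through the __wrapped__ attribute that functools.wraps sets.
print(count_rows_sql.__wrapped__(media_type="audio"))
# prints "SELECT COUNT(*) FROM None"
```

Under these assumptions, a None check inside the decorated function is harmless but never triggers, which matches the nitpick raised above.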

catalog/dags/popularity/sql.py Outdated
Collaborator

@sarayourfriend sarayourfriend left a comment

LGTM. Like I said, great clean up, nice work identifying opportunities for clean and sensible abstractions 🚀

Contributor

@obulat obulat left a comment

Love how elegant the decorators solution is, @stacimc! It's also nice to see the code gradually move to @task syntax.
I ran everything in the testing instructions, and everything went well. I added some non-blocking comments inline.

catalog/dags/common/utils.py
catalog/dags/common/utils.py Outdated
@stacimc stacimc merged commit 5112f10 into main Sep 12, 2023
45 checks passed
@stacimc stacimc deleted the update/clean-up-popularity-sql branch September 12, 2023 19:05

Successfully merging this pull request may close these issues.

Create a mapping of media type to SQL constants
3 participants