-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor popularity SQL #2964
Refactor popularity SQL #2964
Conversation
This reverts commit 1735177.
@stacimc Just letting you know that I've seen this PR and will review it tomorrow. My brain is fried a the moment and I won't be able to comprehend this right but, but thanks in advance for what look like an excellent description and test instructions. Your PR descriptions are second to none and I deeply appreciate them 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That new decorator is a nifty abstraction! It's clear how much it cleans up the code that relies on it. Nice work identifying that.
Just leaving comments before testing locally, but doing that local testing right now.
@dataclass | ||
class SQLInfo: | ||
""" | ||
Configuration object for a media type's popularity SQL info. | ||
|
||
Required Constructor Arguments: | ||
|
||
media_table: name of the main media table | ||
metrics_table: name of the popularity metrics table | ||
standardized_popularity_fn: name of the standardized_popularity sql | ||
function | ||
popularity_percentile_fn: name of the popularity percentile sql | ||
function | ||
|
||
""" | ||
|
||
media_table: str | ||
metrics_table: str | ||
standardized_popularity_fn: str | ||
popularity_percentile_fn: str | ||
|
||
|
||
SQL_INFO_BY_MEDIA_TYPE = { | ||
AUDIO: SQLInfo( | ||
media_table=AUDIO, | ||
metrics_table="audio_popularity_metrics", | ||
standardized_popularity_fn="standardized_audio_popularity", | ||
popularity_percentile_fn="audio_popularity_percentile", | ||
), | ||
IMAGE: SQLInfo( | ||
media_table=IMAGE, | ||
metrics_table="image_popularity_metrics", | ||
standardized_popularity_fn="standardized_image_popularity", | ||
popularity_percentile_fn="image_popularity_percentile", | ||
), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. There's a similar dataclass in the API's testconf: https://github.com/WordPress/openverse/blob/HEAD/api/test/unit/conftest.py#L59-L92
def setup_kwargs_for_media_type( | ||
values_by_media_type: dict[str, Any], kwarg_name: str | ||
) -> callable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question regarding the implementation: is the customisable kwarg_name
a byproduct of existing implementations? I'm wondering specifically why we don't always use the media_info
kwarg name. Is it because existing implementations use a different name and you wanted to avoid making changes to those names?
Edit: Never mind. I misunderstood how this function was used, but after reading setup_sql_info_for_media_type
at the end, I get now why kwarg_name
is configurable. It's dependent on what the "values" are in values_by_media_type
👍 Disregard this comment. Leaving it intact just in case someone else runs into the same misconception about how this works.
def setup_db_columns_for_media_type(func: callable) -> callable: | ||
"""Provide media-type-specific DB columns as a kwarg to the decorated function.""" | ||
return setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns")(func) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, these implementations can be simplified a bit to:
def setup_db_columns_for_media_type(func: callable) -> callable: | |
"""Provide media-type-specific DB columns as a kwarg to the decorated function.""" | |
return setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns")(func) | |
setup_db_columns_for_media_type = setup_kwargs_for_media_type(DB_COLUMNS_BY_MEDIA_TYPE, "db_columns") | |
"""Provide media-type-specific DB columns as a kwarg to the decorated function.""" |
I'm not sure if that results in a worse docstring experience in the editor, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally! I didn't like the way the docstrings looked, but I actually have no idea what the "ideal" solution is here, that's obviously personal preference. I do think these merit the docstring either way, though.
raw_percentile_value: float, | ||
media_type: str, | ||
popularity_metrics: dict, | ||
sql_info: SQLInfo = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the None
default required here? Won't it always be supplied by @setup_sql_info_for_media_type
? Is the default to satisfy a typecheck for the caller, I guess so editors don't think the callsite needs to pass that kwarg?
I guess if the decorator implemented ParamSpec on the callable
type of the wrapped function and itself set the kwarg to optional it could handle the call site documentation. Complex though, so not worth it. It does a bit of confusion in the function implementation though, as sql_info
would never actually be None
... the decorator itself overrides the value, even if the caller explicitly set it to None
. I'd be tempted to check that sql_info is not None
if I was working in this function and didn't have a nuanced understanding of how setup_sql_info_for_media_type
worked.
To summarise: the None
default is confusing when making changes to the function, if the decorator's implementation details and implications1 aren't well understood. The risk of error is low (checking sql_info is not None
won't hurt, it's just technically useless) so it isn't a blocking issue, just a nit pick. Considering the complexity that it could take to solve this, if editors complain about not passing sql_info
, it might not be worth it to make any changes here. On the other hand, if the default parameter isn't necessary to satisfy basic editor introspection, it would improve the clarity of these functions to remove the default None
value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on all counts -- unfortunately, the None
default does appear to be required when the function is also using the @task
decorator for TaskFlow, or we get errors at DAG parsing :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Like I said, great clean up, nice work identifying opportunities for clean and sensible abstractions 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love how elegant the decorators solution is, @stacimc ! It's also nice to see the code to gradually move to @task
syntax.
I ran everything in the testing instructions, and everything went well. I added a non-blocking comments inline.
Fixes
Fixes #2678 by @stacimc
Description
This PR is entirely clean-up/refactor and should have no functional changes to the catalog.
Several large files were moved, and some appear as entirely new files instead of as 'renamed' in the diff, contributing to a misleadingly large line number count. There are many changes just updating method signatures in tests, which also add a lot of lines. The review guide below goes through each file and explains the changes to make this easier to spot.
At a high level, the actual changes made are:
recreate_<media_type>_popularity_calculation
DAGs. These now just drop the popularity functions and recreate them. This allows the DAG to be used for updating any code changes made to the percentile or standardized popularity functions, but it won't affect the metrics table or constants.Testing Instructions
Check out this branch and try
just recreate
, to thoroughly test building the schema and loading sample data.Now try running:
pull_data
as successful as soon as a few 100 lines are written; we mostly want to test the loader steps.recreate_<media>_popularity_calculation
DAGs.create_<media_type>_filtered_index
DAGs before starting the data refreshes.batched_update
DAG before starting these.Code review guide
To review the code, here's an explanation of what changed file by file.
The only real new code is in creating the decorators. Recommended reviewing order:
Other, minor updates (mostly moving stuff around):
Test stuff:
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin