Add new data refresh factory #4259
Conversation
This needs to be rebased on #4260 because the diff is currently very confusing, since this PR both renames the old data refresh module to …
Force-pushed from 9d2c216 to c26656a
Full-stack documentation: https://docs.openverse.org/_preview/4259

Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.

Changed files 🔄: …
Update: It passed after retrying. I don't know what happened here, but it looks like some kind of race condition?

@stacimc I've tried running the audio data refresh locally, but it's failing at …
Exciting! It passes for me locally, except for the odd issue I had with the audio data refresh, shared in a separate comment on the PR. It did eventually work fine when it retried, without me doing anything, so I'm assuming it's a small edge-case bug, rather than signifying a major issue with the approach overall.
I've requested a non-specific change to make it easier to understand the mapped tasks, but I don't know whether what I'm asking for is possible, or whether I'm misreading the page and it's already there. In any case, it isn't a blocker, but if it can be done, either here or in a follow-up, I think it'd be nice to do (as an issue outside the project scope, not blocking its completion).
```python
METRIC_COLUMN_SETUP_QUERY = dedent(
    """
    ALTER TABLE {temp_table_name} ADD COLUMN IF NOT EXISTS
      standardized_popularity double precision;
    ALTER TABLE {temp_table_name} ALTER COLUMN
      view_count SET DEFAULT 0;
    """
)
```
I was originally going to ask if these dedents were necessary, and left the example below as a suggestion for how to avoid them, but I think the extra whitespace makes it all easier to read... even if needing to pass it through `dedent` is a tedious pain.
Original (with `dedent`):

```python
METRIC_COLUMN_SETUP_QUERY = dedent(
    """
    ALTER TABLE {temp_table_name} ADD COLUMN IF NOT EXISTS
      standardized_popularity double precision;
    ALTER TABLE {temp_table_name} ALTER COLUMN
      view_count SET DEFAULT 0;
    """
)
```

Suggested (without `dedent`):

```python
METRIC_COLUMN_SETUP_QUERY = """
ALTER TABLE {temp_table_name} ADD COLUMN IF NOT EXISTS
  standardized_popularity double precision;
ALTER TABLE {temp_table_name} ALTER COLUMN
  view_count SET DEFAULT 0;
"""
```
I was going to ask the same, but yea, I agree that the current setup is actually a little more helpful!
This is wonderful! I was able to test the process locally and everything functioned as expected 🎉

My only other big note is: I know that much of the logic here is DAG definitions, but it does make me wonder whether anything here can/should have tests associated with it as well. Most of the action in this addition happens in SQL, but it might still be useful to have some tests for the `copy_data` module functions and/or the data refresh types.
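To make that suggestion a bit more concrete, here is a minimal sketch of the kind of unit test this could look like, reusing the `METRIC_COLUMN_SETUP_QUERY` shown in this thread; the temp table name is illustrative, not necessarily what the DAGs actually use.

```python
from textwrap import dedent

METRIC_COLUMN_SETUP_QUERY = dedent(
    """
    ALTER TABLE {temp_table_name} ADD COLUMN IF NOT EXISTS
      standardized_popularity double precision;
    ALTER TABLE {temp_table_name} ALTER COLUMN
      view_count SET DEFAULT 0;
    """
)


def test_metric_column_setup_query_renders_table_name():
    # "temp_import_audio" is an illustrative table name, not the real one.
    rendered = METRIC_COLUMN_SETUP_QUERY.format(temp_table_name="temp_import_audio")
    assert "{temp_table_name}" not in rendered
    assert rendered.count("ALTER TABLE temp_import_audio") == 2
```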
```python
if (
    configured_limit := Variable.get(
        "DATA_REFRESH_LIMIT", default_var=None, deserialize_json=True
    )
) is not None:
```
Should this just be a standard falsy check, and not necessarily `is not None`? This would seem to mean that if `DATA_REFRESH_LIMIT` were set to `0`, that would be returned as the configured limit - maybe that case should also return the default?
The TL;DR is that this version just kept the exact behavior of the limits as we have in the existing ingestion server -- we used to interpret `0` as "no limit", but no longer do so as that's a little misleading. The result is that it's not possible to explicitly configure the limit to be `None`, because it gets interpreted as if the variable weren't defined.

But we can and absolutely might as well fix that while we're here! 😅 So I've updated it to explicitly check whether the variable is defined. We still treat `None` -> "no limit".
```python
# Ensure that only one table is being copied at a time.
max_active_tis_per_dagrun=1
```
Awesome!!
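For context on the snippet above, here is a hedged sketch of how `max_active_tis_per_dagrun` behaves on a dynamically mapped task; the task body and table names are placeholders, not the PR's actual code.

```python
from airflow.decorators import task


@task(max_active_tis_per_dagrun=1)
def copy_table(table_name: str) -> str:
    # Placeholder body; the real task copies catalog data into the API temp table.
    return f"copied {table_name}"


# Inside a DAG definition, mapping over e.g. the audio tables would look like:
# copy_table.expand(table_name=["audio", "audioset"])
# With max_active_tis_per_dagrun=1, the mapped instances run one at a time
# within a single DAG run.
```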
@sarayourfriend I pushed changes but forgot to respond -- the issue you were seeing was related to the same id sequence being used for audio and audioset. I've updated to make sure each table gets its own sequence. Good catch!
@AetherUnbound how do you feel about adding a separate issue to the milestone? I agree with you here but think it might be worth a separate PR because of the work needed to set up those tests :/

That sounds great! I wasn't thinking it'd be that much extra, but having it as a separate issue in the milestone works for not blocking this PR while making sure we don't forget 😄
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal. Excluding weekend days, this PR was ready for review 5 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekdays. @stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
LGTM! One step closer! 😄
I'm getting a DAG import error at the moment:
I'll rebuild the catalog containers and see if that resolves the problem. Maybe my images are on an older Airflow version? Update: A rebuild fixed this. It was indeed that my image was on an older Airflow version. Curious that it didn't automatically rebuild 🤔 Then again, I don't know how it would know to.
Works perfectly! The new mapped tasks view is awesome, exactly what I was hoping for!

By the way, at some point with this work we should add the catalog profile to the ES container in the compose stack. Right now the testing instructions lack specific instructions for running it, so I just used `just catalog/up`, but you need at least `just api/up` to get the ES containers running as well.
Force-pushed from 4bb5bdf to 6a259e0
Fixes
Fixes #4146 by @stacimc
Description
This PR adds the new data refresh factory, which generates DAGs containing (for now) the initial steps and the copy data steps.
These new DAGs will create the temp tables in the API database and copy the data into them from the catalog. They do not apply constraints, perform any indexing, or promote tables. They should not be turned on in production (although it would be harmless if they were accidentally enabled).
In the code, TODOs are intentionally left to indicate where the additional pieces will be added.
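As a rough mental model of the factory described above (a sketch only, with placeholder names and task bodies rather than the PR's actual implementation), the shape is roughly one function that builds a staging data refresh DAG per media type, with only the initial and copy steps wired up:

```python
from datetime import datetime

from airflow.decorators import dag, task


def create_staging_data_refresh_dag(media_type: str):
    @dag(
        dag_id=f"staging_{media_type}_data_refresh",
        schedule=None,
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )
    def data_refresh():
        @task
        def create_temp_table() -> str:
            # Placeholder: the real step creates the temp table in the API database.
            # The temp table name here is illustrative.
            return f"temp_import_{media_type}"

        @task
        def copy_data(temp_table: str) -> str:
            # Placeholder: the real step copies catalog data into the temp table.
            return temp_table

        copy_data(create_temp_table())
        # TODO (per the description above): apply constraints, build ES indices,
        # and promote the tables in later steps.

    return data_refresh()


for media_type in ("audio", "image"):
    create_staging_data_refresh_dag(media_type)
```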
Testing Instructions
Locally, run an image and an audio DAG for a few minutes to ingest some new records (I ran Cleveland and Jamendo).

Now try the `staging_audio_data_refresh` and `staging_image_data_refresh` DAGs. These should run successfully. You can trigger them at the same time to verify that the concurrency handling works (whichever one starts second should wait on the other).

Since these don't yet build the new ES indices or promote the tables, we can't rely on the logs from the `report_record_counts` tasks to see if they worked. Instead, run `just api/pgcli` to access the local API database, and check that the temp tables were created and that they contain the expected number of records. Remember to check that a temp table was created for the `audioset` table as well as `audio`!

If you want to run the DAGs multiple times, you'll need to manually drop the temp tables here as well.

This code refactors everything from the ingestion server's `refresh_api_table`. You can review that code for comparison. I also recommend running a "normal" or legacy data refresh and checking the logs for the SQL that is executed during a data refresh, to compare with what is run here.
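If it's useful, here is a small sketch of the kind of temp-table check described above; the table names and connection string are placeholders (in practice, `just api/pgcli` puts you in the right database directly):

```python
import psycopg2  # assumes psycopg2 is installed locally

# Placeholder DSN; substitute the local API database credentials.
DSN = "postgresql://user:password@localhost:5432/openledger"

# Illustrative temp table names; confirm the actual names in the API database.
TABLES = ("temp_import_audio", "temp_import_audioset", "temp_import_image")

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    for table in TABLES:
        cur.execute(f"SELECT COUNT(*) FROM {table};")
        print(table, cur.fetchone()[0])
```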
Checklist

- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- I ran the DAG documentation generator (`just catalog/generate-docs` for catalog PRs) or the media properties generator (`just catalog/generate-docs media-props` for the catalog or `just api/generate-docs` for the API) where applicable.
Developer Certificate of Origin