-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix advanced provider options for dags that override ingest records #4214
Fix advanced provider options for dags that override ingest records #4214
Conversation
…onal fixed query params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is definitely a complex set of circumstances! The ability to pull the fixed parameters out of the initial parameters is really great. I tested most of the cases locally and all worked as expected!
initial_fixed_params = next( | ||
( | ||
qp | ||
for qp in fixed_query_params | ||
if qp.items() <= self.initial_query_params.items() | ||
), | ||
fixed_query_params[0], | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is...really wild that it works, honestly!
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal Excluding weekend1 days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2. @stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things are getting hairy around the DAGs configuration—wow! But everything seems under control, which is impressive! 👏 I followed the instructions and tested some cases and params increment correctly. Nothing to add on my part.
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Krystle Salazar <[email protected]>
Definitely -- we have a lot of use cases to cover 😅 I'm pleased to get more of that complexity into the base class, at least, so that the provider ingesters don't need to know about any of that! |
Fixes
Fixes #4207 by @stacimc
This PR aimed to fix a problem with Science Museum, but it turned out the source of the problem was a larger failure in the way many of our providers override aspects of the ProviderDataIngester base class. Consequently this PR actually ends up fixing quite a bit more!
When this is merged, we will be able to do a full DagRun of Science Museum with
skip_ingestion_errors
enabled, and actually process all possible records. This is blocking in order to enable Science Museum in production (see #4013 for context).This is a difficult one to explain, so please ping me if anything needs further explanation.
Description
Our provider workflows have several advanced options in DagRun confs -- notably,
initial_query_params
(which overrides the starting point of the DAG, allowing us to resume a DAG from a point of failure) andskip_ingestion_errors
(which skips batches with errors rather than halting the DAG, and then reports them in aggregate at the end).We also have a number of DAGs that override the
ingest_records
method in order to run ingestion multiple times for a set of fixed query params. (For example, Science Museum runsingest_records
multiple times for different date ranges. Within each date range, the page number and other params advance as normal). This breaks the advanced conf options for all of these DAGs. For example:initial_query_params
to a DAG that overridesingest_records
, it will start the first batch on those query params. It will also do so every timeingest_records
is called. This completely breaks the fixed params.skip_ingestion_errors
, ingestion errors will be skipped and then reported in aggregate -- but at the end of the current call toingest_records
. So for Science Museum for example, enablingskipped_ingestion_errors
will only allow the DAG to get through the first date range where it encounters an error. In production, this was in the 1500-1750 date range. So we did not ingest any records after 1750, even though we hadskip_ingestion_errors
enabled!This PR updates the ProviderDataIngester base class to add support for
fixed_query_params
in a way that does not break the advanced conf options, and allows all the DAGs that were previously overridingingest_records
to no longer do so. It does this by adding aget_fixed_query_params
method to the class which can optionally be overridden to provide a set of fixed query params, for each of whichingest_records
will be called. It also updates the TimeDelineatedProviderDataIngester to work with this new feature. Finally it updates all of the DAGs to use this.Unfortunately it was not really possible to do this in smaller pieces. The good news is that this PR fixes
initial_query_params
andskip_ingestion_errors
for a ton of DAGs that they would not have previously reliably worked for!ℹ️ Tip for reviewing: look at the changes to ProviderDataIngester first, then TimeDelineatedProviderDataIngester
Testing Instructions
To test this more easily, I applied the diff below to ProviderDataIngester to make it run more quickly (only processing the first two batches of each fixed param), and to log the query_params being used so you can see that they're advancing as expected.
Test some provider DAGs that were not changed/should not be affected
For example I ran Cleveland and ensured that it passed.
Test DAGs that previously overrode
ingest_records
Run all the DAGs that previously overrode
ingest_records
and ensure they do not fail:date
conf option to 2024-04-24 to ensure you get a day with data)You should see in the logs that the DAGs run ingestion multiple times for each of their fixed params. Science Museum for example will run
_ingest_records
for each of its date ranges. You can see in the logs something like this (I've omitted logs that are not relevant for brevity):Test
initial_query_params
Try running Science Museum with
initial_query_params
set to{"has_image": 1, "image_license": "CC", "page[size]": 100, "page[number]": 4, "date[from]": 200, "date[to]": 1500}
. Note that I've started at the second date range (a fixed param) and I've also started at a later page (a non fixed param).Verify in the logs that it starts ingestion on the correct date range and page number. The page number advances appropriately instead of resetting back to 1. When it advances to the next date range, it advances appropriately (200-1500 to 1500-1750, instead of resetting back to 0-200):
Test
skip_ingestion_errors
Update science_museum.py to raise a ValueError in
get_batch_data
. Keep your other patch applied so that it will only try to pull 2 batches per date range, and then run the DAG again. This time enable theskip_ingestion_errors
option and verify that not only does it skip errors for the first date range, but it continues to the next date range as well.Test both options at the same time, with a TimeDelineated provider
The TimeDelineatedProviderDataIngester should be tested separately. You can do this for Finnish or Flickr. Finnish is particularly finicky because it uses fixed params (iterating over
buildings
) and timestamp pairs.Update Finnish museums to raise an error during
get_batch_data
. Then run the DAG withdate
set to2024-04-24
,skip_ingestion_errors
enabled, andinitial_query_params
:You will have to wait for it to generate all the timestamp pairs. Once it begins ingestion, you should see in the logs that it begins at the
Museovirasto
building for timestamps[2024-04-24T20:09:00Z TO 2024-04-24T20:12:00Z]
and page 650. It should get an error which it will skip and advance to page 651. It should then proceed to the next chronological timestamp for the same building, starting at page 1.Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin