This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Add configuration options to skip ingestion errors #650

Merged
merged 24 commits into main from feature/add-options-to-skip-ingestion-errors
Aug 16, 2022

Conversation


@stacimc stacimc commented Aug 6, 2022

Fixes

Fixes WordPress/openverse#1653 by @AetherUnbound

Background

Problem
If a provider workflow encounters errors during the pull_data step, the task will fail and any partial data that was collected will be loaded. To resume ingestion, the DAG must be manually re-run, and because the ingestion task has no context from the previous run, it will re-process all of the data from the API from the very beginning. This can be very time-consuming!

A note about alternative solutions
I attempted several strategies for allowing the pull_data task to retry, picking up from the last set of query parameters. This goes against the philosophy of Airflow, which expects tasks to be idempotent. Ultimately, even if we were willing to slightly misuse Airflow in this way, I think it would be very difficult if not impossible. For instance, Airflow actually explicitly clears XComs on every task retry.

Description

This PR attempts to tackle the problem with the addition of three new DagRun configuration options, which will be supported out of the box by all provider workflows extending the ProviderDataIngester:

  • skip_ingestion_errors: When set to true, any errors encountered during the fetching/processing of batch data will be caught, and ingestion allowed to continue. At the end of the ingestion, if there were any caught errors the task will still fail and report a summary of all encountered errors.
  • initial_query_params: When set, this will be passed in as the first set of query_params, and ingestion will resume from there.
  • query_params_list: A list of query_params. When set, ingestion will run for just this list of query_params. This can be used to re-run ingestion for just the problematic batches.

This PR also adds a new IngestionError custom exception, which extends error messages to also report the query_params that were in use when the error occurred. These values can then be pasted into the initial_query_params setting of the DagRun configuration to re-run starting at the failed batch.
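For illustration, a manual DagRun conf combining these options might look roughly like the following. This is only a sketch; the query parameter keys shown are Cleveland Museum-style values borrowed from the testing section below, not a required schema:

{
    "skip_ingestion_errors": true,
    "initial_query_params": {"cc": "1", "has_image": "1", "limit": 100, "skip": 200}
}

query_params_list would be supplied instead of initial_query_params when re-running only a known set of batches.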

Notes/To Do

Because errors are caught within ingest_records, this means batches are skipped as opposed to individual records. We could fine tune this by only (or additionally) wrapping get_record_data in the try/except. What do you think?
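For reference, here is a minimal, self-contained sketch of the batch-level pattern described above. The function and variable names are illustrative only and do not match the ProviderDataIngester code exactly:

class IngestionError(Exception):
    # Wraps the original error together with the query_params in use when it occurred.
    def __init__(self, error, query_params):
        super().__init__(f"{error}\n    query_params: {query_params}")
        self.error = error
        self.query_params = query_params


def ingest_batches(get_batch, process_batch, query_params_list, skip_ingestion_errors=False):
    # Process each batch, optionally collecting errors instead of failing immediately.
    ingestion_errors = []
    for query_params in query_params_list:
        try:
            process_batch(get_batch(query_params))
        except Exception as error:
            if not skip_ingestion_errors:
                raise IngestionError(error, query_params) from error
            ingestion_errors.append(IngestionError(error, query_params))
    if ingestion_errors:
        # After all batches have been attempted, fail once with a summary.
        raise Exception(f"{len(ingestion_errors)} ingestion errors occurred while running.")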

Testing Instructions

Setup

To test this you will need to make a provider script throw periodic errors. I used Cleveland Museum and edited get_batch_data to add the following:

# Throw errors on batches 3, 6, and 8 (since skip starts at 0)
if response_json['info']['parameters']['skip'] in ['200', '500', '700']:
    raise Exception("Test error")

I also updated the batch_limit of Cleveland Museum to 100 to make it run slightly faster (if you do not do this, you'll need to use values ['2000', '5000', '7000'] in the code above).

I also capped the amount of records processed by adding an ingestion_limit Airflow variable in the UI with a value of 1000.

Tests

Let's simulate how we might use these configuration variables in the real world!

Simulate a scheduled DagRun (no conf options supplied) encountering an error

  • Run the Cleveland Museum workflow. If configured like mine, it will throw an error on page 2. Verify that the new Error message looks something like this:
[2022-08-06, 00:20:43 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: Test error
    query_params: {"cc": "1", "has_image": "1", "limit": 100, "skip": 200}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
<...>

Re-start the DAG from the point at which it failed, and silence errors

You can trigger a manual run with a config by right-clicking the play button and selecting the option:

[Screenshot: the Airflow UI option for triggering a manual DAG run with config]

Here's an example of what my configuration looks like. For the initial_query_params, I've just copied directly from the error message in the previous step.

[Screenshot: example DagRun conf enabling skip_ingestion_errors and setting initial_query_params]
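For reference (in case the screenshot is unavailable), the conf was roughly the following, with initial_query_params pasted from the error message in the previous step:

{
    "skip_ingestion_errors": true,
    "initial_query_params": {"cc": "1", "has_image": "1", "limit": 100, "skip": 200}
}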

  • When skip_ingestion_errors is enabled, the DAG should run until the ingestion_limit is reached, and then the pull_data task should fail with a single AggregateIngestionError. Check the logs to see that all of the individual errors with their tracebacks were logged, along with a list of the query_params that caused the issues. This is logged separately for ease of copy/pasting into the dagrun conf.

  • Check the logs to verify that the initial_query_params used were the expected ones. You should see something like:

Using initial_query_params from dag_run conf: {"cc": "1", "has_image": "1", "limit": 100, "skip": 200}
  • Optionally play around with only setting one of skip_ingestion_errors or initial_query_params at a time

Simulate 'fixing the bug' and re-run the DAG for only the failed query_parameters

Our bug is easy to fix: just remove the lines you inserted into the provider script to raise the Exception 😄 Now we want to re-run the DAG but only for the affected batches. Here's my conf (you can copy/paste the list of query_params that resulted in errors from the logs of the failed run for convenience):

[Screenshot: example DagRun conf setting query_params_list to the failed batches]
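For reference (in case the screenshot is unavailable), with my settings the conf was approximately the following, listing the query_params for the batches that previously errored (skip values 200, 500, and 700):

{
    "query_params_list": [
        {"cc": "1", "has_image": "1", "limit": 100, "skip": 200},
        {"cc": "1", "has_image": "1", "limit": 100, "skip": 500},
        {"cc": "1", "has_image": "1", "limit": 100, "skip": 700}
    ]
}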

  • Re-run the DAG with the query_params_list and check the logs to make sure that get_batch was only run for those params!

  • Finally, run a provider_workflow that uses the ProviderDataIngester and which has not been configured to throw errors (e.g. Science Museum). Make sure it still works/passes.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository labels Aug 6, 2022
@stacimc stacimc self-assigned this Aug 6, 2022
@stacimc stacimc requested review from AetherUnbound and obulat August 9, 2022 23:45
@stacimc stacimc marked this pull request as ready for review August 9, 2022 23:45
@stacimc stacimc requested a review from a team as a code owner August 9, 2022 23:45

stacimc commented Aug 9, 2022

@AetherUnbound @obulat I'm putting this into review for early feedback, but I'm especially curious for your thoughts on the overall approach and on my question about catching errors for individual records versus batches.


@obulat obulat left a comment


This is a really interesting solution to a very difficult problem, @stacimc ! I really like it.

Because errors are caught within ingest_records, this means batches are skipped as opposed to individual records. We could fine tune this by only (or additionally) wrapping get_record_data in the try/except. What do you think?

I think using batches is the best option because a batch uses the same query parameters, so we probably wouldn't gain much by retrying individual records: we would still make a lot of requests anyway.

We could also add a different mode to the ProviderDataIngester for retrying the batches that raised errors. It could accept a list of next_query_parameters and would not attempt to build new queries; it would only go through the URLs for the failed batches. What do you think, @stacimc?


stacimc commented Aug 11, 2022

@obulat That's a great idea! Definitely feels useful to be able to re-run just the failed batches after a bug fix for example. I've added a query_params_list configuration option that should do just that.

f"The following errors were encountered during ingestion:\n{errors_str}"
)

def get_query_params(
Contributor Author

I was attempting to avoid renaming the current get_next_query_params method, but I'm not happy with this naming. Any suggestions?

Contributor

Maybe determine_query_params?


stacimc commented Aug 12, 2022

This is going to take a little bit of thought to rebase with #645. I'll work on that today.

@stacimc stacimc force-pushed the feature/add-options-to-skip-ingestion-errors branch from 00a53d2 to 410e8ab on August 12, 2022 21:04

@AetherUnbound AetherUnbound left a comment


This is fantastic! Way more thorough than I was initially thinking for this issue, which is excellent. This gives us a lot of fine-grained control for reruns, which is exactly what we need 🤩

I have a few comments; my biggest concern is conformity with the changes to the main loop in #645.

finally:
total = self.commit_records()
logger.info(f"Committed {total} records")
except Exception as error:
Contributor

Based on the changes in #645, I think we'll need to be careful about bare Exception catches. It looks like if skip_ingestion_errors is True, then execution will continue. This breaks the case where the task is stopped by Airflow. I think we can catch that specific case by handling AirflowException differently and immediately re-raising that though!
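For example, something shaped like this (a sketch of the exception ordering only, not the actual diff):

from airflow.exceptions import AirflowException

def process_batch_safely(process_batch, skip_ingestion_errors, caught_errors):
    # Illustrative helper, not the PR's real code: AirflowException is handled first.
    try:
        process_batch()
    except AirflowException:
        # Re-raise immediately so task stops initiated by Airflow are never swallowed.
        raise
    except Exception as error:
        if not skip_ingestion_errors:
            raise
        caught_errors.append(error)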

Contributor Author

Extremely good point. I think we'll also want to make sure we raise any ingestion errors that have previously been skipped before we encounter this.

What do you think about making skip_ingestion_errors a list of Exception types to skip, rather than a simple boolean?


@AetherUnbound AetherUnbound Aug 15, 2022


I like it! The only trouble would be matching on the exceptions, since we have to go from a string to an exception type 😅 I guess we could just do if type(exception).__name__ in skip_ingestion_errors? I think as long as we're handling the AirflowException case (and that's the right case for Airflow stuff in general), then we could add that down the line 🙂


@AetherUnbound AetherUnbound left a comment


This is so close! I just have some thoughts about the final exception and what gets logged.

errors_str = "\n".join(
e.repr_with_traceback() for e in self.ingestion_errors
)
return AirflowException(
Contributor

Thinking on this again, should this maybe be a new exception type, like AggregateIngestionError? AirflowException doesn't seem to be the right one to use 🤔

Contributor

Since this exception will also get sent to Slack, maybe it would be ideal to log the more verbose errors and then raise a single exception at the end that says something along the lines of {len(self.ingestion_errors)} ingestion errors occurred while running?

Contributor Author

I like AggregateIngestionError (I had considered making a custom error before and couldn't come up with a name 😅 ). Totally agreed on the error logging -- I was under the impression the Slack messages would truncate but it doesn't look like that's the case, thanks for testing!

Updated to only log the verbose errors, while throwing an AggregateIngestionError.
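Roughly, the final shape is something like the following sketch (names and exact wording approximate):

import logging

logger = logging.getLogger(__name__)


class AggregateIngestionError(Exception):
    # Raised at the end of ingestion when individual batch errors were skipped.
    pass


def report_ingestion_errors(ingestion_errors):
    # Log the verbose details (including query_params) for copy/pasting into a DagRun
    # conf, then raise a short summary so the Slack notification stays readable.
    errors_str = "\n".join(repr(e) for e in ingestion_errors)
    logger.error(f"The following errors were encountered during ingestion:\n{errors_str}")
    raise AggregateIngestionError(
        f"{len(ingestion_errors)} ingestion errors occurred while running."
    )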


@AetherUnbound AetherUnbound left a comment


I tweaked the aggregate exception message, hope that's okay! This looks fantastic and works great. I'm very excited for us to be able to employ this going forward 🚀 ⚡ 🔧

@zackkrida zackkrida requested a review from obulat August 16, 2022 15:09

@obulat obulat left a comment


I love how well this worked together with #645, and the evolution of the error messages in the logs was really entertaining :)
Great improvement; hopefully we'll have a more efficient ingestion process with fewer requests to the providers.
