Ability to Restart Graph From Failure #574

pattisdr · 2022-05-27T17:49:27Z

What

If a particular collection fails during an access or an erasure request, raise an exception and cancel other tasks in the graph. Allow the privacy request to be restarted from the failure point.

Why

We currently retry a failed collection a specified number of times. If the collection continues to fail after a certain number of retries, we continue with the graph execution. We assume that the failed collection didn't return data, and that no data was masked. Downstream collections still run.

This is a problem both for data in the failed collection and for downstream collections whose data may not have been retrieved or masked. Running another privacy request may not rectify the issue, because the data we need to reach the collection in question may already have been destroyed.

We will still keep the retry, in case the failure was just temporary, but stopping execution entirely allows the user to go correct something on their end before resuming.
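The proposed behavior can be sketched roughly as follows. This is a minimal illustration only, not the project's implementation: `GraphRun`, `CollectionFailed`, and the in-memory `failed_at` cache are hypothetical names, and the graph is simplified to an ordered list of collections.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


class CollectionFailed(Exception):
    """Hypothetical exception raised once a collection exhausts its retries."""

    def __init__(self, step: str, collection: str) -> None:
        super().__init__(f"{step} step failed on collection '{collection}'")
        self.step = step
        self.collection = collection


class GraphRun:
    """Tracks one privacy request's progress through an ordered list of collections."""

    def __init__(
        self,
        tasks: Dict[str, Callable[[], None]],
        order: List[str],
        max_retries: int = 2,
    ) -> None:
        self.tasks = tasks
        self.order = order
        self.max_retries = max_retries
        self.completed: Set[str] = set()
        # Cached (step, collection) of the last failure; stands in for a real cache.
        self.failed_at: Optional[Tuple[str, str]] = None

    def run(self, step: str) -> None:
        """Run collections in order; stop the whole graph once retries are exhausted."""
        self.failed_at = None
        for name in self.order:
            if name in self.completed:
                continue  # finished in an earlier run, so do not execute it again
            for attempt in range(self.max_retries + 1):
                try:
                    self.tasks[name]()
                except Exception as exc:
                    if attempt == self.max_retries:
                        self.failed_at = (step, name)  # remember where to resume
                        raise CollectionFailed(step, name) from exc
                else:
                    self.completed.add(name)
                    break

    def resume(self) -> None:
        """Restart from the cached failure point, skipping completed collections."""
        if self.failed_at is None:
            raise ValueError("nothing to resume: no cached failure")
        step, _collection = self.failed_at
        self.run(step)
```

Raising instead of continuing means downstream collections never run against partial data, and the cached step and collection give the resume call enough information to pick up where execution stopped.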

* WIP Allow restart graph from failure.
  - After retries have expired, throw an exception, cancelling remaining tasks, instead of continuing the graph execution.
  - On failure, cache the failed step (access or erasure), and the failed collection.
  - Add an API endpoint for resuming from failure.
  - Refactor the methods used for caching the paused step and collection to share them with new methods to cache the failed step/collection.
* Add API endpoint tests for restarting from failed node. No request body is required.
* Add test that restarting from failure doesn't re-run already-executed nodes.
* Add tests for caching the failed step and collection.
* Fix imports.
* Add minor docs to guides.
* Fix retry tests. We now raise an exception after retries have been exceeded instead of continuing with execution.
* Fix items from rebase with erasure branch.
* Fix items from merge.
* Remove check if status is error because errored privacy requests will exit before we get to this point.
* Sqlalchemy bigquery upgrade experiment.
* Revert "Sqlalchemy bigquery upgrade experiment." This reverts commit cfc2b79.
* Fix an existing bigquery bug that was revealed after the new failure behavior was added. We should not build a bigquery update query if there is no data to update; this was incorrectly causing a query to be built that looks like: UPDATE `address` SET WHERE address_id = 4;
  - A failure at the collection level now causes the entire PrivacyRequest to fail, instead of ignoring the failed collection after "x" retries. The above bug was previously being ignored in the test because the collection error was being suppressed.
* Update stripe erasure tests to only run with config.execution.MASKING_STRICT = False, so both update and delete actions can be performed. Stripe has some endpoints whose update action is a "delete". Saas configs will error if there is an attempt to mask but we haven't granted permission to use delete actions. This test shouldn't have been running with MASKING_STRICT=True, because this particular config requires False for an erasure to run successfully, as there are mixtures of updates/deletes defined. However, existing behavior that ignored a failed collection was still causing this privacy request to complete.
* Remove the primary key off of hubspot's owners dataset, so we don't attempt a masking request on that collection. There's intentionally no update or delete configuration defined for owners right now. This prevents us from trying to run an erasure against that collection for the time being. (We were previously attempting to run an erasure and getting a failure that was ignored, but new execution behavior doesn't ignore failures.)
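The bigquery fix described above amounts to a guard before query construction. A minimal sketch follows, assuming a hypothetical `build_update` helper that produces a parameterized statement string; the actual fix lives in the project's query config code and is not reproduced here.

```python
from typing import Any, Dict, Optional


def build_update(
    table: str, updates: Dict[str, Any], pk_field: str
) -> Optional[str]:
    """Return a parameterized UPDATE statement, or None if nothing needs masking.

    Hypothetical helper: without the guard, an empty `updates` dict would
    yield the invalid statement "UPDATE `address` SET WHERE address_id = ...".
    """
    if not updates:
        return None  # no data to update, so do not build a query at all
    set_clause = ", ".join(f"{column} = :{column}" for column in updates)
    return f"UPDATE `{table}` SET {set_clause} WHERE {pk_field} = :{pk_field}"
```

A caller would skip execution for that row when the helper returns `None`, rather than sending a malformed statement that now fails the whole privacy request.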

* WIP Cache SQL queries for the manual connector for retrieve/update data. The manual connector is probably not a SQL database, but it's a pretty readable way to surface what needs to be performed manually. The actual format will probably change.
  - To the privacy request status endpoint, surface the stopped step, stopped collection, manual queries, and resume endpoint for paused or failed privacy requests.
* Add unit tests asserting expected response for paused/failed privacy requests.
* Refactor caching details about the collection that halted privacy request execution to store all details under the same key: the step, the collection, and any action needed to resume.
  - Add a ManualQueryConfig.
  - Get rid of using the SQLQueryConfig to cache queries, instead opt to store these in a more generic way for later flexibility.
* Get rid of elements that cache a SQLQuery from an earlier draft. We're now caching more generic components.
* Remove import of element that no longer exists.
* Add new paused_at field for when a request is paused by a webhook or a manual collection.
  - Start setting finished_processing_at on errored privacy requests that fail due to a collection issue.
* Small docstring changes.
* Add changelog and docs.
* Respond to CR.
* Fix wording in docs.
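Storing the step, the collection, and any action needed to resume under a single key could look something like the sketch below. The key format, the function names, and the plain dict standing in for the cache are all assumptions made for illustration, not the project's actual schema.

```python
import json
from typing import Any, Dict, List, Optional


def halted_key(request_id: str, status: str) -> str:
    """Build one cache key per privacy request and halt type (hypothetical format)."""
    return f"{status.upper()}_LOCATION__{request_id}"


def cache_halted_details(
    cache: Dict[str, str],
    request_id: str,
    status: str,  # "paused" or "failed"
    step: str,  # "access" or "erasure"
    collection: str,
    action_needed: Optional[List[Dict[str, Any]]] = None,
) -> None:
    """Store everything needed to resume under a single key as one JSON blob."""
    cache[halted_key(request_id, status)] = json.dumps(
        {"step": step, "collection": collection, "action_needed": action_needed}
    )


def get_halted_details(
    cache: Dict[str, str], request_id: str, status: str
) -> Optional[Dict[str, Any]]:
    """Return the cached details, or None if the request never halted this way."""
    raw = cache.get(halted_key(request_id, status))
    return json.loads(raw) if raw is not None else None
```

Keeping the three pieces in one value avoids partially written state, and the same pair of helpers can serve both paused and failed requests, which is the sharing the refactor describes.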


pattisdr added the enhancement New feature or request label May 27, 2022

pattisdr self-assigned this May 27, 2022

pattisdr mentioned this issue May 27, 2022

Ability to replay requests (Access and Erasure) #523

Open

pattisdr mentioned this issue May 28, 2022

Restart Graph from Failure [#574] #578

Merged

10 tasks

pattisdr linked a pull request May 31, 2022 that will close this issue

Restart Graph from Failure [#574] #578

Merged

10 tasks

seanpreston closed this as completed in #578 Jun 7, 2022
