This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

Ability to Restart Graph From Failure #574

Closed
pattisdr opened this issue May 27, 2022 · 0 comments · Fixed by #578

@pattisdr
Contributor

What

If a particular collection fails during an access or an erasure request, raise an exception and cancel other tasks in the graph. Allow the privacy request to be restarted from the failure point.

Why

We currently retry a failed collection a specified number of times. If the collection continues to fail after a certain number of retries, we continue with the graph execution. We assume that the failed collection didn't return data, and that no data was masked. Downstream collections still run.

This is a problem both for data in the failed collection and for downstream collections whose data potentially wasn't retrieved or masked. Running another privacy request may not rectify the issue, because the data we need in order to reach the collection in question may already have been destroyed.

We will still keep the retry, in case the failure was just temporary, but stopping execution entirely allows the user to go correct something on their end before resuming.
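
The proposed behavior can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `CollectionFailed`, `run_with_retries`, and `run_graph` are hypothetical names, and the graph is simplified to an ordered mapping of collection name to callable.

```python
import time
from typing import Callable, Dict, List

Rows = List[dict]


class CollectionFailed(Exception):
    """Raised once a collection has used up its retries (hypothetical name)."""

    def __init__(self, collection: str, cause: Exception):
        super().__init__(f"Collection '{collection}' failed: {cause}")
        self.collection = collection


def run_with_retries(
    collection: str,
    task: Callable[[], Rows],
    max_retries: int = 3,
    delay_seconds: float = 0.0,
) -> Rows:
    """Try once, retry up to `max_retries` more times, then raise."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception as exc:
            if attempt >= max_retries:
                # Retries exhausted: fail loudly instead of returning no rows.
                raise CollectionFailed(collection, exc) from exc
            attempt += 1
            time.sleep(delay_seconds)


def run_graph(tasks: Dict[str, Callable[[], Rows]], max_retries: int = 3) -> Dict[str, Rows]:
    """Run collections in insertion order; a hard failure stops all later ones."""
    results: Dict[str, Rows] = {}
    for collection, task in tasks.items():
        results[collection] = run_with_retries(collection, task, max_retries)
    return results
```

A temporary failure is still absorbed by the retry loop, but once the retries run out the exception propagates and no downstream collection runs.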

@pattisdr pattisdr added the enhancement New feature or request label May 27, 2022
@pattisdr pattisdr self-assigned this May 27, 2022
@pattisdr pattisdr linked a pull request May 31, 2022 that will close this issue
10 tasks
seanpreston pushed a commit that referenced this issue Jun 7, 2022
* WIP Allow restart graph from failure.

- After retries have expired, throw an exception, cancelling remaining tasks, instead of continuing the graph execution.
- On failure, cache the failed step (access or erasure), and the failed collection.
- Add an API endpoint for resuming from failure.
- Refactor the methods used for caching the paused step and collection to share them with new methods to cache the failed step/collection.

* Add API endpoint tests for restarting from failed node. No request body is required.

* Add test that restarting from failure doesn't re-run already-executed nodes.

* Add tests for caching the failed step and collection.

* Fix imports.

* Add minor docs to guides.

* Fix retry tests. We now raise an exception after retries have been exceeded instead of continuing with execution.

* Fix items from rebase with erasure branch.

* Fix items from merge.

* Remove check if status is error because errored privacy requests will exit before we get to this point.

* Sqlalchemy bigquery upgrade experiment.

* Revert "Sqlalchemy bigquery upgrade experiment."

This reverts commit cfc2b79.

* Fix an existing BigQuery bug that was revealed after the new failure behavior was added.  We should not build a BigQuery update query if there is no data to update. Doing so incorrectly produced a query that looks like: UPDATE `address` SET WHERE  address_id = 4;

- A failure at the collection level now causes the entire PrivacyRequest to fail, instead of ignoring the failed collection after "x" retries.  The above bug was previously being ignored in the test because the collection error was being suppressed.

* Update Stripe erasure tests to only run with config.execution.MASKING_STRICT = False, so both update and delete actions can be performed.  Stripe has some endpoints whose update action is a "delete".  SaaS configs will error if there is an attempt to mask but we haven't granted permission to use delete actions.

This test shouldn't have been running with MASKING_STRICT=True, because this particular config requires False for an erasure to run successfully, as there are mixtures of updates/deletes defined.  However, existing behavior that ignored a failed collection was still causing this privacy request to complete.

* Remove the primary key from the owners collection in the HubSpot dataset, so we don't attempt a masking request on that collection. There's intentionally no update or delete configuration defined for owners right now.  This prevents us from trying to run an erasure against that collection for the time being.

(We were previously attempting to run an erasure and getting a failure that was ignored, but new execution behavior doesn't ignore failures.)
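
As a rough sketch of the resume behavior described in this commit: cache the failed step and collection on failure, and skip already-executed collections when restarting. The cache is a plain dict here, and the key format and function names are assumptions for illustration, not the project's actual API.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rows = List[dict]

# In-memory stand-in for the real cache; the key format is hypothetical.
_cache: Dict[str, Tuple[str, str]] = {}


def _failed_key(request_id: str) -> str:
    return f"FAILED_LOCATION__{request_id}"


def get_failed_location(request_id: str) -> Optional[Tuple[str, str]]:
    """Return (step, collection) for the last failure, or None."""
    return _cache.get(_failed_key(request_id))


def run_step(
    request_id: str,
    step: str,  # "access" or "erasure"
    tasks: Dict[str, Callable[[], Rows]],
    completed: Dict[str, Rows],
) -> Dict[str, Rows]:
    """Run every collection not yet in `completed`, in insertion order."""
    for collection, task in tasks.items():
        if collection in completed:
            continue  # finished on an earlier run; do not execute again
        try:
            completed[collection] = task()
        except Exception:
            # Record where we stopped so a resume call can pick up from here.
            _cache[_failed_key(request_id)] = (step, collection)
            raise
    # Clean finish: clear any failure marker left by an earlier run.
    _cache.pop(_failed_key(request_id), None)
    return completed
```

Calling `run_step` again with the same `completed` mapping plays the role of the resume endpoint in this sketch: collections that already succeeded are skipped, and execution restarts at the one that failed.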
sanders41 pushed a commit that referenced this issue Jun 9, 2022
* WIP Cache SQL queries for the manual connector for retrieve/update data.  The manual connector is probably not a SQL database, but it's a pretty readable way to surface what needs to be performed manually.  The actual format will probably change.

- On the privacy request status endpoint, surface the stopped step, stopped collection, manual queries, and resume endpoint for paused or failed privacy requests.

* Add unit tests asserting expected response for paused/failed privacy requests.

* Refactor caching details about the collection that halted privacy request execution to store all details under the same key: the step, the collection, and any action needed to resume.

- Add a ManualQueryConfig
- Stop using the SQLQueryConfig to cache queries; instead, store these in a more generic way for later flexibility.

* Get rid of elements that cache a SQLQuery from an earlier draft. We're now caching more generic components.

* Remove import of element that no longer exists.

* Add new paused_at field for when a request is paused by a webhook or a manual collection.

- Start setting finished_processing_at on errored privacy requests that fail due to a collection issue.

* Small docstring changes.

* Add changelog and docs.

* Respond to CR -

* Fix wording in docs.
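
One way to picture the "all details under the same key" refactor and the extra status fields is the sketch below. Everything named here is an assumption made for illustration: the `StoppedCollectionDetails` class, the cache key format, the response field names, the status strings, and the resume path are placeholders rather than the project's real schema or routes.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StoppedCollectionDetails:
    """Everything needed to resume, stored together (hypothetical shape)."""

    step: str  # "access" or "erasure"
    collection: str
    action_needed: Optional[List[Dict[str, Any]]] = None  # e.g. manual work to do


def _key(request_id: str) -> str:
    return f"STOPPED_COLLECTION__{request_id}"  # hypothetical key format


def cache_stopped_details(
    cache: Dict[str, str], request_id: str, details: StoppedCollectionDetails
) -> None:
    """Serialize step, collection, and required action under a single key."""
    cache[_key(request_id)] = json.dumps(asdict(details))


def status_response(cache: Dict[str, str], request_id: str, status: str) -> Dict[str, Any]:
    """Build a status body; add stop details only for paused/errored requests."""
    body: Dict[str, Any] = {"id": request_id, "status": status}
    raw = cache.get(_key(request_id))
    if raw is not None and status in ("paused", "error"):
        details = json.loads(raw)
        body["stopped_step"] = details["step"]
        body["stopped_collection"] = details["collection"]
        body["action_needed"] = details["action_needed"]
        # Placeholder path, not the real route.
        body["resume_endpoint"] = f"/privacy-request/{request_id}/resume"
    return body
```

Keeping the step, the collection, and the required action in one serialized record means the status endpoint needs a single cache lookup and the three values cannot drift out of sync.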
sanders41 pushed a commit that referenced this issue Sep 22, 2022
sanders41 pushed a commit that referenced this issue Sep 22, 2022