Skip to content
This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

Restart Graph from Failure [#574] #578

Merged
merged 18 commits into from
Jun 7, 2022

Conversation

pattisdr
Copy link
Contributor

@pattisdr pattisdr commented May 28, 2022

Purpose

We currently can't rerun a graph from a failed node, our only choice is to attempt to run a completely new privacy request.

If a node in the graph fails a certain number of times, we continue with graph execution, just passing in empty values downstream to dependent nodes. This is problematic for a couple of reasons: we may not have properly retrieved/masked data on the failed collection, or downstream collections. Second, if we want to run another privacy request to rectify this, we may no longer be able to execute the graph because data has been destroyed.

Changes

  • Raise an exception when a graph node fails (instead of ignoring after "x" retries), and cancel remaining graph tasks. Sending a POST request to /privacy-request/{privacy_request_id}/retry will restart the graph, only running remaining graph tasks.
  • Fix existing issues with Bigquery/Hubspot/Stripe either in the code or the tests, whose collection failures were previously failing silently. Now that a collection failure causes the entire privacy request to fail, more errors are being raised than there were previously.

Note

Checklist

  • Update CHANGELOG.md file
    • Merge in main so the most recent CHANGELOG.md file is being appended to
    • Add description within the Unreleased section in an appropriate category. Add a new category from the list at the top of the file if the needed one isn't already there.
    • Add a link to this PR at the end of the description with the PR number as the text. example: #1
  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram.
  • If docs updated (select one):
    • documentation complete, or draft/outline provided (tag docs-team to complete/review on this branch)
    • documentation issue created (tag docs-team to complete issue separately)
  • Good unit test/integration test coverage
  • This PR contains a DB migration. If checked, the reviewer should confirm with the author that the down_revision correctly references the previous migration before merging
  • The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

Ticket

Fixes #574

@pattisdr pattisdr linked an issue May 31, 2022 that may be closed by this pull request
@pattisdr pattisdr marked this pull request as ready for review May 31, 2022 14:06
Base automatically changed from fidesops_522_pause_erasure to main June 1, 2022 13:34
@pattisdr pattisdr force-pushed the fidesops_574_restart_from_failure branch from 4ad69cd to 25404fc Compare June 1, 2022 14:03
pattisdr added 8 commits June 1, 2022 09:07
- After retries have expired, throw an exception, cancelling remaining tasks, instead of continuing the graph execution.
- On failure, cache the failed step (access or erasure), and the failed collection.
- Add an API endpoint for resuming from failure.
- Refactor the methods used for caching the paused step and collection to share them with new methods to cache the failed step/collection,
…ceeded instead of continuing with execution.
@pattisdr pattisdr force-pushed the fidesops_574_restart_from_failure branch from 25404fc to 1ffd401 Compare June 1, 2022 14:09
@pattisdr
Copy link
Contributor Author

pattisdr commented Jun 1, 2022

@ethyca/docs-authors minor edit to guide added here

@seanpreston seanpreston self-assigned this Jun 1, 2022
@seanpreston
Copy link
Contributor

Thanks @pattisdr, I just have that one product level question re: the webhooks

CHANGELOG.md Show resolved Hide resolved
@pattisdr pattisdr added the run unsafe ci checks Triggers running of unsafe CI checks label Jun 3, 2022
@pattisdr pattisdr added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Jun 3, 2022
@pattisdr pattisdr added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Jun 3, 2022
@pattisdr pattisdr mentioned this pull request Jun 3, 2022
10 tasks
@pattisdr pattisdr marked this pull request as draft June 3, 2022 20:12
pattisdr added 2 commits June 6, 2022 11:38
…behavior was added. We should not build a bigquery update query if there is no data to update- this was incorrectly causing a query to be built that looks like: UPDATE `address` SET WHERE address_id = 4;

- A failure at the collection level now causes the entire PrivacyRequest to fail, instead of ignoring the failed collection after "x" retries.  The above bug was previously being ignored in the test because the collection error was being suppressed.
@pattisdr pattisdr added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Jun 6, 2022
pattisdr added 2 commits June 6, 2022 14:23
…_STRICT = False, so both update and delete actions can be performed. Stripe has some endpoints whose update action is a "delete". Saas configs will error if there is an attempt to mask but we haven't granted permission to use delete actions.

This test shouldn't have been running with MASKING_STRICT=True, because this particular config requires False for an erasure to run successfully, as there are mixtures of updates/deletes defined.  However, existing behavior that ignored a failed collection was still causing this privacy request to complete.
…attempt a masking request on that collection. There's intentionally no update or delete configuration defined for owners right now. This prevents us from trying to run an erasure against that collection for the time being.

(We were previously attempting to run an erasure and getting a failure that was ignored, but new execution behavior doesn't ignore failures.)
@pattisdr pattisdr added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Jun 6, 2022
Comment on lines -1074 to 1076
# run erasure with MASKING_STRICT to execute the update actions

config.execution.MASKING_STRICT = True
# Run erasure with masking_strict = False so both update and delete actions can be used
config.execution.MASKING_STRICT = False

Copy link
Contributor Author

@pattisdr pattisdr Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripe tests were being run with config.execution.MASKING_STRICT = True which is invalid for Stripe, because there are both updates and deletes defined in the config. Nodes with delete-only configs were failing silently and then Stripe tests were re-run with config.execution.MASKING_STRICT = False below to get delete-specific behavior.

Because a collection no longer fails silently, we were seeing failures here. To address, we can just run a single erasure request with config.execution.MASKING_STRICT = False so stripe can use the update if defined, otherwise it uses the delete. The counts below have been updated to reflect this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stripe tests were incorrectly being run with config.execution.MASKING_STRICT = True which is invalid for Stripe, because there are both updates and deletes defined in the config. Nodes with delete-only configs were failing silently and then Stripe tests were re-run with config.execution.MASKING_STRICT = False below to get delete-specific behavior.

That's a really good catch @pattisdr — thanks

@@ -91,7 +91,6 @@ dataset:
- name: id
data_categories: [user.derived.identifiable.unique_id]
fidesops_meta:
primary_key: True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently shouldn't attempt to run an erasure against hubspot owners' endpoint. #361. Our tests were attempting to run this and failing silently.

This PR doesn't allow it to fail silently anymore, so this adjustment prevents us from running an erasure against hubspot owners' until we can sort out how to connect to that endpoint.

@pattisdr pattisdr marked this pull request as ready for review June 6, 2022 19:48

config.execution.MASKING_STRICT = True
# Run erasure with masking_strict = False so both update and delete actions can be used
config.execution.MASKING_STRICT = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we wouldn't set this in the test since if execution halts mid-test the value won't be reset and could cause cascade failures within other tests. Looks like you're only updating the value here so let's change these as part of a subsequent ticket.

Copy link
Contributor

@seanpreston seanpreston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pattisdr

@seanpreston seanpreston merged commit abf0d90 into main Jun 7, 2022
@seanpreston seanpreston deleted the fidesops_574_restart_from_failure branch June 7, 2022 15:42
sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* WIP Allow restart graph from failure.

- After retries have expired, throw an exception, cancelling remaining tasks, instead of continuing the graph execution.
- On failure, cache the failed step (access or erasure), and the failed collection.
- Add an API endpoint for resuming from failure.
- Refactor the methods used for caching the paused step and collection to share them with new methods to cache the failed step/collection,

* Add API endpoint tests for restarting from failed node. No request body is required.

* Add test that restarting from failure doesn't re-run already-executed nodes.

* Add tests for caching the failed step and collection.

* Fix imports.

* Add minor docs to guides.

* Fix retry tests. We now raise an exception after retries have been exceeded instead of continuing with execution.

* Fix items from rebase with erasure branch.

* Fix items from merge.

* Remove check if status is error because errored privacy requests will exit before we get to this point.

* Sqlalchemy bigquery upgrade experiment.

* Revert "Sqlalchemy bigquery upgrade experiment."

This reverts commit cfc2b79.

* Fix an existing bigquery bug that was revealed after the new failure behavior was added.  We should not build a bigquery update query if there is no data to update- this was incorrectly causing a query to be built that looks like: UPDATE `address` SET WHERE  address_id = 4;

- A failure at the collection level now causes the entire PrivacyRequest to fail, instead of ignoring the failed collection after "x" retries.  The above bug was previously being ignored in the test because the collection error was being suppressed.

* Update stripe erasure tests to only run with config.execution.MASKING_STRICT = False, so both update and delete actions can be performed.  Stripe has some endpoints whose update action is a "delete".  Saas configs will error if there is an attempt to mask but we haven't granted permission to use delete actions.

This test shouldn't have been running with MASKING_STRICT=True, because this particular config requires False for an erasure to run successfully, as there are mixtures of updates/deletes defined.  However, existing behavior that ignored a failed collection was still causing this privacy request to complete.

* Remove the primary key off of hubspot's owners' dataset, so we don't attempt a masking request on that collection. There's intentionally no update or delete configuration defined for owners right now.  This prevents us from trying to run an erasure against that collection for the time being.

(We were previously attempting to run an erasure and getting a failure that was ignored, but new execution behavior doesn't ignore failures.)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
run unsafe ci checks Triggers running of unsafe CI checks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to Restart Graph From Failure
3 participants