Restart Graph from Failure [#574] #578

pattisdr · 2022-05-28T22:23:29Z

Purpose

We currently can't rerun a graph from a failed node; our only choice is to run a completely new privacy request.

If a node in the graph fails a certain number of times, we continue with graph execution and pass empty values downstream to dependent nodes. This is problematic for two reasons. First, we may not have properly retrieved or masked data on the failed collection or on downstream collections. Second, if we want to run another privacy request to rectify this, we may no longer be able to execute the graph because data has been destroyed.

Changes

Raise an exception when a graph node fails (instead of ignoring after "x" retries), and cancel remaining graph tasks. Sending a POST request to /privacy-request/{privacy_request_id}/retry will restart the graph, only running remaining graph tasks.
Fix existing issues with BigQuery, HubSpot, and Stripe, either in the code or in the tests, where collection failures were previously being ignored silently. Because a collection failure now causes the entire privacy request to fail, more errors are raised than before.
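For illustration, a client-side call to the new retry endpoint might be built as below. This is a minimal sketch: the host, port, and privacy request id are made up, and authentication headers are left out even though a real deployment would need them. Only the path and the fact that no request body is required come from this PR.

```python
import urllib.request


def build_retry_request(base_url: str, privacy_request_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST that restarts a failed privacy request."""
    url = f"{base_url.rstrip('/')}/privacy-request/{privacy_request_id}/retry"
    # No request body is required. Authentication headers are omitted in this sketch.
    return urllib.request.Request(url, method="POST")


# Hypothetical host and id, for illustration only.
request = build_retry_request("http://localhost:8080", "pri_123")
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would send it once the appropriate credentials have been added.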

Note

This doesn't yet expose to the frontend which node is the "failed" node; that will come in the follow-up "Surface to user how to Pause/Resume Privacy Request" (#570).
This still preserves the "retry" behavior, except that after "x" retries we re-raise an exception and exit instead of continuing on.
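The retry-then-raise behavior described in these notes can be sketched roughly as follows. The names `CollectionFailed` and `run_with_retries` are hypothetical and are not the fidesops implementation; backoff between attempts is left out.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class CollectionFailed(Exception):
    """Hypothetical error raised once a node has used up its retries."""


def run_with_retries(task: Callable[[], T], max_retries: int) -> T:
    """Run task, retrying on failure, and re-raise once retries are exhausted."""
    attempts = max(1, max_retries + 1)
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == attempts:
                # Previously execution continued with empty values; now it halts.
                raise CollectionFailed(f"gave up after {attempts} attempts") from exc
    raise AssertionError("unreachable")


calls: List[int] = []


def flaky_node() -> list:
    """Stand-in for a collection whose datastore is always unavailable."""
    calls.append(1)
    raise ConnectionError("datastore unavailable")


try:
    run_with_retries(flaky_node, max_retries=2)
    outcome = "continued"
except CollectionFailed:
    outcome = "halted"
```

With `max_retries=2` the node is attempted three times, and the caller then sees the exception rather than an empty result.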

Checklist

Ticket

Fixes #574

- After retries have expired, throw an exception, cancelling remaining tasks, instead of continuing the graph execution.
- On failure, cache the failed step (access or erasure), and the failed collection.
- Add an API endpoint for resuming from failure.
- Refactor the methods used for caching the paused step and collection to share them with new methods to cache the failed step/collection.
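As a rough model of the items above, the sketch below caches the failed step and collection and, on a restart, skips nodes that already ran. It is a simplification under loud assumptions: the graph is flattened to an ordered list, and the in-memory dictionaries `failed_cache` and `completed` stand in for whatever cache the service really uses. None of these names come from the fidesops code.

```python
from typing import Callable, Dict, List, Set, Tuple

failed_cache: Dict[str, Tuple[str, str]] = {}  # request id -> (step, collection)
completed: Dict[str, Set[str]] = {}            # request id -> collections already run


def run_graph(request_id: str, step: str, collections: List[str],
              run_node: Callable[[str], None]) -> None:
    """Run collections in order, recording where a failure happened."""
    done = completed.setdefault(request_id, set())
    for collection in collections:
        if collection in done:
            continue  # already executed before the failure, so do not re-run
        try:
            run_node(collection)
        except Exception:
            failed_cache[request_id] = (step, collection)
            raise  # cancel the remaining tasks
        done.add(collection)
    failed_cache.pop(request_id, None)


executed: List[str] = []
broken = {"orders"}


def run_node(collection: str) -> None:
    """Stand-in node runner that fails for collections listed in broken."""
    if collection in broken:
        raise RuntimeError(f"{collection} failed")
    executed.append(collection)


order = ["customers", "orders", "payments"]
try:
    run_graph("req-1", "access", order, run_node)
except RuntimeError:
    pass
failed_at = failed_cache["req-1"]

broken.clear()                                  # the underlying problem gets fixed
run_graph("req-1", "access", order, run_node)   # restart from the failure
```

After the restart, `customers` appears in `executed` only once, which is the property the "doesn't re-run already-executed nodes" test is meant to check.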

src/fidesops/models/privacy_request.py

pattisdr · 2022-06-01T14:34:44Z

@ethyca/docs-authors minor edit to guide added here

src/fidesops/service/privacy_request/request_runner_service.py

seanpreston · 2022-06-02T04:17:16Z

Thanks @pattisdr, I just have that one product level question re: the webhooks

CHANGELOG.md

src/fidesops/service/connectors/query_config.py

pattisdr · 2022-06-06T19:33:18Z

tests/integration_tests/saas/test_stripe_task.py

-    # run erasure with MASKING_STRICT to execute the update actions
-
-    config.execution.MASKING_STRICT = True
+    # Run erasure with masking_strict = False so both update and delete actions can be used
+    config.execution.MASKING_STRICT = False



Stripe tests were incorrectly being run with config.execution.MASKING_STRICT = True, which is invalid for Stripe because there are both updates and deletes defined in the config. Nodes with delete-only configs were failing silently, and the Stripe tests were then re-run with config.execution.MASKING_STRICT = False below to get delete-specific behavior.

Because a collection no longer fails silently, we were seeing failures here. To address this, we can run a single erasure request with config.execution.MASKING_STRICT = False so Stripe uses the update action if one is defined and the delete action otherwise. The counts below have been updated to reflect this.
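To make the update/delete choice concrete, here is a simplified reading of the comment above, not the fidesops SaaS connector code. The function name, the exception, and the request strings are all hypothetical.

```python
from typing import Optional


class MaskingNotPermitted(Exception):
    """Hypothetical error for a collection that cannot be masked under the current setting."""


def choose_masking_action(update: Optional[str], delete: Optional[str],
                          masking_strict: bool) -> str:
    """Prefer the update action; fall back to delete only when masking_strict is False."""
    if update is not None:
        return update
    if delete is not None and not masking_strict:
        return delete
    raise MaskingNotPermitted("no usable masking action for this collection")


# Hypothetical request descriptions, for illustration only.
with_update = choose_masking_action("PUT /customers", "DELETE /customers", masking_strict=True)
delete_only = choose_masking_action(None, "DELETE /cards", masking_strict=False)
```

Under this reading, a delete-only collection raises when `masking_strict` is True, which matches the failures that used to be ignored in these tests.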

That's a really good catch @pattisdr — thanks

pattisdr · 2022-06-06T19:34:30Z

data/saas/dataset/hubspot_dataset.yml

@@ -91,7 +91,6 @@ dataset:
          - name: id
            data_categories: [user.derived.identifiable.unique_id]
            fidesops_meta:
-              primary_key: True


We currently shouldn't attempt to run an erasure against the HubSpot owners endpoint (#361). Our tests were attempting to run this and failing silently.

This PR doesn't allow it to fail silently anymore, so this adjustment prevents us from running an erasure against the HubSpot owners collection until we can sort out how to connect to that endpoint.

seanpreston · 2022-06-07T15:39:53Z

tests/integration_tests/saas/test_stripe_task.py

-
-    config.execution.MASKING_STRICT = True
+    # Run erasure with masking_strict = False so both update and delete actions can be used
+    config.execution.MASKING_STRICT = False


Ideally we wouldn't set this in the test: if execution halts mid-test, the value won't be reset and could cause cascading failures in other tests. It looks like you're only updating the value here, so let's change these as part of a subsequent ticket.
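One way to address this in a follow-up is to scope the override so the original value is always restored, even when the test body raises. The sketch below uses a stand-in `config` object built with `SimpleNamespace` rather than the real fidesops config, and the helper name `masking_strict` is hypothetical.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for the real config object, for illustration only.
config = SimpleNamespace(execution=SimpleNamespace(MASKING_STRICT=True))


@contextmanager
def masking_strict(value: bool):
    """Temporarily override MASKING_STRICT and restore it on exit."""
    original = config.execution.MASKING_STRICT
    config.execution.MASKING_STRICT = value
    try:
        yield
    finally:
        # Runs even if the body raises, so later tests see the original value.
        config.execution.MASKING_STRICT = original


try:
    with masking_strict(False):
        seen_inside = config.execution.MASKING_STRICT
        raise RuntimeError("test halted midway")
except RuntimeError:
    pass

restored = config.execution.MASKING_STRICT
```

In a pytest suite, the built-in `monkeypatch` fixture (`monkeypatch.setattr(config.execution, "MASKING_STRICT", False)`) would give the same automatic restore.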

seanpreston

Thanks @pattisdr

* WIP Allow restart graph from failure.
  - After retries have expired, throw an exception, cancelling remaining tasks, instead of continuing the graph execution.
  - On failure, cache the failed step (access or erasure), and the failed collection.
  - Add an API endpoint for resuming from failure.
  - Refactor the methods used for caching the paused step and collection to share them with new methods to cache the failed step/collection.
* Add API endpoint tests for restarting from failed node. No request body is required.
* Add test that restarting from failure doesn't re-run already-executed nodes.
* Add tests for caching the failed step and collection.
* Fix imports.
* Add minor docs to guides.
* Fix retry tests. We now raise an exception after retries have been exceeded instead of continuing with execution.
* Fix items from rebase with erasure branch.
* Fix items from merge.
* Remove check if status is error because errored privacy requests will exit before we get to this point.
* Sqlalchemy bigquery upgrade experiment.
* Revert "Sqlalchemy bigquery upgrade experiment." This reverts commit cfc2b79.
* Fix an existing bigquery bug that was revealed after the new failure behavior was added. We should not build a bigquery update query if there is no data to update - this was incorrectly causing a query to be built that looks like: UPDATE `address` SET WHERE address_id = 4;
  - A failure at the collection level now causes the entire PrivacyRequest to fail, instead of ignoring the failed collection after "x" retries. The above bug was previously being ignored in the test because the collection error was being suppressed.
* Update stripe erasure tests to only run with config.execution.MASKING_STRICT = False, so both update and delete actions can be performed. Stripe has some endpoints whose update action is a "delete". Saas configs will error if there is an attempt to mask but we haven't granted permission to use delete actions. This test shouldn't have been running with MASKING_STRICT=True, because this particular config requires False for an erasure to run successfully, as there are mixtures of updates/deletes defined. However, existing behavior that ignored a failed collection was still causing this privacy request to complete.
* Remove the primary key off of hubspot's owners' dataset, so we don't attempt a masking request on that collection. There's intentionally no update or delete configuration defined for owners right now. This prevents us from trying to run an erasure against that collection for the time being. (We were previously attempting to run an erasure and getting a failure that was ignored, but new execution behavior doesn't ignore failures.)
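The BigQuery fix in the commit message amounts to not building an UPDATE when there is nothing to set. A minimal sketch of that guard follows; it is not the code in query_config.py, and the function name and the use of named bind parameters are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def build_update_statement(table: str, update_values: Dict[str, Any],
                           pk_field: str) -> Optional[str]:
    """Return a parameterized UPDATE, or None when there is nothing to update."""
    if not update_values:
        # Without this guard the statement would read: UPDATE `address` SET WHERE ...
        return None
    set_clause = ", ".join(f"{column} = :{column}" for column in update_values)
    return f"UPDATE `{table}` SET {set_clause} WHERE {pk_field} = :{pk_field}"


nothing_to_mask = build_update_statement("address", {}, "address_id")
mask_city = build_update_statement("address", {"city": None}, "address_id")
```

A caller would skip execution when the result is `None` instead of sending an invalid statement to the datastore.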

pattisdr linked an issue May 31, 2022 that may be closed by this pull request

Ability to Restart Graph From Failure #574

Closed

pattisdr marked this pull request as ready for review May 31, 2022 14:06

pattisdr added the DON'T MERGE label May 31, 2022

Base automatically changed from fidesops_522_pause_erasure to main June 1, 2022 13:34

pattisdr force-pushed the fidesops_574_restart_from_failure branch from 4ad69cd to 25404fc Compare June 1, 2022 14:03

pattisdr added 8 commits June 1, 2022 09:07

Add API endpoint tests for restarting from failed node. No request body is required.

2d74353

Add test that restarting from failure doesn't re-run already-executed nodes.

33fdf3f

Add tests for caching the failed step and collection.

cc917bc

Fix imports.

abe339f

Add minor docs to guides.

5f5c49b

Fix retry tests. We now raise an exception after retries have been exceeded instead of continuing with execution.

d7a7702

Fix items from rebase with erasure branch.

1ffd401

pattisdr force-pushed the fidesops_574_restart_from_failure branch from 25404fc to 1ffd401 Compare June 1, 2022 14:09

Fix items from merge.

df2220b

pattisdr commented Jun 1, 2022

src/fidesops/models/privacy_request.py

pattisdr removed the DON'T MERGE label Jun 1, 2022

seanpreston self-assigned this Jun 1, 2022

seanpreston reviewed Jun 2, 2022

src/fidesops/service/privacy_request/request_runner_service.py

pattisdr mentioned this pull request Jun 2, 2022

Cache/Surface Resume/Restart Privacy Request Details [#574] #591

Merged

10 tasks

pattisdr added 2 commits June 3, 2022 12:07

Merge branch 'main' into fidesops_574_restart_from_failure

e245a2c

# Conflicts: # CHANGELOG.md

Remove check if status is error because errored privacy requests will exit before we get to this point.

e7d8f7f

pattisdr commented Jun 3, 2022

CHANGELOG.md

pattisdr added the run unsafe ci checks label Jun 3, 2022

Sqlalchemy bigquery upgrade experiment.

cfc2b79

pattisdr added run unsafe ci checks and removed run unsafe ci checks labels Jun 3, 2022

Revert "Sqlalchemy bigquery upgrade experiment."

ce8ff4d

This reverts commit cfc2b79.

pattisdr added run unsafe ci checks and removed run unsafe ci checks labels Jun 3, 2022

pattisdr mentioned this pull request Jun 3, 2022

Main Test Failures #595

Closed

10 tasks

pattisdr added the DON'T MERGE label Jun 3, 2022

pattisdr marked this pull request as draft June 3, 2022 20:12

pattisdr added 2 commits June 6, 2022 11:38

Merge branch 'main' into fidesops_574_restart_from_failure

b958b2d

pattisdr commented Jun 6, 2022

src/fidesops/service/connectors/query_config.py

pattisdr added run unsafe ci checks and removed run unsafe ci checks labels Jun 6, 2022

conceptualshark approved these changes Jun 6, 2022

pattisdr added 2 commits June 6, 2022 14:23

pattisdr added run unsafe ci checks and removed run unsafe ci checks labels Jun 6, 2022

pattisdr commented Jun 6, 2022

Merge main - conflicts changelog.

ec4ed2e

pattisdr marked this pull request as ready for review June 6, 2022 19:48

pattisdr removed the DON'T MERGE label Jun 6, 2022

seanpreston reviewed Jun 7, 2022

seanpreston approved these changes Jun 7, 2022

seanpreston merged commit abf0d90 into main Jun 7, 2022

seanpreston deleted the fidesops_574_restart_from_failure branch June 7, 2022 15:42

pattisdr mentioned this pull request Jun 7, 2022

[Datastore Management] Disable/Delete datastore BACKEND #602

Closed
