
DSR 3.0: First Class Tasks + Parallelization #4760

Merged: 142 commits from dsr_parallelization into main on Apr 24, 2024
Conversation

@pattisdr (Contributor) commented Mar 29, 2024

DSR 3.0 First Class Tasks and Parallelization

Closes #PROD-1895
Closes #PROD-1882

Description Of Changes

Adds the new DSR 3.0 scheduler, which supports executing DSRs in parallel and treats individual tasks in the Access, Erasure, and Consent graphs as first-class objects. The graph is built as usual, but its nodes are then persisted to the database as RequestTasks, which are queued and run as celery tasks in parallel. The scheduler is a custom event-driven one: when a node finishes running, it queues each of its downstream tasks, provided all of that task's upstream tasks have completed. The terminator task in each graph is also responsible for re-queueing the Privacy Request itself from the appropriate checkpoint to wrap up processing. The individual checkpoints of a privacy request still run sequentially, but the tasks within a graph run in parallel.
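
As a rough, self-contained illustration of this queueing rule — TaskNode, run_graph, and the in-memory deque below are stand-ins for the persisted RequestTasks and celery queueing in Fides, not the actual implementation:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskNode:
    key: str
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)
    status: str = "pending"  # pending -> queued -> complete


def run_graph(nodes: Dict[str, TaskNode], root_key: str = "__ROOT__") -> List[str]:
    """Run the root, then queue each downstream node once all of its
    upstream nodes have completed. Returns the order nodes ran in."""
    order: List[str] = []
    queue = deque([root_key])
    while queue:
        node = nodes[queue.popleft()]
        node.status = "complete"  # stand-in for running the collection's access/erasure/consent logic
        order.append(node.key)
        for child_key in node.downstream:
            child = nodes[child_key]
            if child.status == "pending" and all(
                nodes[up].status == "complete" for up in child.upstream
            ):
                child.status = "queued"  # in DSR 3.0 this would queue a celery task instead
                queue.append(child_key)
    return order


# Diamond graph: the terminator waits for both collections to finish.
nodes = {
    "__ROOT__": TaskNode("__ROOT__", downstream=["a", "b"]),
    "a": TaskNode("a", upstream=["__ROOT__"], downstream=["__TERMINATE__"]),
    "b": TaskNode("b", upstream=["__ROOT__"], downstream=["__TERMINATE__"]),
    "__TERMINATE__": TaskNode("__TERMINATE__", upstream=["a", "b"]),
}
print(run_graph(nodes))  # ['__ROOT__', 'a', 'b', '__TERMINATE__']
```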

Maintains support for DSR 2.0, which uses Dask to run each task in the Access, Erasure, and Consent graphs sequentially and in memory.

👉 Key changes to pay attention to are in graph_task.py, request_runner_service.py, create_request_tasks.py, and execute_request_tasks.py for DSR 3.0.

DSR 2.0 is largely contained in deprecated_graph_task.py.

Code Changes

Data model

  • New encrypted PrivacyRequest column filtered_final_upload, which stores, for a limited time, the encrypted access results uploaded to the user, keyed by policy rule
  • New encrypted PrivacyRequest column access_result_urls, which stores the access result URLs in case a DSR 3.0 Privacy Request exits and is requeued, losing the in-memory access result URLs (this can happen when a policy has both an access and an erasure rule)
  • New RequestTasks model. Request tasks are now first-class objects in DSR 3.0. Every node in the access, consent, and erasure graphs (each roughly corresponding to a collection) is saved as a RequestTask to the database immediately after the graph is built. RequestTasks store the intermediate data they collect on themselves, whereas DSR 2.0 stores it in the cache. For backwards compatibility, we continue to write to the cache regardless. A schematic sketch of the model follows this list
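
For orientation, a schematic sketch of what such a model could look like; the table name and any columns beyond those described above are assumptions for illustration, not the real Fides schema:

```python
from sqlalchemy import JSON, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class RequestTask(Base):
    """One node of an access/erasure/consent graph, persisted per collection (simplified)."""

    __tablename__ = "requesttask"  # assumed name

    id = Column(String, primary_key=True)
    privacy_request_id = Column(String, index=True)
    action_type = Column(String)          # access | erasure | consent
    collection_address = Column(String)   # dataset:collection this node targets
    status = Column(String)               # e.g. pending | complete | error | skipped
    upstream_tasks = Column(JSON)         # addresses this node waits on
    downstream_tasks = Column(JSON)       # addresses queued when this node completes
    collection = Column(JSON)             # serialized Collection, rehydrated via Collection.parse_from_task
    access_data = Column(JSON)            # intermediate results (encrypted in practice)
```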

Endpoints

  • Removes the async callback endpoints that supported the deprecated v1 Manual Connector, which assumed sequential task execution ("/privacy-request/{privacy_request_id}/manual_input" and "/privacy-request/{privacy_request_id}/erasure_confirm"). The underlying deprecated connector was never exposed in the Admin UI and is likewise removed here.
  • Adds GET /privacy-request/{privacy_request_id}/tasks to get all request tasks associated with a Privacy Request
  • Adds GET /privacy-request/{privacy_request_id}/requeue to requeue a stuck Privacy Request (use with caution). A sample client call for both endpoints follows this list
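
A hypothetical client call against a local deployment; only the two routes come from this PR, while the base path, bearer token, request id, and response handling shown here are assumptions:

```python
import requests

FIDES_API = "http://localhost:8080/api/v1"           # assumed local base path
headers = {"Authorization": "Bearer <access-token>"}  # placeholder token
pr_id = "pri_example"                                 # placeholder privacy request id

# List the persisted RequestTasks (one per graph node) for a privacy request.
tasks = requests.get(f"{FIDES_API}/privacy-request/{pr_id}/tasks", headers=headers)
print(tasks.status_code, tasks.json())

# Requeue a stuck privacy request from its last incomplete checkpoint (use with caution).
requeue = requests.get(f"{FIDES_API}/privacy-request/{pr_id}/requeue", headers=headers)
print(requeue.status_code)
```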

Internal

  • Removes the code that sends DSR graph and DSR failure analytics to fideslog. We are phasing out fideslog anyway, and this avoids refactoring those pieces for DSR 3.0. (Removes graph/analytics_events.py and graph/graph_differences.py)
  • Adds Collection.parse_from_task to support hydrating a serialized collection saved in the database under RequestTask.collection into a Collection object.
  • Adds a new ExecutionNode class used when executing a node; a simplified sketch follows this list. The existing TraversalNode class is now only used for building the graph, not executing it. An ExecutionNode largely knows only about itself and its incoming and outgoing data edges, which can be created from the database record, unlike a TraversalNode, which is aware of the whole graph. DSR 2.0 could use TraversalNodes directly because the graph was executed immediately, in memory.
  • Updates connectors to use ExecutionNodes instead of TraversalNodes
  • Adds new access_runner, erasure_runner, and consent_runner functions that support running an access, erasure, or consent graph on either the DSR 2.0 scheduler (in memory, in sequence, with Dask) or the DSR 3.0 scheduler (in parallel, with persisted request tasks)
  • GraphTasks are instantiated later, as part of running an individual celery task for a single node. They are now created from just TaskResources instead of TraversalNodes, since they no longer have the context of the whole graph.
  • Logic that writes ExecutionLogs now also updates the new RequestTask status, and can additionally mark every node reachable from the current node as errored if the current node fails.
  • Likely breaks the Fides Connector when run with DSR 3.0
  • Stops passing input_data into the request to mask a node. This was the upstream access data feeding into the current node, which was useful for erasure requests on email connectors that did not send access requests directly. That functionality no longer exists.
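
A simplified sketch of the TraversalNode/ExecutionNode split described above; attribute names such as traversal_details are illustrative stand-ins rather than the exact Fides fields:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ExecutionNode:
    """Built from a single persisted RequestTask row: it knows only itself and its
    data edges, unlike a TraversalNode, which is aware of the whole graph."""

    collection: Dict[str, Any]  # serialized Collection stored on RequestTask.collection
    incoming_edges: List[str]   # field addresses that feed data into this node
    outgoing_edges: List[str]   # field addresses this node feeds downstream

    @classmethod
    def from_request_task(cls, request_task: Any) -> "ExecutionNode":
        # Rehydrate from the database record -- no whole-graph context required.
        return cls(
            collection=request_task.collection,
            incoming_edges=list(request_task.traversal_details["incoming_edges"]),
            outgoing_edges=list(request_task.traversal_details["outgoing_edges"]),
        )
```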

AP Scheduler Tasks

  • A recurring task, poll_for_exited_privacy_request_tasks, runs on a configurable interval, looks for Privacy Requests where all Request Tasks have had a chance to run but some have errored, and puts the entire Privacy Request into an error status so it can be re-processed
  • A weekly task, DSR_DATA_REMOVAL, clears out RequestTasks and selected encrypted-PII columns on the Privacy Request model once they are older than a configurable TTL. A minimal registration sketch for both jobs follows this list
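
A minimal sketch of how these two jobs could be registered with APScheduler; the intervals, job ids, and empty function bodies are placeholders rather than the values and logic Fides uses:

```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger
from apscheduler.triggers.interval import IntervalTrigger

scheduler = BackgroundScheduler()


def poll_for_exited_privacy_request_tasks() -> None:
    """Find privacy requests whose tasks have all run but some errored,
    and mark the whole privacy request as errored so it can be reprocessed."""


def dsr_data_removal() -> None:
    """Delete RequestTasks and encrypted-PII Privacy Request columns older than the TTL."""


scheduler.add_job(
    poll_for_exited_privacy_request_tasks,
    IntervalTrigger(seconds=300),            # interval is configurable in Fides
    id="poll_for_exited_privacy_request_tasks",
)
scheduler.add_job(
    dsr_data_removal,
    CronTrigger(day_of_week="mon", hour=2),  # some weekly slot
    id="dsr_data_removal",
)
scheduler.start()
```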

Tests

  • Many of the DSR tests are modified to run on both the DSR 2.0 and DSR 3.0 schedulers; a minimal parametrization sketch follows this list
  • Privacy requests are now persisted in the database for tests, with the correct policy attached to the privacy request. This is needed for DSR 3.0 because each node runs as a celery task and relies on persisted objects. Many existing tests create a Privacy Request in memory, and the policy passed into the runner doesn't always match the policy on the privacy request (which is okay for 2.0 but not 3.0)
  • Policies also need to be persisted for DSR 3.0 testing, which means we can't use fake data categories like "A" and "B"
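
One possible shape for that parametrization, sketched with a throwaway settings object; the real Fides fixtures and config differ, this only shows the pattern of running each test once per scheduler:

```python
import types

import pytest


@pytest.fixture
def config():
    # Stand-in settings object; in Fides this would be the application config.
    return types.SimpleNamespace(execution=types.SimpleNamespace(use_dsr_3_0=False))


@pytest.fixture(params=["use_dsr_2_0", "use_dsr_3_0"])
def dsr_version(request, config):
    """Flip the execution setting so each test using this fixture runs once per scheduler."""
    config.execution.use_dsr_3_0 = request.param == "use_dsr_3_0"
    return request.param


def test_runs_under_both_schedulers(config, dsr_version):
    # A real test would kick off an access/erasure request here and assert the
    # results are identical under both schedulers.
    assert config.execution.use_dsr_3_0 == (dsr_version == "use_dsr_3_0")
```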

Steps to Confirm

The best way to test this is to run DSRs under a lot of different conditions on both DSR 3.0 and DSR 2.0 (with and without the worker as well!). I am continuing to test this week too.

  • In your toml file, toggle DSR 3.0 on and set task_always_eager to true (to test celery):

```toml
[execution]
use_dsr_3_0 = true

[celery]
task_always_eager = true
```
  • nox -s dev -- shell postgres mongodb worker
  • In another terminal, cd /fides/clients
  • turbo run dev
  • Set up postgres
  • http://localhost:3001/add-systems
  • Click Add a System
  • Add system name: postgres_example_test_dataset
  • Click Integrations
  • Integration type: PostgreSQL
  • Add credentials for the local postgres example container running in docker (these aren't private secrets; they are for the example postgres database defined in docker/docker-compose.integration-postgres.yml), then test the connection
[Screenshot: integration credentials form with successful test connection]
  • Upload postgres_example_test_dataset.yml and click Save
  • Set up postgres, mongodb, and mailchimp connectors - run access and erasure requests and verify data looks as expected
  • Set up mailchimp transactional and run consent requests
  • Verify email connector works as expected
  • In your toml file, turn off DSR 3.0 to use DSR 2.0:

```toml
[execution]
use_dsr_3_0 = false
```
  • Repeat, and verify you can use DSR 2.0 as expected
  • Verify exception logs in the Admin UI look as expected, except that DSR 3.0 shows evidence of tasks running in parallel (if you're running celery with task_always_eager=true and have the worker container up). Here, two tasks are queued at a time:
[Screenshot: logs showing two tasks queued at the same time]

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation: Internal
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md

pattisdr added 30 commits March 16, 2024 17:11
- When running an access request, after the graph is created, create individual Tasks for every node.
- The root node is already marked as completed, but stores the seed data.
- Queue the nodes that just need seed data
- run_node function rebuilds all the data it needs :( and then runs the task.  It then looks at its downstream tasks, and for every downstream task, checks to see if its upstream tasks are done. If so, it is queued.
- When a node is finished, it stores its data on itself
- Terminator task stores all the data on itself. When we reach the terminator task, the privacy request is again requeued -
…ched by that node as failed. Run a separate task in background scheduler that looks for privacy requests where all tasks are either completed or errored and marks the entire privacy request as errored.

Remove the code path where I started to figure out a graph task that just contained the code for the current node. I'll come back to this, but tabling it for now.
…ream nodes can get queued as well.

- Only allow one node for each privacy request for a certain action type?
- Instead of using the termination function, pull the final results off of all the completed nodes.
…es the cache seems empty on retry from the worker?
… sometimes the cache is empty on retry! Why? Using persisted identity instead for now.

- Revert back to having one *completed* task per action type per privacy request instead of one task.  So there can be more tha
…o manual data is included in here, and it's not filtered yet by the privacy request's policy, so it differs from our actual upload packet.
- erasure nodes need not only the data collected by the access node of the same name, but also the access data from all of its inputs.
- after we queue the erasure nodes, the privacy request exits
- after the erasure terminator node wraps up we queue the privacy request again downstream of building the erasure graph
- the only upstream dependencies for erasure nodes are related to the "erase_after" construct
- we need to double check in the new format that we're not creating cycles -
…- this was realized when comparing before and after access results and I was only picking up one address when orders reveals another address for customer 1 in example data.

GraphTask.access_request was looking at input keys, but I am trying to move from this concept. Instead, access GraphTask.results.
…sks individually - graph is simple, every node points to root and terminator node. Individual nodes have no dependencies currently.
- I don't think we should build the whole traversal every time we run a node - it's tricky though because a lot of internals assume we have access to the full graph everywhere.

Instead:
- Save a representation of the collection on the node along with traversal details
- Rehydrate in the form of an Execution Node that is used on the Graph Task.
- TraversalNodes should only be for building the graph, not executing it. The TraversalNode has awareness of the whole graph.
…y and execute_tasks.py for org purposes, but primarily at this stage to make sure the TraversalNode and the ExecutionNodes are being used in the proper places.
…task is completed, we just look for downstream nodes, and don't actually run the access/consent/erasure task.

- Only queue the root node to start.
- Only create one node for each collection for each action type, and update the statuses of that node, instead of creating multiple nodes of the same type with different statuses.
… dry - set the task to pending when it is being queued rather than inside the task.

- Remove erasure_prereqs from GraphTask.erasure_request since this is accounted for in the RequestTask.upstream_nodes
…ailed graph - skip over completed nodes and only run pending ones. Because different node paths can reach the terminator node first, if the terminator node has already been reached through one path, the terminator node should not run again - these are tasks that should just be allowed to complete with no operation.

On retry we need to retry all downstream tasks, completed or not.
…ting ready tasks that are incomplete. This is similar to how we do it now.
@pattisdr pattisdr requested a review from galvana April 15, 2024 14:09
@pattisdr (Contributor, Author):

@galvana I think this is ready for review! This week I am continuing to work on follow-up tickets to add things like catch-all schedulers and to manage the number of connections, but I think the core work is stable. Thank you for offering to review this 🙏

@galvana (Contributor) left a comment:

First round of comments!

Review comment threads on:
  • .fides/db_dataset.yml
  • tests/ops/integration_tests/saas/connector_runner.py
  • tests/ops/service/connectors/test_saas_connector.py
  • src/fides/api/task/graph_task.py
  • src/fides/api/util/saas_util.py
  • src/fides/api/models/privacy_request.py
  • src/fides/api/task/execute_request_tasks.py
@galvana (Contributor) commented Apr 17, 2024:

I completed the testing checklist and verified the different integration types work as expected for both DSR 2.0 and 3.0 🙌

- fix data category on yaml file
- Fix docstring accidentally removed in connector runner
- Fix duplicate variable in test
- Log # of request tasks removed in DSR Data Removal Tasks
- Add tests for when test_get_data_type_converter is passed in string "None"
- Fix poicy -> policy typo
- Make PRIVACY_REQUEST_STATUS_CHANGE_POLL use interval scheduler instead
…e still settling out. When restoring, make sure scopes are correct.
@pattisdr (Contributor, Author):

Thank you for your first-pass review @galvana, I've responded to your comments here. For database connection limits, I am looking into that under one of my other tickets, "Scheduler Robustness", to have ways to automatically schedule tasks that get into these weird states.

@pattisdr (Contributor, Author):

I introduced some bugs in my logging - fixing

@pattisdr pattisdr mentioned this pull request Apr 21, 2024
14 tasks
@galvana galvana self-requested a review April 23, 2024 19:02
@galvana (Contributor) left a comment:

Amazing work as usual in such a short amount of time! Thank you for thinking through so many edge cases and for cleanly supporting DSR 2.0 and 3.0 together. Approved!

@pattisdr pattisdr force-pushed the dsr_parallelization branch from 8a2fb8e to 930730e Compare April 24, 2024 04:13
@pattisdr pattisdr removed the "do not merge" label (Please don't merge yet, bad things will happen if you do) Apr 24, 2024
@pattisdr (Contributor, Author):

Thanks again for all the code review Adrian! 🏆

@pattisdr pattisdr merged commit 8fc3749 into main Apr 24, 2024
44 of 48 checks passed
@pattisdr pattisdr deleted the dsr_parallelization branch April 24, 2024 05:07