
DSR 3.0: First Class Tasks + Parallelization #4760

Merged: 142 commits from dsr_parallelization into main on Apr 24, 2024
Conversation

@pattisdr (Contributor) commented Mar 29, 2024

DSR 3.0 First Class Tasks and Parallelization

Closes #PROD-1895
Closes #PROD-1882

Description Of Changes

Adds the new DSR 3.0 scheduler, which supports executing DSRs in parallel and treats individual tasks in the Access, Erasure, and Consent graphs as first-class objects. The graph is built as usual, but its nodes are then persisted to the database as RequestTasks, which are queued and run as celery tasks in parallel. The scheduler is a custom event-driven one: when a node finishes running, it queues each of its downstream tasks, provided all of that task's upstream tasks have completed. The terminator task in each graph is also responsible for re-queueing the Privacy Request itself from the appropriate checkpoint to wrap up processing. The individual checkpoints of a privacy request still run sequentially, but the tasks within a graph run in parallel.
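
As a rough, self-contained illustration of this queueing rule — TaskNode, run_graph, and the in-memory deque below are stand-ins for the persisted RequestTasks and celery queueing in Fides, not the actual implementation:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaskNode:
    key: str
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)
    status: str = "pending"  # pending -> queued -> complete


def run_graph(nodes: Dict[str, TaskNode], root_key: str = "__ROOT__") -> List[str]:
    """Run the root, then queue each downstream node once all of its
    upstream nodes have completed. Returns the order nodes ran in."""
    order: List[str] = []
    queue = deque([root_key])
    while queue:
        node = nodes[queue.popleft()]
        node.status = "complete"  # stand-in for running the collection's access/erasure/consent logic
        order.append(node.key)
        for child_key in node.downstream:
            child = nodes[child_key]
            if child.status == "pending" and all(
                nodes[up].status == "complete" for up in child.upstream
            ):
                child.status = "queued"  # in DSR 3.0 this would queue a celery task instead
                queue.append(child_key)
    return order


# Diamond graph: the terminator waits for both collections to finish.
nodes = {
    "__ROOT__": TaskNode("__ROOT__", downstream=["a", "b"]),
    "a": TaskNode("a", upstream=["__ROOT__"], downstream=["__TERMINATE__"]),
    "b": TaskNode("b", upstream=["__ROOT__"], downstream=["__TERMINATE__"]),
    "__TERMINATE__": TaskNode("__TERMINATE__", upstream=["a", "b"]),
}
print(run_graph(nodes))  # ['__ROOT__', 'a', 'b', '__TERMINATE__']
```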

Maintains support for DSR 2.0, which uses Dask to run each task in the Access, Erasure, and Consent graphs sequentially and in memory.

👉 Key changes to pay attention to are in graph_task.py, request_runner_service.py, create_request_tasks.py, and execute_request_tasks.py for DSR 3.0.

DSR 2.0 is largely contained in deprecated_graph_task.py.

Code Changes

Data model

  • New encrypted PrivacyRequest column filtered_final_upload, which stores, for a limited time, the encrypted access results uploaded to the user, keyed by policy rule
  • New encrypted PrivacyRequest column access_result_urls, which stores the access result URLs in case a DSR 3.0 Privacy Request exits and is requeued, losing the in-memory access result URLs (this can happen when a policy has both an access and an erasure rule)
  • New RequestTasks model. Request tasks are now first-class objects in DSR 3.0. Every node in the access, consent, and erasure graphs (each roughly corresponding to a collection) is saved as a RequestTask to the database immediately after the graph is built. RequestTasks store the intermediate data they collect on themselves, whereas DSR 2.0 stores it in the cache. For backwards compatibility, we continue to write to the cache regardless. A schematic sketch of the model follows this list
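
For orientation, a schematic sketch of what such a model could look like; the table name and any columns beyond those described above are assumptions for illustration, not the real Fides schema:

```python
from sqlalchemy import JSON, Column, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class RequestTask(Base):
    """One node of an access/erasure/consent graph, persisted per collection (simplified)."""

    __tablename__ = "requesttask"  # assumed name

    id = Column(String, primary_key=True)
    privacy_request_id = Column(String, index=True)
    action_type = Column(String)          # access | erasure | consent
    collection_address = Column(String)   # dataset:collection this node targets
    status = Column(String)               # e.g. pending | complete | error | skipped
    upstream_tasks = Column(JSON)         # addresses this node waits on
    downstream_tasks = Column(JSON)       # addresses queued when this node completes
    collection = Column(JSON)             # serialized Collection, rehydrated via Collection.parse_from_task
    access_data = Column(JSON)            # intermediate results (encrypted in practice)
```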

Endpoints

  • Removes the async callback endpoints that supported the deprecated v1 Manual Connector, which assumed sequential task execution ("/privacy-request/{privacy_request_id}/manual_input" and "/privacy-request/{privacy_request_id}/erasure_confirm"). The underlying deprecated connector was never exposed in the Admin UI and is likewise removed here.
  • Adds GET /privacy-request/{privacy_request_id}/tasks to get all request tasks associated with a Privacy Request
  • Adds GET /privacy-request/{privacy_request_id}/requeue to requeue a stuck Privacy Request (use with caution). A sample client call for both endpoints follows this list
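
A hypothetical client call against a local deployment; only the two routes come from this PR, while the base path, bearer token, request id, and response handling shown here are assumptions:

```python
import requests

FIDES_API = "http://localhost:8080/api/v1"           # assumed local base path
headers = {"Authorization": "Bearer <access-token>"}  # placeholder token
pr_id = "pri_example"                                 # placeholder privacy request id

# List the persisted RequestTasks (one per graph node) for a privacy request.
tasks = requests.get(f"{FIDES_API}/privacy-request/{pr_id}/tasks", headers=headers)
print(tasks.status_code, tasks.json())

# Requeue a stuck privacy request from its last incomplete checkpoint (use with caution).
requeue = requests.get(f"{FIDES_API}/privacy-request/{pr_id}/requeue", headers=headers)
print(requeue.status_code)
```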

Internal

  • Removes the code that sends DSR graph and DSR failure analytics to fideslog. We are phasing out fideslog anyway, and this avoids refactoring those pieces for DSR 3.0. (Removes graph/analytics_events.py and graph/graph_differences.py)
  • Adds Collection.parse_from_task to support hydrating a serialized collection saved in the database under RequestTask.collection into a Collection object.
  • Adds a new ExecutionNode class used when executing a node; a simplified sketch follows this list. The existing TraversalNode class is now only used for building the graph, not executing it. An ExecutionNode largely knows only about itself and its incoming and outgoing data edges, which can be created from the database record, unlike a TraversalNode, which is aware of the whole graph. DSR 2.0 could use TraversalNodes directly because the graph was executed immediately, in memory.
  • Updates connectors to use ExecutionNodes instead of TraversalNodes
  • Adds new access_runner, erasure_runner, and consent_runner functions that support running an access, erasure, or consent graph on either the DSR 2.0 scheduler (in memory, in sequence, with Dask) or the DSR 3.0 scheduler (in parallel, with persisted request tasks)
  • GraphTasks are instantiated later, as part of running an individual celery task for a single node. They are now created from just TaskResources instead of TraversalNodes, since they no longer have the context of the whole graph.
  • Logic that writes ExecutionLogs now also updates the new RequestTask status, and can additionally mark every node reachable from the current node as errored if the current node fails.
  • Likely breaks the Fides Connector when run with DSR 3.0
  • Stops passing input_data into the request to mask a node. This was the upstream access data feeding into the current node, which was useful for erasure requests on email connectors that did not send access requests directly. That functionality no longer exists.
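
A simplified sketch of the TraversalNode/ExecutionNode split described above; attribute names such as traversal_details are illustrative stand-ins rather than the exact Fides fields:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ExecutionNode:
    """Built from a single persisted RequestTask row: it knows only itself and its
    data edges, unlike a TraversalNode, which is aware of the whole graph."""

    collection: Dict[str, Any]  # serialized Collection stored on RequestTask.collection
    incoming_edges: List[str]   # field addresses that feed data into this node
    outgoing_edges: List[str]   # field addresses this node feeds downstream

    @classmethod
    def from_request_task(cls, request_task: Any) -> "ExecutionNode":
        # Rehydrate from the database record -- no whole-graph context required.
        return cls(
            collection=request_task.collection,
            incoming_edges=list(request_task.traversal_details["incoming_edges"]),
            outgoing_edges=list(request_task.traversal_details["outgoing_edges"]),
        )
```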

AP Scheduler Tasks

  • A recurring task, poll_for_exited_privacy_request_tasks, runs on a configurable interval, looks for Privacy Requests where all Request Tasks have had a chance to run but some have errored, and puts the entire Privacy Request into an error status so it can be re-processed
  • A weekly task, DSR_DATA_REMOVAL, clears out RequestTasks and selected encrypted-PII columns on the Privacy Request model once they are older than a configurable TTL. A minimal registration sketch for both jobs follows this list
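
A minimal sketch of how these two jobs could be registered with APScheduler; the intervals, job ids, and empty function bodies are placeholders rather than the values and logic Fides uses:

```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger
from apscheduler.triggers.interval import IntervalTrigger

scheduler = BackgroundScheduler()


def poll_for_exited_privacy_request_tasks() -> None:
    """Find privacy requests whose tasks have all run but some errored,
    and mark the whole privacy request as errored so it can be reprocessed."""


def dsr_data_removal() -> None:
    """Delete RequestTasks and encrypted-PII Privacy Request columns older than the TTL."""


scheduler.add_job(
    poll_for_exited_privacy_request_tasks,
    IntervalTrigger(seconds=300),            # interval is configurable in Fides
    id="poll_for_exited_privacy_request_tasks",
)
scheduler.add_job(
    dsr_data_removal,
    CronTrigger(day_of_week="mon", hour=2),  # some weekly slot
    id="dsr_data_removal",
)
scheduler.start()
```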

Tests

  • Many of the DSR tests are modified to run on both the DSR 2.0 and DSR 3.0 schedulers; a minimal parametrization sketch follows this list
  • Privacy requests are now persisted in the database for tests, with the correct policy attached to the privacy request. This is needed for DSR 3.0 because each node runs as a celery task and relies on persisted objects. Many existing tests create a Privacy Request in memory, and the policy passed into the runner doesn't always match the policy on the privacy request (which is okay for 2.0 but not 3.0)
  • Policies also need to be persisted for DSR 3.0 testing, which means we can't use fake data categories like "A" and "B"
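
One possible shape for that parametrization, sketched with a throwaway settings object; the real Fides fixtures and config differ, this only shows the pattern of running each test once per scheduler:

```python
import types

import pytest


@pytest.fixture
def config():
    # Stand-in settings object; in Fides this would be the application config.
    return types.SimpleNamespace(execution=types.SimpleNamespace(use_dsr_3_0=False))


@pytest.fixture(params=["use_dsr_2_0", "use_dsr_3_0"])
def dsr_version(request, config):
    """Flip the execution setting so each test using this fixture runs once per scheduler."""
    config.execution.use_dsr_3_0 = request.param == "use_dsr_3_0"
    return request.param


def test_runs_under_both_schedulers(config, dsr_version):
    # A real test would kick off an access/erasure request here and assert the
    # results are identical under both schedulers.
    assert config.execution.use_dsr_3_0 == (dsr_version == "use_dsr_3_0")
```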

Steps to Confirm

The best way to test this is to run DSRs under a lot of different conditions on both DSR 3.0 and DSR 2.0 (with and without the worker as well!). I am continuing to test this week too.

  • In your toml file, toggle DSR 3.0 on and set task_always_eager to true (to test celery):

```toml
[execution]
use_dsr_3_0 = true

[celery]
task_always_eager = true
```
  • nox -s dev -- shell postgres mongodb worker
  • In another terminal, cd /fides/clients
  • turbo run dev
  • Set up postgres
  • http://localhost:3001/add-systems
  • Click Add a System
  • Add system name: postgres_example_test_dataset
  • Click Integrations
  • Integration type: PostgreSQL
  • Add credentials for the local postgres example container running in docker (these aren't private secrets; they are for the example postgres database defined in docker/docker-compose.integration-postgres.yml), then test the connection
[Screenshot: integration credentials form with successful test connection]
  • Upload postgres_example_test_dataset.yml and click Save
  • Set up postgres, mongodb, and mailchimp connectors - run access and erasure requests and verify data looks as expected
  • Set up mailchimp transactional and run consent requests
  • Verify email connector works as expected
  • In your toml file, turn off DSR 3.0 to use DSR 2.0:

```toml
[execution]
use_dsr_3_0 = false
```
  • Repeat, and verify you can use DSR 2.0 as expected
  • Verify exception logs in the Admin UI look as expected, except that DSR 3.0 shows evidence of tasks running in parallel (if you're running celery with task_always_eager=true and have the worker container up). Here, two tasks are queued at a time:
[Screenshot: logs showing two tasks queued at the same time]

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation: Internal
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md

pattisdr added 30 commits March 16, 2024 17:11
- When running an access request, after the graph is created, create individual Tasks for every node.
- The root node is already marked as completed, but stores the seed data.
- Queue the nodes that just need seed data
- run_node function rebuilds all the data it needs :( and then runs the task.  It then looks at its downstream tasks, and for every downstream task, checks to see if its upstream tasks are done. If so, it is queued.
- When a node is finished, it stores its data on itself
- Terminator task stores all the data on itself. When we reach the terminator task, the privacy request is again requeued -
…ched by that node as failed. Run a separate task in background scheduler that looks for privacy requests where all tasks are either completed or errored and marks the entire privacy request as errored.

Remove the code path where I started to figure out a graph task that just contained the code for the current node. I'll come back to this, but tabling it for now.
…ream nodes can get queued as well.

- Only allow one node for each privacy request for a certain action type?
- Instead of using the termination function, pull the final results off of all the completed nodes.
…es the cache seems empty on retry from the worker?
… sometimes the cache is empty on retry! Why? Using persisted identity instead for now.

- Revert back to having one *completed* task per action type per privacy request instead of one task.  So there can be more tha
…o manual data is included in here, and it's not filtered yet by the privacy request's policy, so it differs from our actual upload packet.
- erasure nodes need not only the data collected by the access node of the same name, but also the access data from all of its inputs.
- after we queue the erasure nodes, the privacy request exits
- after the erasure terminator node wraps up we queue the privacy request again downstream of building the erasure graph
- the only upstream dependencies for erasure nodes are related to the "erase_after" construct
- we need to double check in the new format that we're not creating cycles -
…- this was realized when comparing before and after access results and I was only picking up one address when orders reveals another address for customer 1 in example data.

GraphTask.access_request was looking at input keys, but I am trying to move from this concept. Instead, access GraphTask.results.
…sks individually - graph is simple, every node points to root and terminator node. Individual nodes have no dependencies currently.
- I don't think we should build the whole traversal every time we run a node - it's tricky though because a lot of internals assume we have access to the full graph everywhere.

Instead:
- Save a representation of the collection on the node along with traversal details
- Rehydrate in the form of an Execution Node that is used on the Graph Task.
- TraversalNodes should only be for building the graph, not executing it. The TraversalNode has awareness of the whole graph.
…y and execute_tasks.py for org purposes, but primarily at this stage to make sure the TraversalNode and the ExecutionNodes are being used in the proper places.
…task is completed, we just look for downstream nodes, and don't actually run the access/consent/erasure task.

- Only queue the root node to start.
- Only create one node for each collection for each action type, and update the statuses of that node, instead of creating multiple nodes of the same type with different statuses.
… dry - set the task to pending when it is being queued rather than inside the task.

- Remove erasure_prereqs from GraphTask.erasure_request since this is accounted for in the RequestTask.upstream_nodes
…ailed graph - skip over completed nodes and only run pending ones. Because different node paths can reach the terminator node first, if the terminator node has already been reached through one path, the terminator node should not run again - these are tasks that should just be allowed to complete with no operation.

On retry we need to retry all downstream tasks, completed or not.
…ting ready tasks that are incomplete. This is similar to how we do it now.
@pattisdr pattisdr requested a review from galvana April 15, 2024 14:09
@pattisdr (Contributor, Author):

@galvana I think this is ready for review! This week I am continuing to work on follow-up tickets to add things like catch-all schedulers and to manage the number of connections, but I think the core work is stable. Thank you for offering to review this 🙏

@galvana (Contributor) left a comment:

First round of comments!

Review comment threads on:
  • .fides/db_dataset.yml
  • tests/ops/integration_tests/saas/connector_runner.py
  • tests/ops/service/connectors/test_saas_connector.py
  • src/fides/api/task/graph_task.py
  • src/fides/api/util/saas_util.py
  • src/fides/api/models/privacy_request.py
  • src/fides/api/task/execute_request_tasks.py
@galvana (Contributor) commented Apr 17, 2024:

I completed the testing checklist and verified the different integration types work as expected for both DSR 2.0 and 3.0 🙌

- fix data category on yaml file
- Fix docstring accidentally removed in connector runner
- Fix duplicate variable in test
- Log # of request tasks removed in DSR Data Removal Tasks
- Add tests for when test_get_data_type_converter is passed in string "None"
- Fix poicy -> policy typo
- Make PRIVACY_REQUEST_STATUS_CHANGE_POLL use interval scheduler instead
…e still settling out. When restoring, make sure scopes are correct.
@pattisdr (Contributor, Author):

Thank you for your first-pass review @galvana, I've responded to your comments here. For database connection limits, I am looking into that under one of my other tickets, "Scheduler Robustness", to have ways to automatically schedule tasks that get into these weird states.

@pattisdr (Contributor, Author):

I introduced some bugs in my logging - fixing

@pattisdr pattisdr mentioned this pull request Apr 21, 2024
14 tasks
@galvana galvana self-requested a review April 23, 2024 19:02
@galvana (Contributor) left a comment:

Amazing work as usual in such a short amount of time! Thank you for thinking through so many edge cases and for cleanly supporting DSR 2.0 and 3.0 together. Approved!

@pattisdr pattisdr force-pushed the dsr_parallelization branch from 8a2fb8e to 930730e Compare April 24, 2024 04:13
@pattisdr pattisdr removed the "do not merge" label (Please don't merge yet, bad things will happen if you do) Apr 24, 2024
@pattisdr (Contributor, Author):

Thanks again for all the code review Adrian! 🏆

@pattisdr pattisdr merged commit 8fc3749 into main Apr 24, 2024
44 of 48 checks passed
@pattisdr pattisdr deleted the dsr_parallelization branch April 24, 2024 05:07