DSR 3.0: First Class Tasks + Parallelization #4760
Conversation
- When running an access request, after the graph is created, create individual Tasks for every node. - The root node is already marked as completed, but stores the seed data. - Queue the nodes that just need seed data - run_node function rebuilds all the data it needs :( and then runs the task. It then looks at its downstream tasks, and for every downstream task, checks to see if its upstream tasks are done. If so, it is queued. - When a node is finished, it stores its data on itself - Terminator task stores all the data on itself. When we reach the terminator task, the privacy request is again requeued -
…ched by that node as failed. Run a separate task in background scheduler that looks for privacy requests where all tasks are either completed or errored and marks the entire privacy request as errored. Remove the code path where I started to figure out a graph task that just contained the code for the current node. I'll come back to this, but tabling it for now.
…ream nodes can get queued as well. - Only allow one node for each privacy request for a certain action type? - Instead of using the termination function, pull the final results off of all the completed nodes.
…es the cache seems empty on retry from the worker?
… sometimes the cache is empty on retry! Why? Using persisted identity instead for now. - Revert back to having one *completed* task per action type per privacy request instead of one task. So there can be more tha
… in which they will need to be run -
…o manual data is included in here, and it's not filtered yet by the privacy request's policy, so it differs from our actual upload packet.
- erasure nodes need not only the data collected by the access node of the same name, but also the access data from all of its inputs. - after we queue the erasure nodes, the privacy request exits - after the erasure terminator node wraps up we queue the privacy request again downstream of building the erasure graph - the only upstream dependencies for erasure nodes are related to the "erase_after" construct - we need to double check in the new format that we're not creating cycles -
…- this was realized when comparing before and after access results and I was only picking up one address when orders reveals another address for customer 1 in example data. GraphTask.access_request was looking at input keys, but I am trying to move from this concept. Instead, access GraphTask.results.
…re completed or failed.
…sks individually - graph is simple, every node points to root and terminator node. Individual nodes have no dependencies currently.
- I don't think we should build the whole traversal every time we run a node - it's tricky though because a lot of internals assume we have access to the full graph everywhere. Instead: - Save a representation of the collection on the node along with traversal details - Rehydrate in the form of an Execution Node that is used on the Graph Task. - TraversalNodes should only be for building the graph, not executing it. The TraversalNode has awareness of the whole graph.
…y and execute_tasks.py for org purposes, but primarily at this stage to make sure the TraversalNode and the ExecutionNodes are being used in the proper places.
…task is completed, we just look for downstream nodes, and don't actually run the access/consent/erasure task. - Only queue the root node to start. - Only create one node for each collection for each action type, and update the statuses of that node, instead of creating multiple nodes of the same type with different statuses.
… dry - set the task to pending when it is being queued rather than inside the task. - Remove erasure_prereqs from GraphTask.erasure_request since this is accounted for in the RequestTask.upstream_nodes
…ailed graph - skip over completed nodes and only run pending ones. Because different node paths can reach the terminator node first, if the terminator node has already been reached through one path, the terminator node should not run again - these are tasks that should just be allowed to complete with no operation. On retry we need to retry all downstream tasks, completed or not.
…ting ready tasks that are incomplete. This is similar to how we do it now.
@galvana I think this is ready for review! This week I am continuing to work on follow-up tickets to add things like catch-all schedulers and manage the number of connections, but I think the core work is stable. Thank you for offering to review this 🙏
…configuration for removing the request tasks.
First round of comments!
I completed the testing checklist and verified the different integration types work as expected for both DSR 2.0 and 3.0 🙌
- fix data category on yaml file - Fix docstring accidentally removed in connector runner - Fix duplicate variable in test - Log # of request tasks removed in DSR Data Removal Tasks - Add tests for when test_get_data_type_converter is passed in string "None" - Fix poicy -> policy typo - Make PRIVACY_REQUEST_STATUS_CHANGE_POLL use interval scheduler instead
…e still settling out. When restoring, make sure scopes are correct.
Thank you for your first-pass review @galvana, I've responded to your comments here. For database connection limits, I am looking into that under one of my other tickets, "Scheduler Robustness", which will add ways to automatically schedule tasks that get into these weird states.
I introduced some bugs in my logging - fixing
# Conflicts: # .fides/db_dataset.yml
Amazing work as usual in such a short amount of time! Thank you for thinking through so many edge cases and for cleanly supporting DSR 2.0 and 3.0 together. Approved!
Thanks again for all the code review, Adrian! 🏆
DSR 3.0 First Class Tasks and Parallelization
Closes #PROD-1895
Closes #PROD-1882
Description Of Changes
Adds the new DSR 3.0 scheduler, which supports executing DSRs in parallel and treats individual tasks in the Access, Erasure, and Consent graphs as first-class objects. The graph is built as usual, but the nodes are then persisted to the database as `RequestTasks` that are queued and run as Celery tasks in parallel. It uses a custom event-driven scheduler: each node runs and, upon completion, queues its downstream tasks, provided each of those tasks has all of its upstream tasks completed. The terminator task in each graph is also responsible for requeueing the Privacy Request itself from the appropriate checkpoint to wrap up processing. The individual checkpoints for a privacy request still run sequentially, but the tasks within a graph run in parallel.

Maintains support for DSR 2.0, which uses Dask to run each of the tasks in the Access, Erasure, and Consent graphs sequentially and in memory.
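To make the event-driven flow concrete, here is a minimal sketch of the queueing rule described above; the names (`RequestTask`, `queue_task`, `requeue_privacy_request`, `__TERMINATE__`) are illustrative stand-ins rather than the actual fides implementation:

```python
# Minimal sketch of the DSR 3.0 event-driven queueing rule: when a node
# finishes, queue each downstream node whose upstream nodes are all complete;
# the terminator node requeues the privacy request. Illustrative names only.
from dataclasses import dataclass, field
from typing import Dict, List

TERMINATOR_ADDRESS = "__TERMINATE__"


@dataclass
class RequestTask:
    address: str
    upstream: List[str] = field(default_factory=list)    # upstream task addresses
    downstream: List[str] = field(default_factory=list)  # downstream task addresses
    status: str = "pending"


def queue_task(task: RequestTask) -> None:
    # Stand-in for handing the task to a Celery worker.
    print(f"queueing {task.address}")


def requeue_privacy_request() -> None:
    # Stand-in for requeueing the privacy request from its last checkpoint.
    print("requeueing privacy request")


def on_task_complete(task: RequestTask, tasks: Dict[str, RequestTask]) -> None:
    task.status = "complete"
    if task.address == TERMINATOR_ADDRESS:
        requeue_privacy_request()
        return
    for address in task.downstream:
        downstream = tasks[address]
        upstream_done = all(tasks[up].status == "complete" for up in downstream.upstream)
        if downstream.status == "pending" and upstream_done:
            queue_task(downstream)
```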
👉 Key changes to pay attention to are in `graph_task.py`, `request_runner_service.py`, `create_request_tasks.py`, and `execute_request_tasks.py` for DSR 3.0. DSR 2.0 is largely contained in `deprecated_graph_task.py`.

Code Changes
Data model
- `filtered_final_upload` that contains encrypted access results we uploaded to the user, keyed by policy rule, for a limited time
- `access_result_urls` to store the access result URLs in case the Privacy Request exits in DSR 3.0 and is requeued, losing access to the access result URLs held in memory (this could happen if a policy had both an access and an erasure rule)
- `RequestTasks` model. Request tasks are now first-class objects in DSR 3.0. Every single node in the access, consent, and erasure graphs (each of which corresponds roughly to a collection) is saved as a RequestTask to the database immediately after the graph is built. RequestTasks store the intermediate data they collect on themselves, whereas DSR 2.0 stores it in the cache. For backwards compatibility, we continue to write to the cache regardless.
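A rough sketch of the kind of persisted record this describes is below; the column names are illustrative placeholders (the PR itself only names `RequestTask.collection` and the upstream/downstream relationships), not the real schema:

```python
# Illustrative SQLAlchemy sketch of a persisted graph node ("request task").
# Column names are placeholders, not the actual fides model.
from sqlalchemy import JSON, Column, DateTime, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class RequestTask(Base):
    __tablename__ = "requesttask"

    id = Column(String, primary_key=True)
    privacy_request_id = Column(String, index=True)
    created_at = Column(DateTime(timezone=True))
    action_type = Column(String)          # access / erasure / consent
    collection_address = Column(String)   # e.g. "postgres_example:customer"
    status = Column(String, default="pending")
    upstream_tasks = Column(JSON)         # addresses this node waits on
    downstream_tasks = Column(JSON)       # addresses queued when this node completes
    collection = Column(JSON)             # serialized Collection, rehydrated at run time
    access_data = Column(JSON)            # intermediate results stored on the task itself
```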
Endpoints
- `/privacy-request/{privacy_request_id}/tasks` to get all request tasks associated with a Privacy Request
- `/privacy-request/{privacy_request_id}/requeue` to queue a stuck Privacy Request (use with caution)
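Hypothetical usage of these endpoints is sketched below; the HTTP methods, host, and auth header are assumptions for illustration, not taken from this PR:

```python
# Hypothetical calls to the new endpoints; methods, base URL, and auth header
# are assumptions for illustration only.
import requests

BASE_URL = "http://localhost:8080/api/v1"
HEADERS = {"Authorization": "Bearer <access-token>"}
privacy_request_id = "pri_123"

# List the request tasks for a privacy request (assumed to be a GET).
tasks = requests.get(
    f"{BASE_URL}/privacy-request/{privacy_request_id}/tasks", headers=HEADERS
).json()

# Requeue a stuck privacy request (assumed to be a POST) -- use with caution.
requests.post(
    f"{BASE_URL}/privacy-request/{privacy_request_id}/requeue", headers=HEADERS
)
```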
Internal
- `Collection.parse_from_task` to support hydrating a serialized collection saved in the database under `RequestTask.collection` into a `Collection` object.
- `ExecutionNode` class to be used for executing a node (see the sketch after this list). The existing `TraversalNode` class is now only used for building the graph, not executing it. The ExecutionNode largely only has knowledge of itself and its incoming and outgoing data edges, which we can create from the database record, unlike the TraversalNode, which is aware of the whole graph. DSR 2.0 was previously able to use the TraversalNode because the graph was executed immediately in memory.
- `access_runner`, `erasure_runner`, and `consent_runner`, which support running an access, erasure, or consent graph on the DSR 2.0 scheduler (in memory with Dask, in sequence) or the DSR 3.0 scheduler (in parallel, with persistent request tasks)
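The sketch referenced above, showing the ExecutionNode idea in simplified form; field and method names are illustrative, not the actual fides classes:

```python
# Simplified illustration: an ExecutionNode is rebuilt from the persisted
# RequestTask record and only knows itself and its incoming/outgoing edges,
# unlike a TraversalNode, which requires the whole in-memory graph.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ExecutionNode:
    address: str                 # "dataset:collection" for this node
    collection: Dict[str, Any]   # rehydrated from RequestTask.collection
    incoming_edges: List[str]    # upstream addresses feeding data into this node
    outgoing_edges: List[str]    # downstream addresses this node feeds

    @classmethod
    def from_request_task(cls, request_task: Any) -> "ExecutionNode":
        # Everything needed to run the node comes from the database record.
        return cls(
            address=request_task.collection_address,
            collection=request_task.collection,
            incoming_edges=list(request_task.upstream_tasks or []),
            outgoing_edges=list(request_task.downstream_tasks or []),
        )
```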
AP Scheduler Tasks
- `DSR_DATA_REMOVAL`, which clears out old RequestTasks, and selected columns containing encrypted PII on the Privacy Request model, once they are older than a configurable TTL
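A hedged sketch of what a TTL-based cleanup like this could look like; the table name, column name, and default TTL are placeholders:

```python
# Illustrative TTL cleanup: delete request task rows older than a configurable
# TTL. Table/column names and the TTL value are placeholders.
from datetime import datetime, timedelta, timezone

from sqlalchemy import create_engine, text


def remove_expired_request_tasks(database_url: str, ttl_days: int = 90) -> int:
    """Delete request tasks older than the TTL and return how many were removed."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=ttl_days)
    engine = create_engine(database_url)
    with engine.begin() as connection:
        result = connection.execute(
            text("DELETE FROM requesttask WHERE created_at < :cutoff"),
            {"cutoff": cutoff},
        )
    return result.rowcount or 0
```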
Tests

Steps to Confirm
The best way we can test this is to try to run DSRs under a lot of different conditions under both DSR 3.0 and DSR 2.0 (with and without the worker as well!). I am continuing to test this week too.
- `nox -s dev -- shell postgres mongodb worker`
- `cd /fides/clients`
- `turbo run dev`
- `postgres_example_test_dataset`
- `postgres_example_test_dataset.yml` and click Save

Pre-Merge Checklist
- `CHANGELOG.md`