Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSR 3.0: First Class Tasks + Parallelization #4760

Merged
merged 142 commits into from
Apr 24, 2024
Merged
Show file tree
Hide file tree
Changes from 129 commits
Commits
Show all changes
142 commits
Select commit Hold shift + click to select a range
a9d73bb
Rapid POC - create Tasks model.
pattisdr Mar 16, 2024
d5b2200
Start calculating all descendants that can be reached by the current …
pattisdr Mar 17, 2024
86764ab
If a current node fails, mark it and every descendant that can be rea…
pattisdr Mar 17, 2024
0a25647
Fix some bugs so that if a privacy request is re-processed its downst…
pattisdr Mar 17, 2024
e1de985
Improve code comments
pattisdr Mar 17, 2024
a09fb33
Use persisted identity data instead of cached identity data - sometim…
pattisdr Mar 17, 2024
f105f15
The identity data needs to be there to build that initial traversal -…
pattisdr Mar 17, 2024
df3dd61
Create the nodes in topological sort order to roughly mimic the order…
pattisdr Mar 17, 2024
3fb4329
Add an endpoint for privacy request tasks.
pattisdr Mar 17, 2024
bfc7c09
Add an endpoint that exposes raw access data collected by the DAG - n…
pattisdr Mar 17, 2024
aceb9aa
POC investigating building custom scheduler for erasures -
pattisdr Mar 19, 2024
d09a126
Merge branch 'main' into first_class_tasks_exploration
pattisdr Mar 20, 2024
65eeba1
Right - upstream node order matters when going into downstream nodes …
pattisdr Mar 20, 2024
c7bff40
Put data in the same order as input_keys instead for now
pattisdr Mar 20, 2024
fda5866
Add marking a privacy request as failed if all of its erasure tasks a…
pattisdr Mar 20, 2024
452ee74
Explore building graph for backend consent propagation and running ta…
pattisdr Mar 21, 2024
7618cc9
Large WIP refactor - breaking increment.
pattisdr Mar 22, 2024
9e3dd04
Split out functions from graph_task.py into two files, create_tasks.p…
pattisdr Mar 22, 2024
b5ddf35
Refactor so individual task functions can be run multiple times - if …
pattisdr Mar 22, 2024
a5a3f7f
Make the methods for building access, erasure, and consent tasks more…
pattisdr Mar 22, 2024
54a9412
Dry up some logging.
pattisdr Mar 22, 2024
d13d001
Fix new bugs introduced with reprocessing - when we're re-running a f…
pattisdr Mar 22, 2024
93a55bf
Instead of queuing from the root on reprocessing, queue only the exis…
pattisdr Mar 23, 2024
91004ee
Remove dask.
pattisdr Mar 23, 2024
bec80db
Cleanup pass -
pattisdr Mar 23, 2024
e2fd170
Fix reprocessing bug. Errored tasks need to be marked as pending
pattisdr Mar 23, 2024
871553b
Merge branch 'main' into first_class_tasks_exploration
pattisdr Mar 23, 2024
5b98d3b
Bump downrev
pattisdr Mar 23, 2024
c26e0af
Move some TraversalNodes to ExecutionNodes -
pattisdr Mar 23, 2024
bb77cb6
Fix bug in serializing ObjectFields from the stored RequestTask in th…
pattisdr Mar 23, 2024
f589d46
Remove this hack after I fixed it properly on main.
pattisdr Mar 23, 2024
2fe7df6
Remove some methods that have shifted to other classes or I think are…
pattisdr Mar 23, 2024
e15ad77
Comment out moved function imports in tests so the collect-tests step…
pattisdr Mar 23, 2024
c3974d1
Instead of having new TaskStatuses, share with existing ExecutionLogS…
pattisdr Mar 23, 2024
883272c
Experimenting with how async work ties in - pausing a task, resuming …
pattisdr Mar 23, 2024
ade8fea
Fix bug - skipped tasks are complete tasks, so shouldn't be considere…
pattisdr Mar 24, 2024
6103444
Remove some of the caching methods from TaskResources and stop cachin…
pattisdr Mar 24, 2024
bdf4485
Some sorting/formatting
pattisdr Mar 24, 2024
86cd579
Store access data/data in erasure format as strings using json.dump t…
pattisdr Mar 24, 2024
f0504f3
Remove write_execution_log from TaskResources - just trying to keep t…
pattisdr Mar 24, 2024
77b1744
Store data collected by individual tasks as encrypted. remove separat…
pattisdr Mar 24, 2024
008c802
Some formatting -
pattisdr Mar 24, 2024
7986d8f
Restore making run_privacy_requst async and adding the @sync decorato…
pattisdr Mar 24, 2024
c91e8bc
As we upload the filtered access results to the user, combined with t…
pattisdr Mar 24, 2024
a8d8910
Add a task that removes access data and data for erasures after the t…
pattisdr Mar 24, 2024
f08f712
Update POC DSR_DATA_REMOVAL task to just delete old request tasks ins…
pattisdr Mar 24, 2024
6c66e49
Remove other cache_data_use_map instances
pattisdr Mar 24, 2024
d41f713
Remove the async methods again - need to explore this further - getti…
pattisdr Mar 24, 2024
92d5894
Don't log the collection when caching failed checkpoint details now t…
pattisdr Mar 24, 2024
f986149
Catch errors if we can't hydrate the GraphTask from the database. Th…
pattisdr Mar 24, 2024
9a1f2df
Formatting -
pattisdr Mar 24, 2024
9e71444
Set rows_masked before commit that runs in update_status -
pattisdr Mar 24, 2024
ead8979
Fix the dry run queries - we have to mock different objects now -
pattisdr Mar 25, 2024
d3a141d
Remove the unused "original" Manual Connector that was later replaced…
pattisdr Mar 25, 2024
78f3b2d
Experiment with a separate endpoint if we had a scheduling error and …
pattisdr Mar 25, 2024
e63e214
Making polling interval configurable.
pattisdr Mar 25, 2024
8b1feb8
Make request task, a first class object now, an argument into retriev…
pattisdr Mar 25, 2024
caee1ce
WIP Restore old task runners and reinstall Dask.
pattisdr Mar 29, 2024
b072f02
Restore removed pieces of caching data use field map.
pattisdr Mar 29, 2024
c32fcf2
Adjust test import.
pattisdr Mar 29, 2024
b8af08a
Fix serializing Collections with "after" and "erase_after" -
pattisdr Mar 29, 2024
e8e7c25
Update some tests to acommodate new retrieve_data, mask_data signatur…
pattisdr Mar 29, 2024
9a1d7be
Merge branch 'main' into dsr_parallelization
pattisdr Mar 29, 2024
1503660
Address fixing existing tests after restoring dask runner - revert ge…
pattisdr Mar 30, 2024
3b4a5dd
Update tests to reflect changes made
pattisdr Mar 31, 2024
e27475a
Linting pass
pattisdr Mar 31, 2024
e57252b
Repoint tasks to request runner -
pattisdr Mar 31, 2024
801cd4c
POC test_access_runner - runs access request with both DSR 2.0 and DS…
pattisdr Mar 31, 2024
79bdd23
Remove endpoints resuming privacy request from a specific paused coll…
pattisdr Mar 31, 2024
d6389da
Repoint tests of contains_field from Node to Collection.
pattisdr Mar 31, 2024
f5c22df
Add initial tests for create_tasks -
pattisdr Apr 1, 2024
46fe4c2
Add tests on erase_after and erasure order around how upstream and do…
pattisdr Apr 2, 2024
cf16de0
Expand test coverage for DSR 3.0.
pattisdr Apr 3, 2024
320e574
Merge branch 'main' into dsr_parallelization
pattisdr Apr 3, 2024
7b7226b
Work on adjusting tests that were relying on postgres databases to be…
pattisdr Apr 3, 2024
614d9e4
Add erasure_runner_tester and start testing sql_task tests with both …
pattisdr Apr 3, 2024
0601564
Work on address mongo tests to run with both DSR 2.0 and DSR 3.0. So…
pattisdr Apr 3, 2024
6f838d0
Fixing bug in consent graph if no actual nodes - this was caught when…
pattisdr Apr 3, 2024
aa572b4
Handle edge case where a privacy request has a policy with both acces…
pattisdr Apr 4, 2024
6a58cd2
Fix output of privacy request.get_raw_masking_counts()
pattisdr Apr 4, 2024
3dd0b06
Work on execution tests which have some edge cases with execution -
pattisdr Apr 4, 2024
7fd5563
Build a consent runner tester that allows testing both DSR 2.0 and 3.0
pattisdr Apr 4, 2024
a73b14a
Update wunderkind to use new consent runner that uses both schedulers.
pattisdr Apr 4, 2024
f6a4937
Saving access urls meant mocks needed to take this into account.
pattisdr Apr 4, 2024
3d76cff
Update new request tasks tests to be more repeatable so they can rege…
pattisdr Apr 4, 2024
cb7c517
Geez. Rename "queue_privacy_request" boolean to privacy_request_proce…
pattisdr Apr 4, 2024
fb26566
Start switching runners to test runners for saas connectors.
pattisdr Apr 4, 2024
68aa076
Merge branch 'main' into dsr_parallelization
pattisdr Apr 4, 2024
3ba43e7
Remove saas erasure order datasets - their location is elsewhere unde…
pattisdr Apr 5, 2024
596c624
Good example of a test that needed a lot of tweaks to run on both sch…
pattisdr Apr 5, 2024
a3358fa
Get Adyen test working with connector runner.
pattisdr Apr 5, 2024
42ec89c
Pass to put saas connector tests running on DSR 2.0 and 3.0
pattisdr Apr 8, 2024
caea586
Extend DSR 3.0 testing
pattisdr Apr 9, 2024
f2c662e
Merge branch 'main' into dsr_parallelization
pattisdr Apr 9, 2024
786fe19
Bump downrev after main merge.
pattisdr Apr 9, 2024
ceacb89
Add unit tests and integration tests for GET Request Tasks and GET Ac…
pattisdr Apr 9, 2024
9e5c706
Add API tests for queuing a task manually.
pattisdr Apr 9, 2024
305b510
Linting pass -
pattisdr Apr 9, 2024
8a15ad0
Rename Request Task consent success to consent sent - to better refle…
pattisdr Apr 9, 2024
3c70e17
Undo commit of migration backups and mongo sample moved to root direc…
pattisdr Apr 9, 2024
02039f0
Add tests for Collection.json() and its reverse, Collection.parse_fro…
pattisdr Apr 9, 2024
03b5b44
Move format_traversal_details_for_save onto the TraversalNode. Add l…
pattisdr Apr 10, 2024
5120b07
Remove RequestTask.callback_succeeded. This was for async connector e…
pattisdr Apr 10, 2024
ba39664
Merge main -
pattisdr Apr 10, 2024
6e68a7f
Add testing for scheduled tasks that mark Privacy Requests as errored…
pattisdr Apr 10, 2024
e3caf7f
Expand test coverage for new methods on Privacy Request and new model…
pattisdr Apr 10, 2024
d914750
Remove some unused methods for the v1 manual connector.
pattisdr Apr 10, 2024
a2a963d
Improve code comments on request runner service.
pattisdr Apr 10, 2024
6d2f37a
Expand test coverage for creating request tasks.
pattisdr Apr 10, 2024
f1db17b
Dry up creating a request task from a Traversal node with function ad…
pattisdr Apr 10, 2024
915b6b0
Add further testing of marking the current and downstream nodes as fa…
pattisdr Apr 11, 2024
d2e18ed
Update code comments on graph_task.
pattisdr Apr 11, 2024
242ea91
Fix copy/paste error -
pattisdr Apr 11, 2024
c2bcc79
Firebase auth tests need to clear the identities from the cache - bec…
pattisdr Apr 11, 2024
ce010c7
Remove await from access_runner_tester -
pattisdr Apr 11, 2024
f31b839
Address failing onsignal and shopify tests related to running erasure…
pattisdr Apr 11, 2024
6e00d4f
Even if scheduler is configured to use DSR 3.0, override with DSR 2.0…
pattisdr Apr 11, 2024
b9eea3b
Update docstrings.
pattisdr Apr 12, 2024
c5153e5
Merge branch 'main' into dsr_parallelization
pattisdr Apr 12, 2024
d4e9959
Get rid of passing in upstream inputs into the current erasure node -…
pattisdr Apr 12, 2024
ccf63a9
Privacy request fixture caches an email and phone identity, but the c…
pattisdr Apr 12, 2024
ca8682b
Continued test fixes.
pattisdr Apr 12, 2024
c4dc912
Make the ExecutionNode contextualizable.
pattisdr Apr 12, 2024
32531c9
Merge branch 'main' into dsr_parallelization
pattisdr Apr 12, 2024
605e7b7
Pin setuptools version
pattisdr Apr 12, 2024
fa491c3
Fix connector runner to pass in privacy request id if supplied -
pattisdr Apr 12, 2024
3e84d92
Removing onesignal test on DSR 2.0 - let's just test on DSR 3.0, fixt…
pattisdr Apr 12, 2024
6ab9f5e
Revert "Pin setuptools version"
pattisdr Apr 12, 2024
f6d749e
Merge branch 'main' into dsr_parallelization
pattisdr Apr 12, 2024
51a45ca
Fix typo in docstring.
pattisdr Apr 14, 2024
6266d74
Don't run the request body unless a task has a pending status -
pattisdr Apr 15, 2024
cbe9585
Make dsr data removal run daily instead of weekly - and separate the …
pattisdr Apr 15, 2024
cc8ceed
Respond to a round of CR:
pattisdr Apr 18, 2024
63959b2
Merge branch 'main' into dsr_parallelization
pattisdr Apr 18, 2024
ccfb530
Remove experimental endpoint to get access data while requirements ar…
pattisdr Apr 18, 2024
83cbcd6
Fix logging message
pattisdr Apr 19, 2024
2bd5665
DSR 3.0 Improve External DB Connection Management (#4805)
pattisdr Apr 20, 2024
08cf713
Merge branch 'main' into dsr_parallelization
pattisdr Apr 22, 2024
06c4beb
Formatting -
pattisdr Apr 22, 2024
dff192a
DSR Scheduler Robustness (#4817)
pattisdr Apr 23, 2024
b5a479d
Merge branch 'main' into dsr_parallelization
pattisdr Apr 23, 2024
930730e
Polish pass
pattisdr Apr 24, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 42 additions & 0 deletions .fides/db_dataset.yml
Original file line number Diff line number Diff line change
Expand Up @@ -1276,6 +1276,8 @@ dataset:
- name: privacyrequest
data_categories: []
fields:
- name: access_result_urls
data_categories: [user]
- name: cancel_reason
data_categories: [system.operations]
- name: canceled_at
Expand All @@ -1289,6 +1291,8 @@ dataset:
data_categories: [system.operations]
- name: external_id
data_categories: [system.operations]
- name: filtered_final_upload
data_categories: [user]
- name: finished_processing_at
data_categories: [system.operations]
- name: id
Expand Down Expand Up @@ -2077,3 +2081,41 @@ dataset:
data_categories: [system.operations]
- name: single_row
data_categories: [system.operations]
- name: requesttask
fields:
- name: access_data
data_categories: [user]
- name: action_type
data_categories: [system]
- name: all_descendant_tasks
data_categories: [system]
- name: collection
data_categories: [system]
- name: collection_address
data_categories: [system]
- name: collection_name
data_categories: [system]
- name: consent_sent
data_categories: [system]
- name: created_at
data_categories: [system]
- name: data_for_erasures
data_categories: [user]
- name: dataset_name
data_categories: [user]
pattisdr marked this conversation as resolved.
Show resolved Hide resolved
- name: downstream_tasks
data_categories: [system]
- name: id
data_categories: [system]
- name: privacy_request_id
data_categories: [system]
- name: rows_masked
data_categories: [system]
- name: status
data_categories: [system]
- name: traversal_details
data_categories: [system]
- name: updated_at
data_categories: [system]
- name: upstream_tasks
data_categories: [system]
1 change: 1 addition & 0 deletions .fides/fides.toml
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ task_retry_backoff = 1
subject_identity_verification_required = false
task_retry_count = 0
task_retry_delay = 1
use_dsr_3_0 = false
pattisdr marked this conversation as resolved.
Show resolved Hide resolved

[admin_ui]
enabled = true
Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ module = [
"jose.*",
"jwt.*",
"multidimensional_urlencode.*",
"networkx.*",
"okta.*",
"pandas.*",
"plotly.*",
Expand Down
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ multidimensional_urlencode==0.0.4
nh3==0.2.15
okta==2.7.0
openpyxl==3.0.9
networkx==3.1 # added to help with privacy preference data migration
networkx==3.1
pattisdr marked this conversation as resolved.
Show resolved Hide resolved
packaging==23.0
pandas==1.4.3
paramiko==3.1.0
Expand Down
133 changes: 133 additions & 0 deletions src/fides/api/alembic/migrations/versions/55bedede956d_requesttask.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
"""requesttask

Revision ID: 55bedede956d
Revises: d49a767eb49d
Create Date: 2024-04-04 04:12:25.332952

"""
import sqlalchemy as sa
import sqlalchemy_utils
from alembic import op
from sqlalchemy.dialects import postgresql

# revision identifiers, used by Alembic.
revision = "55bedede956d"
down_revision = "d49a767eb49d"
branch_labels = None
depends_on = None


def upgrade():
op.create_table(
"requesttask",
sa.Column("id", sa.String(length=255), nullable=False),
sa.Column(
"created_at",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
nullable=True,
),
sa.Column(
"updated_at",
sa.DateTime(timezone=True),
server_default=sa.text("now()"),
nullable=True,
),
sa.Column("privacy_request_id", sa.String(), nullable=True),
sa.Column("collection_address", sa.String(), nullable=False),
sa.Column("dataset_name", sa.String(), nullable=False),
sa.Column("collection_name", sa.String(), nullable=False),
sa.Column("action_type", sa.String(), nullable=False),
sa.Column("status", sa.String(), nullable=False),
sa.Column(
"upstream_tasks", postgresql.JSONB(astext_type=sa.Text()), nullable=True
),
sa.Column(
"downstream_tasks", postgresql.JSONB(astext_type=sa.Text()), nullable=True
),
sa.Column(
"all_descendant_tasks",
postgresql.JSONB(astext_type=sa.Text()),
nullable=True,
),
sa.Column(
"access_data",
sqlalchemy_utils.types.encrypted.encrypted_type.StringEncryptedType(),
nullable=True,
),
sa.Column(
"data_for_erasures",
sqlalchemy_utils.types.encrypted.encrypted_type.StringEncryptedType(),
nullable=True,
),
sa.Column("rows_masked", sa.Integer(), nullable=True),
sa.Column("consent_sent", sa.Boolean(), nullable=True),
sa.Column("collection", postgresql.JSONB(astext_type=sa.Text()), nullable=True),
sa.Column(
"traversal_details", postgresql.JSONB(astext_type=sa.Text()), nullable=True
),
sa.ForeignKeyConstraint(
["privacy_request_id"], ["privacyrequest.id"], ondelete="SET NULL"
),
sa.PrimaryKeyConstraint("id"),
)
op.create_index(
op.f("ix_requesttask_action_type"), "requesttask", ["action_type"], unique=False
)
op.create_index(
op.f("ix_requesttask_collection_address"),
"requesttask",
["collection_address"],
unique=False,
)
op.create_index(
op.f("ix_requesttask_collection_name"),
"requesttask",
["collection_name"],
unique=False,
)
op.create_index(
op.f("ix_requesttask_dataset_name"),
"requesttask",
["dataset_name"],
unique=False,
)
op.create_index(op.f("ix_requesttask_id"), "requesttask", ["id"], unique=False)
op.create_index(
op.f("ix_requesttask_privacy_request_id"),
"requesttask",
["privacy_request_id"],
unique=False,
)
op.create_index(
op.f("ix_requesttask_status"), "requesttask", ["status"], unique=False
)
op.add_column(
"privacyrequest",
sa.Column(
"filtered_final_upload",
sqlalchemy_utils.types.encrypted.encrypted_type.StringEncryptedType(),
nullable=True,
),
)
op.add_column(
"privacyrequest",
sa.Column(
"access_result_urls",
sqlalchemy_utils.types.encrypted.encrypted_type.StringEncryptedType(),
nullable=True,
),
)


def downgrade():
op.drop_column("privacyrequest", "access_result_urls")
op.drop_column("privacyrequest", "filtered_final_upload")
op.drop_index(op.f("ix_requesttask_status"), table_name="requesttask")
op.drop_index(op.f("ix_requesttask_privacy_request_id"), table_name="requesttask")
op.drop_index(op.f("ix_requesttask_id"), table_name="requesttask")
op.drop_index(op.f("ix_requesttask_dataset_name"), table_name="requesttask")
op.drop_index(op.f("ix_requesttask_collection_name"), table_name="requesttask")
op.drop_index(op.f("ix_requesttask_collection_address"), table_name="requesttask")
op.drop_index(op.f("ix_requesttask_action_type"), table_name="requesttask")
op.drop_table("requesttask")
Loading
Loading