Fix Fields Affected on Execution Logs [#144] (#236)

* Log just potentially affected fields on the execution logs, rather than all of the fields.
* Add test asserting rules with overlapping data categories don't produce duplicate fields in log.
* Update reporting docs to correct an inaccuracy - queries are not logged in the execution logs, we just have the status of the request and potential fields affected.
  - Also have the examples show the in_processing/complete execution logs instead of pending execution logs, because we don't create pending execution logs for an individual collection anymore.
  - Privacy requests can also have a "paused" status due to pre-execution policy webhooks
* Add missing backticks around collection names.
ethyca · Mar 2, 2022 · 4202d24
1 parent ff43509

Showing 4 changed files with 341 additions and 47 deletions.
diff --git a/docs/fidesops/docs/guides/reporting.md b/docs/fidesops/docs/guides/reporting.md
@@ -3,16 +3,16 @@
 In this section we'll cover:
 
 - How to check the high-level status of your privacy requests
-- How to get more detailed execution logs of queries that were run as part of your privacy requests. 
+- How to get more detailed execution logs of collections and fields that were potentially affected as part of your privacy request.
 
 
 Take me directly to [API docs](/fidesops/api#operations-Privacy_Requests-get_request_status_api_v1_privacy_request_get).
 
 
 ## Overview
 
-The reporting feature allows you to fetch information about privacy requests. You can opt for high-level or more detailed 
-information about the individual queries executed internally.
+The reporting feature allows you to fetch information about privacy requests. You can opt for high-level status 
+information, or get more detailed information about the status of the requests on each of your collections.
 
 
 ## High-level Status
@@ -58,7 +58,7 @@ Use the following query params to further filter your privacy requests.  Filters
 `GET api/v1/privacy-request?created_gt=2021-10-01&created_lt=2021-10-05&status=pending`
 
 - id
-- status (one of `in_processing`, `pending`, `complete`, or `error`)
+- status (one of `in_processing`, `pending`, `paused`, `complete`, or `error`)
 - created_lt
 - created_gt
 - started_lt
@@ -68,6 +68,7 @@ Use the following query params to further filter your privacy requests.  Filters
 - errored_lt
 - errored_gt
 
+
 ## View All Privacy Request Logs
 
 To view all the execution logs for a Privacy Request, visit `/api/v1/privacy-request/{privacy_request_id}/logs`.
@@ -77,60 +78,64 @@ Check out the [API docs here](/fidesops/api#operations-Privacy_Requests-get_requ
 
 ## View Individual Privacy Request Log Details
 
-Use the `verbose` query param to see more details about individual queries run as part of the Privacy Request along
-with individual statuses. 
+Use the `verbose` query param to see more details about individual collections visited as part of the Privacy Request along
+with individual statuses. Individual collection statuses include `in_processing`, `retrying`, `complete` or `error`.
+You may see multiple logs for each collection as they reach different steps in the lifecycle.  
 
 `verbose` will embed a “results” key in the response, with execution logs grouped by dataset name.  In the example below,
-we have two datasets: `my-mongo-db` and `my-postgres-db`. There is one execution log for my-mongo-db and two execution
-logs for my-postgres-db.  The embedded execution logs are automatically truncated at 50 logs, so to view the entire 
-list of logs, visit the execution logs endpoint separately.
+we have two datasets: `my-mongo-db` and `my-postgres-db`. There are two execution logs for `my-mongo-db` (when the `flights` 
+collection is starting execution and when the `flights` collection has finished) and two execution
+logs for `my-postgres-db` (when the `order` collection is starting and finishing execution).  `fields_affected` are the fields
+that were potentially returned or masked based on the Rules you've specified on the Policy. The embedded execution logs 
+are automatically truncated at 50 logs, so to view the entire list of logs, visit the execution logs endpoint separately.
 
-`GET api/v1/privacy-request?verbose=True`
+`GET api/v1/privacy-request?id={privacy_request_id}&verbose=True`
 
 ```json
 {
     "items": [
         {
-            "id": "pri_5f4feff5-fb60-4286-82bd-7e0748ce90ac",
-            "created_at": "2021-10-04T17:36:32.223287+00:00",
-            "started_processing_at": "2021-10-04T17:36:37.248880+00:00",
-            "finished_processing_at": "2021-10-04T17:36:37.263121+00:00",
-            "status": "pending",
+            "id": "pri_2e0655c3-7a76-425e-8c4c-52fee32ce14b",
+            "created_at": "2022-02-28T16:38:03.878898+00:00",
+            "started_processing_at": "2022-02-28T16:38:04.021763+00:00",
+            "finished_processing_at": "2022-02-28T16:38:06.211547+00:00",
+            "status": "complete",
+            "external_id": null,
             "results": {
                 "my-mongo-db": [
                     {
-                        "collection_name": "order",
+                        "collection_name": "flights",
+                        "fields_affected": [],
+                        "message": "starting",
+                        "action_type": "access",
+                        "status": "in_processing",
+                        "updated_at": "2022-02-28T16:38:04.668513+00:00"
+                    },
+                    {
+                        "collection_name": "flights",
                         "fields_affected": [
                             {
-                                "path": "order.customer_name",
-                                "field_name": "name",
+                                "path": "mongo_test:flights:passenger_information.full_name",
+                                "field_name": "passenger_information.full_name",
                                 "data_categories": [
                                     "user.provided.identifiable.name"
                                 ]
                             }
                         ],
-                        "message": null,
+                        "message": "success",
                         "action_type": "access",
-                        "status": "pending",
-                        "updated_at": "2021-10-05T18:24:55.570430+00:00"
+                        "status": "complete",
+                        "updated_at": "2022-02-28T16:38:04.727094+00:00"
                     }
                 ],
                 "my-postgres-db": [
                     {
                         "collection_name": "order",
-                        "fields_affected": [
-                            {
-                                "path": "order.customer_name",
-                                "field_name": "name",
-                                "data_categories": [
-                                    "user.provided.identifiable.name"
-                                ]
-                            }
-                        ],
-                        "message": null,
+                        "fields_affected": [],
+                        "message": "starting",
                         "action_type": "access",
-                        "status": "pending",
-                        "updated_at": "2021-10-05T18:24:39.953914+00:00"
+                        "status": "in_processing",
+                        "updated_at": "2022-02-28T16:38:04.668513+00:00"
                     },
                     {
                         "collection_name": "order",
@@ -142,11 +147,11 @@ list of logs, visit the execution logs endpoint separately.
                                     "user.provided.identifiable.name"
                                 ]
                             }
-                        ],
-                        "message": null,
+                        ], 
+                        "message": "success",
                         "action_type": "access",
-                        "status": "pending",
-                        "updated_at": "2021-10-05T18:24:45.240612+00:00"
+                        "status": "complete",
+                        "updated_at": "2022-02-28T16:39:04.668513+00:00"
                     }
                 ]
             }
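
Because each collection can now appear more than once in the verbose response (once per lifecycle step), a reader usually wants only the most recent log for each collection. The following is a minimal sketch and not part of this commit: it assumes the `results` shape shown in the docs example above, the sample data is a trimmed copy of that example, and `latest_log_per_collection` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Any, Dict, List

# Trimmed copy of the "results" object from the docs example in this diff.
results: Dict[str, List[Dict[str, Any]]] = {
    "my-mongo-db": [
        {"collection_name": "flights", "status": "in_processing",
         "updated_at": "2022-02-28T16:38:04.668513+00:00"},
        {"collection_name": "flights", "status": "complete",
         "updated_at": "2022-02-28T16:38:04.727094+00:00"},
    ],
    "my-postgres-db": [
        {"collection_name": "order", "status": "in_processing",
         "updated_at": "2022-02-28T16:38:04.668513+00:00"},
        {"collection_name": "order", "status": "complete",
         "updated_at": "2022-02-28T16:39:04.668513+00:00"},
    ],
}


def latest_log_per_collection(
    results: Dict[str, List[Dict[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: keep the newest log per "dataset:collection" key."""
    latest: Dict[str, Dict[str, Any]] = {}
    for dataset_name, logs in results.items():
        for log in logs:
            key = f"{dataset_name}:{log['collection_name']}"
            current = latest.get(key)
            # Parse the ISO 8601 timestamps so the comparison is chronological.
            if current is None or datetime.fromisoformat(
                log["updated_at"]
            ) > datetime.fromisoformat(current["updated_at"]):
                latest[key] = log
    return latest


latest = latest_log_per_collection(results)
for key, log in latest.items():
    print(key, log["status"])
```

Since the embedded logs are truncated at 50, a summary built this way may be incomplete for large requests; the separate execution logs endpoint would be the fuller source.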

diff --git a/src/fidesops/task/graph_task.py b/src/fidesops/task/graph_task.py
@@ -18,8 +18,9 @@
     TERMINATOR_ADDRESS,
     FieldPath,
     Field,
+    FieldAddress,
 )
-from fidesops.graph.graph import Edge, DatasetGraph
+from fidesops.graph.graph import Edge, DatasetGraph, Node
 from fidesops.graph.traversal import TraversalNode, Traversal
 from fidesops.models.connectionconfig import ConnectionConfig, AccessLevel
 from fidesops.models.policy import ActionType, Policy
@@ -219,14 +220,9 @@ def log_end(
             logger.info(f"Ending {self.resources.request.id}, {self.key}")
             self.update_status(
                 "success",
-                [
-                    {
-                        "field_name": field.name,
-                        "path": f"{self.traversal_node.node.address}:{field.name}",
-                        "data_categories": field.data_categories,
-                    }
-                    for field in self.traversal_node.node.collection.field_dict.values()
-                ],
+                build_affected_field_logs(
+                    self.traversal_node.node, self.resources.policy, action_type
+                ),
                 action_type,
                 ExecutionLogStatus.complete,
             )
@@ -487,3 +483,52 @@ def termination_fn(*dependent_values: int) -> Tuple[int, ...]:
         )
 
         return erasure_update_map
+
+
+def build_affected_field_logs(
+    node: Node, policy: Policy, action_type: ActionType
+) -> List[Dict[str, Any]]:
+    """For a given node (collection), policy, and action_type (access or erasure) format all of the fields that
+    were potentially touched to be stored in the ExecutionLogs for troubleshooting.
+
+    :Example:
+    [{
+        "path": "dataset_name:collection_name:field_name",
+        "field_name": "field_name",
+        "data_categories": ["data_category_1", "data_category_2"]
+    }]
+    """
+
+    targeted_field_paths: Dict[FieldAddress, str] = {}
+
+    for rule in policy.rules:
+        if rule.action_type != action_type:
+            continue
+        rule_categories: List[str] = rule.get_target_data_categories()
+        if not rule_categories:
+            continue
+
+        collection_categories: Dict[
+            str, List[FieldPath]
+        ] = node.collection.field_paths_by_category
+        for rule_cat in rule_categories:
+            for collection_cat, field_paths in collection_categories.items():
+                if collection_cat.startswith(rule_cat):
+                    targeted_field_paths.update(
+                        {
+                            node.address.field_address(field_path): collection_cat
+                            for field_path in field_paths
+                        }
+                    )
+
+    ret: List[Dict[str, Any]] = []
+    for field_address, data_categories in targeted_field_paths.items():
+        ret.append(
+            {
+                "path": field_address.value,
+                "field_name": field_address.field_path.string_path,
+                "data_categories": [data_categories],
+            }
+        )
+
+    return ret
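
The matching in `build_affected_field_logs` can be illustrated without the fidesops classes. The sketch below is a simplified stand-in, not the code from this commit: plain strings and dicts replace `Node`, `Policy`, `FieldAddress`, and `FieldPath`, and the names `affected_fields` and `address_prefix` and the sample categories are assumptions made for illustration. It shows the two behaviours the commit message describes: only fields whose data category starts with a rule's target category are logged, and rules with overlapping categories do not produce duplicate entries, because the fields are keyed by their address.

```python
from typing import Dict, List


def affected_fields(
    rule_categories_per_rule: List[List[str]],
    field_paths_by_category: Dict[str, List[str]],
    address_prefix: str,
) -> List[Dict[str, object]]:
    """Simplified stand-in for the prefix matching and de-duplication."""
    # Keyed by full field address, so a field matched by several rules
    # is stored once (the last matching category wins).
    targeted: Dict[str, str] = {}
    for rule_categories in rule_categories_per_rule:
        for rule_cat in rule_categories:
            for collection_cat, field_paths in field_paths_by_category.items():
                if collection_cat.startswith(rule_cat):
                    for field_path in field_paths:
                        targeted[f"{address_prefix}:{field_path}"] = collection_cat

    return [
        {
            "path": path,
            # Drop the "dataset:collection:" prefix to recover the field name.
            "field_name": path.split(":", 2)[-1],
            "data_categories": [category],
        }
        for path, category in targeted.items()
    ]


# Two rules whose target categories overlap, plus one untargeted field.
logs = affected_fields(
    rule_categories_per_rule=[
        ["user.provided.identifiable"],
        ["user.provided.identifiable.name"],
    ],
    field_paths_by_category={
        "user.provided.identifiable.name": ["passenger_information.full_name"],
        "system.operations": ["flight_no"],
    },
    address_prefix="mongo_test:flights",
)
print(logs)
```

With these inputs the untargeted `flight_no` field is left out, and `passenger_information.full_name` appears once even though both rules match it.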