* Start expanding the initial mongo data population and the mongo example dataset with representative nested examples.
* Instead of using pandas' `json_normalize` to retrieve only the data categories we care about, add a new method, `select_field_from_input_data`, that can use a `FieldPath` to select the data we care about from `input_data` and add it to the "base" dictionary. This method accounts for a `FieldPath` that points to data within arrays (a minimal sketch of this idea appears after this list).
* Cache the inputs that were used to locate records on each collection in redis, to use to filter privacy request data after the fact. For example, an email from one collection might have been used to find records in another collection where the email was located inside an array; only that matched email in the array may be the relevant data the user wants to see.
* Use the inputs into a collection to potentially filter array data down to only the matched values (sketched below). Also modify `get_collection_inputs` to take a `privacy_request_id` string instead of the entire object, for easier testing. Say we looked up records on a collection that has at least one array containing the passenger_id 112, so the input is `"passenger_ids": [112]`, and this returned a row with `"passenger_ids": [111, 112, 113, 114]`. The default behavior is to return just the matched field: `"passenger_ids": [112]`.
* Update `to_dask_input_data` to consolidate array outputs, and outputs from arrays of dicts, into a single array to feed into subsequent collections (`flatten_and_merge_matches`/append; sketched below). Restore the original behavior that if a `FieldPath` is not found in the input data, we don't return an empty dict (add a new `strip_empty_dicts` function).
* Don't delete empty dicts out of arrays.
* Uncomment the more complex mongo dataset annotations and add more detailed tests on the `GraphTask.to_dask_input` function around more complicated nested object and array structures, to verify how data is consolidated and passed into downstream collections.
  - Make the `customer_details.comments` (array of strings) field structure more complicated: it becomes `customer_details.comments.comment_id` (array of objects), which points to the `mongo_test.conversations.thread.comment` field.
* Add a draft of `build_incoming_refined_target_paths` to recursively expand incoming target paths to contain array indices where applicable (sketched below).
* First draft of a method to remove embedded documents and array indices where incoming field paths do not match.
* Before filtering results on data category, first run `remove_unmatched_array_paths` to remove embedded documents in arrays, or array indices, that did not match incoming field paths.
  - Remove the `only` param from `select_field_from_input_data` now that this concept has moved.
  - Fix a bug in `remove_unmatched_array_paths`: loop through arrays in reverse when removing elements (sketched below).
* First cleanup round: reorganize/rename the newly added methods, breaking some out into their own Python modules. Fix type annotations.
* Add more detailed tests on the inner components of `refine_target_path` and `filter_element_match`.
* Add more integration tests on accessing array data in mongo, end-to-end, using the new mongo dataset and initialized mongo data.
* Refactor so `filter_element_match` happens after each access request, so that only matched embedded documents and array values are used to build subsequent queries. I was previously doing this all at once at the end, when filtering results before returning them to the user, but in some cases that would be too late: embedded documents that didn't match would be used to locate records in subsequent tables, causing us to potentially over-select data that wasn't relevant to the given user. (This piece will be adjusted to be configurable too.) This also means I don't have to cache the inputs to the collection, since I'm using the inputs directly after I run an access request on a node.
* Move `filter_data_categories` back to `graph_task.py` so the diff is easier to review; move `consolidate_query_matches` into its own module.
* Update the quickstart's expected access request results to include nested object and array data, and expand the Postman collection to include more of the mongo array edge cases that we use in the `mongo_example_test_dataset.yml` file.
* Give the postgres and mongo connection configs write access in Postman, so erasures can be performed automatically. Change the default erasure to use [email protected] to parallel the access request.
* Add logging for debugging purposes.
* Add guides for working with complex data (move the nested object docs, and add new array docs).
* Fix a failing test (CI is incorrectly showing green).
* Address a bug related to type coercion: cast incoming values to the correct type before using them to filter out array data (sketched below). Data may have been coerced from one type to another to query a collection; for example, integer results from one collection may have been used to find corresponding string values on another collection. To filter out unmatched string values in the results, we need to likewise convert the inputs to strings.
  - Reuse the existing method `QueryConfig.typed_filtered_values`, which requires a `TraversalNode`; shift this method and `query_field_paths` to the `TraversalNode` itself.
  - Address a few docstring issues.
* Rename `remove_empty_objects` to `remove_empty_containers` and use it to delete both empty arrays and empty dicts, which can have a cascading effect (sketched below).
* Turn all customer ids into integers on the mongo collections.
* Rephrase a docstring.
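For reference, here is a minimal, hypothetical sketch of the array-aware selection idea behind `select_field_from_input_data`. The function name, signature, and return shape are illustrative only; the real method also writes into a "base" dictionary.

```python
# Hypothetical, simplified sketch: select a dotted field path from a row,
# fanning out across arrays the way the commit describes. Not the real API.
from typing import Any, List


def select_field(record: Any, path: List[str]) -> List[Any]:
    """Return all values reachable at `path`, descending into arrays."""
    if not path:
        return [record]
    if isinstance(record, dict):
        if path[0] in record:
            return select_field(record[path[0]], path[1:])
        return []
    if isinstance(record, list):
        # A FieldPath may point *through* an array: apply the remaining
        # path to every element and collect all matches.
        matches: List[Any] = []
        for element in record:
            matches.extend(select_field(element, path))
        return matches
    return []


row = {"customer_details": {"comments": [{"comment_id": "com_1"}, {"comment_id": "com_2"}]}}
assert select_field(row, ["customer_details", "comments", "comment_id"]) == ["com_1", "com_2"]
```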
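The "only return matched array elements" behavior from the passenger_ids example could look roughly like the sketch below. This is an illustrative stand-in; `filter_element_match` in the codebase handles nested structures and refined paths as well.

```python
# Illustrative-only sketch of filtering an array down to the values that
# were actually used to locate the row (hypothetical helper name).
from typing import Any, Dict, List


def filter_matched(row: Dict[str, Any], field: str, inputs: List[Any]) -> Dict[str, Any]:
    """Keep only the array values that matched the collection's inputs."""
    value = row.get(field)
    if isinstance(value, list):
        row[field] = [v for v in value if v in inputs]
    return row


row = {"passenger_ids": [111, 112, 113, 114]}
assert filter_matched(row, "passenger_ids", [112]) == {"passenger_ids": [112]}
```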
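The consolidation step in `to_dask_input_data` merges scalar outputs, array outputs, and values pulled out of arrays of dicts into a single flat list for the next collection. A minimal sketch in the spirit of `flatten_and_merge_matches` (hypothetical standalone version, not the real signature):

```python
# Sketch: consolidate mixed scalar/array outputs into one flat list to feed
# a downstream collection's query.
from typing import Any, List


def flatten_and_merge(outputs: List[Any]) -> List[Any]:
    """Flatten nested lists of matched values into a single list."""
    merged: List[Any] = []
    for value in outputs:
        if isinstance(value, list):
            merged.extend(flatten_and_merge(value))
        else:
            merged.append(value)
    return merged


# One upstream row yielded a scalar email, another yielded an array:
assert flatten_and_merge(["a@example.com", ["b@example.com", "c@example.com"]]) == [
    "a@example.com", "b@example.com", "c@example.com",
]
```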
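The "refined target path" idea — expanding an incoming field path so it records *which* array indices matched — might look like the following. This is a speculative sketch of the concept behind `build_incoming_refined_target_paths`/`refine_target_path`; the real implementation's semantics and types may differ.

```python
# Hypothetical sketch: expand a field path against a row, recording the
# index of each array element whose leaf value matched.
from typing import Any, List, Union

DetailedPath = List[Union[str, int]]


def refine_target_path(data: Any, path: List[str], matches: List[Any]) -> List[DetailedPath]:
    """Return paths (with array indices) to every matched leaf value."""
    if not path:
        return [[]] if data in matches else []
    refined: List[DetailedPath] = []
    if isinstance(data, dict) and path[0] in data:
        for sub in refine_target_path(data[path[0]], path[1:], matches):
            refined.append([path[0]] + sub)
    elif isinstance(data, list):
        for i, element in enumerate(data):
            for sub in refine_target_path(element, path, matches):
                refined.append([i] + sub)
    return refined


row = {"thread": [{"comment": "com_1"}, {"comment": "com_2"}]}
# Only index 1 matched, so the refined path carries that index:
assert refine_target_path(row, ["thread", "comment"], ["com_2"]) == [["thread", 1, "comment"]]
```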
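The reverse-iteration bug fix is a classic pattern worth spelling out: deleting elements while walking an array forward shifts the remaining indices and skips neighbors, so deletion must walk the indices in reverse. A small sketch (hypothetical helper, not the real `remove_unmatched_array_paths` code):

```python
# Deleting while iterating forward skips elements as indices shift;
# iterating the indices in reverse avoids that.
def delete_unmatched(values: list, matched_indices: set) -> None:
    """Remove every element whose index is not in matched_indices."""
    for i in reversed(range(len(values))):
        if i not in matched_indices:
            del values[i]


row = [111, 112, 113, 114]
delete_unmatched(row, {1})
assert row == [112]
```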
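The type-coercion fix can be pictured like this: coerce the incoming match values to the type of the stored data before comparing, so integer inputs from one collection can match string array elements in another. The helper name here is illustrative; the actual fix reuses `QueryConfig.typed_filtered_values` moved onto `TraversalNode`.

```python
# Sketch: cast incoming values to the stored array's element type before
# filtering, so int inputs can match str elements (illustrative only).
from typing import Any, List


def coerce_inputs(array: List[Any], inputs: List[Any]) -> List[Any]:
    """Cast inputs to strings when the stored array holds strings."""
    if array and all(isinstance(v, str) for v in array):
        return [str(v) for v in inputs]
    return inputs


stored = ["111", "112", "113"]   # this collection stores string ids
incoming = [112]                 # upstream collection returned integers
matched = [v for v in stored if v in coerce_inputs(stored, incoming)]
assert matched == ["112"]
```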
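Finally, the cascading effect of `remove_empty_containers` — dropping an empty dict can empty its parent array, which can empty *its* parent dict — is easy to see in a minimal sketch (hypothetical standalone version of the renamed helper):

```python
# Sketch: recursively drop empty dicts and lists; removals cascade upward.
from typing import Any


def remove_empty_containers(data: Any) -> Any:
    """Return data with all empty dicts/lists pruned, recursively."""
    if isinstance(data, dict):
        cleaned = {k: remove_empty_containers(v) for k, v in data.items()}
        return {k: v for k, v in cleaned.items() if v not in ({}, [])}
    if isinstance(data, list):
        cleaned_list = [remove_empty_containers(v) for v in data]
        return [v for v in cleaned_list if v not in ({}, [])]
    return data


# Emptying the inner dict empties the array, which empties the outer dict:
assert remove_empty_containers({"a": [{"b": {}}]}) == {}
```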