This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

Add support for Array Access Requests in MongoDB [#146][#147] (#194)
* Start expanding the initial populated mongo data and the mongo example dataset with representative nested examples.

* Instead of using pandas `json_normalize` to retrieve only the data categories we care about, add a new method, select_field_from_input_data, that uses a FieldPath to select the relevant data from input_data and add it to the "base" dictionary. This method takes into account that a FieldPath may point to data within arrays.
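
A minimal sketch of path-based selection through nested data, for illustration only. The helper name `select_by_path`, the tuple-of-strings path, and the sample `comment` values are assumptions and are not the FieldPath API or the select_field_from_input_data implementation from this commit.

```python
from typing import Any, Tuple


def select_by_path(data: Any, path: Tuple[str, ...]) -> Any:
    """Follow `path` through nested dicts, fanning out across any lists on the way."""
    if not path:
        return data
    if isinstance(data, list):
        # Apply the remaining path to every element and drop elements with no match.
        picked = [select_by_path(item, path) for item in data]
        return [p for p in picked if p is not None]
    if isinstance(data, dict) and path[0] in data:
        return select_by_path(data[path[0]], path[1:])
    return None


# Hypothetical row: only the first embedded document has a chat_name.
row = {"thread": [{"chat_name": "Jane C", "comment": "com_0001"}, {"comment": "com_0002"}]}
names = select_by_path(row, ("thread", "chat_name"))
```

Here the path crosses an array of embedded documents, so the result is a list holding the values found in the elements that contain the field.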

* Cache the inputs that were used to locate records on each collection in redis, to use to filter privacy request data after the fact.  For example, an email from one collection might have been used to find records in another collection where an email was located in an array.   Only this matched email in the array may be the relevant data the user wants to see.

* Use inputs into the collection to potentially filter array data to only contain matched values. Also modify get_collection_inputs to take a privacy_request_id string, instead of the entire object for easier testing.

Say we looked up records on a collection that has at least one array containing the passenger_id 112. So the input is: "passenger_ids": [112]. This returned a row with "passenger_ids": [111, 112, 113, 114]. The default behavior is to return only the matched value, "passenger_ids": [112].
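
The passenger_ids example above can be sketched as follows. The helper `keep_matched` is a hypothetical name used for illustration and is not a function from this commit.

```python
from typing import Any, List


def keep_matched(values: List[Any], inputs: List[Any]) -> List[Any]:
    """Keep only the array values that were used as inputs to locate the row."""
    return [value for value in values if value in inputs]


row = {"passenger_ids": [111, 112, 113, 114]}
filtered = {"passenger_ids": keep_matched(row["passenger_ids"], [112])}
```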

* Update `to_dask_input_data` to consolidate array outputs and outputs from arrays of dicts into a single array to feed into subsequent collections (flatten_and_merge_matches/append).

Restore original behavior that if a FieldPath is not found in input data, we don't return an empty dict (add new strip_empty_dicts function).

* Don't delete empty dicts out of arrays.

* Uncomment more complex mongo dataset annotations and add more detailed tests on GraphTask.to_dask_input function around more complicated nested object and array structures to verify how data is being consolidated and passed into downstream collections.

- Make the customer_details.comments (array of strings) field structure more complicated: it becomes customer_details.comments.comment_id (array of objects), which points to the mongo_test.conversations.thread.comment field.

* Add draft of build_incoming_refined_target_paths to recursively expand incoming target paths to contain indices where applicable.

* First draft of adding method to remove embedded documents and array indices where incoming field paths do not match.

* Before filtering results on data category, first run "remove_unmatched_array_paths" to remove unmatched embedded documents in arrays or array indices that did not match incoming field paths.

- Remove "only" param from select_field_from_input_data now that this concept has moved to
- Fix bug in remove_unmatched_array_paths to loop through arrays in reverse to remove elements.
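
A short sketch of why the removal loop runs in reverse: deleting from the end of a list leaves the indices of earlier elements unchanged. The helper `remove_unmatched` and its `matched_indices` parameter are hypothetical and only illustrate the idea behind the fix.

```python
from typing import Any, List, Set


def remove_unmatched(items: List[Any], matched_indices: Set[int]) -> None:
    """Delete, in place, every element whose index is not in matched_indices."""
    # Walk from the last index to the first so each deletion
    # does not shift the positions still waiting to be checked.
    for index in reversed(range(len(items))):
        if index not in matched_indices:
            del items[index]


thread = ["a", "b", "c", "d"]
remove_unmatched(thread, {1, 3})
```

Iterating forward instead would shift the remaining elements left after each deletion, so later indices would point at the wrong elements.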

* First cleanup round, reorganize/rename newly added methods, breaking some methods out into their own python modules.  Fix type annotations.

* Add more detailed tests on inner components of refine_target_path and filter_element_match.

* Add some more integration tests on accessing array data in mongo, end-to-end using new mongo dataset and mongo initialized data.

* Refactor so "filter_element_match" happens after each access request, so that only matched embedded documents and array values are used to build subsequent queries. I was previously doing this all at once at the end, when filtering results before returning to the user, but in some cases that would be too late. For example, embedded documents that didn't match would be used to locate records in subsequent tables, causing us to potentially over-select data that wasn't relevant to the given user. (This piece will be adjusted to be configurable too.)

This also means I don't have to cache the inputs to the collection, since I'm using the inputs directly after I run an access request on a node.

* Move filter_data_categories back to "graph_task.py" so the diff is easier to review - move consolidate_query_matches into its own module.

* Update quickstart expected results from access request to include nested object and array data and expand postman collection to include more of the mongo array edge cases that we use in the mongo_example_test_dataset.yml file.

* Give the postgres and mongo connection configs write access in postman, so erasures can be automatically performed.   Change the default erasure to use [email protected] to parallel the access request.

* Add logging for debugging purposes.

* Add guides for working with complex data (move nested object docs, and add new array docs).

* Fix failing test - (CI is incorrectly showing green).

* Address bug related to type coercion. Cast incoming values to the correct type before using them to filter out array data. Data may have been coerced from one type to another to query a collection. For example, integer values returned from one collection may have been used to find corresponding string values on another collection. To filter out unmatched string values in the results, we need to likewise convert the inputs to strings.

- Reuse existing method "QueryConfig.typed_filtered_values" which requires a TraversalNode - shift this method and query_field_paths to the TraversalNode itself.
- Address a few docstring issues.
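
A minimal sketch of the type coercion issue, assuming integer inputs are matched against an array of strings. The helper `keep_matched_typed`, its `cast` parameter, and the plane values are hypothetical; the commit itself reuses QueryConfig.typed_filtered_values for this.

```python
from typing import Any, Callable, Iterable, List


def keep_matched_typed(
    values: List[Any], inputs: Iterable[Any], cast: Callable[[Any], Any]
) -> List[Any]:
    """Cast the inputs to the target field's type, then keep only matching values."""
    typed_inputs = {cast(value) for value in inputs}
    return [value for value in values if value in typed_inputs]


planes = ["20001", "20002", "20003"]  # string[] field on the queried collection
inputs = [20002]                      # integer values from an upstream collection

without_cast = [p for p in planes if p in inputs]
with_cast = keep_matched_typed(planes, inputs, str)
```

Without the cast, the string "20002" never equals the integer 20002, so the filter would discard every element, including the one that matched the query.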

* Rename remove_empty_objects to remove_empty_containers and use it to delete both empty arrays and empty dicts which can have a cascading effect.
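
The cascading effect can be sketched as below. The function `prune_empty` is a hypothetical stand-in and not the remove_empty_containers implementation; in particular, dropping empty elements from inside lists is an assumption of this sketch.

```python
from typing import Any


def prune_empty(value: Any) -> Any:
    """Recursively drop empty dicts and lists, pruning children before parents."""
    if isinstance(value, dict):
        pruned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in pruned.items() if item not in ({}, [])}
    if isinstance(value, list):
        pruned_items = [prune_empty(item) for item in value]
        return [item for item in pruned_items if item not in ({}, [])]
    return value


doc = {"name": "Jane", "workplace_info": {"direct_reports": []}, "comments": [{}]}
cleaned = prune_empty(doc)
```

Removing the empty `direct_reports` array leaves `workplace_info` empty, so it is removed in turn; that is the cascade.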

* Turn all customer ids into integers on mongo collections.

* Rephrase docstring.
pattisdr authored Feb 16, 2022
1 parent ff77f38 commit 2c05356
Showing 33 changed files with 3,395 additions and 886 deletions.
99 changes: 71 additions & 28 deletions README.md
@@ -55,35 +55,78 @@ information for Jane across tables in both the postgres and mongo databases.

```json
{
-  "postgres_example_test_dataset:customer": [
-    {
-      "email": "[email protected]",
-      "name": "Jane Customer"
-    }
-  ],
-  "postgres_example_test_dataset:address": [
-    {
-      "city": "Example Mountain",
-      "house": 1111,
-      "state": "TX",
-      "street": "Example Place",
-      "zip": "54321"
-    }
-  ],
-  "postgres_example_test_dataset:payment_card": [
-    {
-      "ccn": 373719391,
-      "code": 222,
-      "name": "Example Card 3"
-    }
-  ],
-  "mongo_test:customer_details": [
-    {
-      "gender": "female",
-      "birthday": "1990-02-28T00:00:00"
-    }
-  ]
+  "mongo_test:flights": [
+    {
+      "passenger_information": {
+        "full_name": "Jane Customer"
+      }
+    }
+  ],
+  "mongo_test:payment_card": [
+    {
+      "ccn": "987654321",
+      "name": "Example Card 2",
+      "code": "123"
+    }
+  ],
+  "postgres_example_test_dataset:address": [
+    {
+      "zip": "54321",
+      "street": "Example Place",
+      "state": "TX",
+      "city": "Example Mountain",
+      "house": 1111
+    }
+  ],
+  "mongo_test:customer_details": [
+    {
+      "birthday": "1990-02-28T00:00:00",
+      "gender": "female",
+      "children": [
+        "Erica Example"
+      ]
+    }
+  ],
+  "postgres_example_test_dataset:customer": [
+    {
+      "email": "[email protected]",
+      "name": "Jane Customer"
+    }
+  ],
+  "postgres_example_test_dataset:payment_card": [
+    {
+      "ccn": 373719391,
+      "name": "Example Card 3",
+      "code": 222
+    }
+  ],
+  "mongo_test:employee": [
+    {
+      "email": "[email protected]",
+      "name": "Jane Employee"
+    }
+  ],
+  "mongo_test:conversations": [
+    {
+      "thread": [
+        {
+          "chat_name": "Jane C"
+        }
+      ]
+    },
+    {
+      "thread": [
+        {
+          "chat_name": "Jane C"
+        },
+        {
+          "chat_name": "Jane C"
+        }
+      ]
+    }
+  ]
}

```

### Step Four: Create an Erasure Policy
157 changes: 157 additions & 0 deletions data/dataset/mongo_example_test_dataset.yml
@@ -25,6 +25,8 @@ dataset:
fidesops_meta:
data_type: string
- name: workplace_info
fidesops_meta:
data_type: object
fields:
- name: employer
fidesops_meta:
@@ -33,8 +35,51 @@ dataset:
data_categories: [ user.provided.identifiable.job_title ]
fidesops_meta:
data_type: string
- name: direct_reports
data_categories: [ user.provided.identifiable.name ]
fidesops_meta:
data_type: string[]
- name: emergency_contacts
fidesops_meta:
data_type: object[]
fields:
- name: name
data_categories: [ user.provided.identifiable.name ]
fidesops_meta:
data_type: string
- name: relationship
fidesops_meta:
data_type: string
- name: phone
data_categories: [ user.provided.identifiable.contact.phone_number ]
fidesops_meta:
data_type: string
- name: children
data_categories: [ user.provided.identifiable.childrens ]
fidesops_meta:
data_type: string[]
- name: travel_identifiers
fidesops_meta:
data_type: string[]
data_categories: [system.operations]
- name: comments
fidesops_meta:
data_type: object[]
fields:
- name: comment_id
fidesops_meta:
data_type: string
references:
- dataset: mongo_test
field: conversations.thread.comment
direction: to
- name: internal_customer_profile
fields:
- name: _id
data_categories: [ system.operations ]
fidesops_meta:
primary_key: True
data_type: object_id
- name: customer_identifiers
fields:
- name: internal_id
@@ -44,6 +89,11 @@ dataset:
- dataset: mongo_test
field: customer_feedback.customer_information.internal_customer_id
direction: from
- name: derived_emails
data_categories: [user.derived]
fidesops_meta:
data_type: string[]
identity: email
- name: derived_interests
data_categories: [ user.derived ]
fidesops_meta:
@@ -81,3 +131,110 @@ dataset:
data_categories: [ user.provided.nonidentifiable ]
fidesops_meta:
data_type: string
- name: flights
fields:
- name: _id
data_categories: [ system.operations ]
fidesops_meta:
primary_key: True
data_type: object_id
- name: passenger_information
fields:
- name: passenger_ids
fidesops_meta:
data_type: string[]
references:
- dataset: mongo_test
field: customer_details.travel_identifiers
direction: from
- name: full_name
data_categories: [user.provided.identifiable.name]
fidesops_meta:
data_type: string
- name: flight_no
- name: date
- name: pilots
data_categories: [ system.operations ]
fidesops_meta:
data_type: string[]
- name: plane
data_categories: [ system.operations ]
fidesops_meta:
data_type: integer
- name: conversations
fidesops_meta:
data_type: object[]
fields:
- name: thread
fields:
- name: comment
fidesops_meta:
data_type: string
- name: message
fidesops_meta:
data_type: string
- name: chat_name
data_categories: [ user.provided.identifiable.name ]
fidesops_meta:
data_type: string
- name: employee
fields:
- name: email
data_categories: [ user.provided.identifiable.contact.email ]
fidesops_meta:
identity: email
data_type: string
- name: id
data_categories: [ user.derived.identifiable.unique_id ]
fidesops_meta:
primary_key: True
references:
- dataset: mongo_test
field: flights.pilots
direction: from
- name: name
data_categories: [ user.provided.identifiable.name ]
fidesops_meta:
data_type: string
- name: aircraft
fields:
- name: _id
data_categories: [ system.operations ]
fidesops_meta:
primary_key: True
data_type: object_id
- name: planes
data_categories: [ system.operations ]
fidesops_meta:
data_type: string[]
references:
- dataset: mongo_test
field: flights.plane
direction: from
- name: model
data_categories: [ system.operations ]
fidesops_meta:
data_type: string
- name: payment_card
fields:
- name: billing_address_id
data_categories: [ system.operations ]
- name: ccn
data_categories: [ user.provided.identifiable.financial.account_number ]
fidesops_meta:
references:
- dataset: mongo_test
field: conversations.thread.ccn
direction: from
- name: code
data_categories: [ user.provided.identifiable.financial ]
- name: customer_id
data_categories: [ user.derived.identifiable.unique_id ]
- name: id
data_categories: [ system.operations ]
fidesops_meta:
primary_key: True
- name: name
data_categories: [ user.provided.identifiable.financial ]
- name: preferred
data_categories: [ user.provided.nonidentifiable ]