
Add support for Array Access Requests in MongoDB [#146][#147] #194

Merged
seanpreston merged 26 commits into main from fidesops_146_array_support on Feb 16, 2022

Conversation

Contributor

@pattisdr pattisdr commented Feb 8, 2022

👉 A lot of the diff is new tests/rearranged tests/new dataset annotations/docs, etc.

Purpose

Adds array support for access requests in MongoDB. Note that erasures on arrays are not addressed here and will be worked on in #129. A lot of the array-specific changes were made in code, so they will be more easily used with other types of db connections. You should not attempt to run erasures on collections with arrays yet. See the Complex Field guides here for an introduction to arrays.

Changes

Three primary pieces are involved in adding better array support:

1) Array queries should only return the matched indices or sub-documents. (I actually do this in code, not as a db query.) I treat arrays as embedded documents here (there is interest in making this configurable #193).

2) Consolidating array results from one collection to feed into a query on downstream collections. A target path to an array field can really be a target path to multiple results. These multiple results need to be flattened so they can be used to make a query on another collection.

3) Filtering access request results for data categories on array data. For example, array results need to be returned, or results within arrays of objects need to be returned.

More specifics below:

  • Expands example mongo dataset and initialized data in MongoDB to include more complex object and array examples
  • On graph_task.access_request, after we've run an access request on a collection, I use a new filter_element_match method to keep only matched array indices and matched objects embedded in referenced array fields, where applicable. I then save this modified data in the cache, and also pass it on to the next collection for use in building more queries downstream. I'm doing this in code so it will be more broadly applicable to other datastores besides MongoDB; plus, this would be an expensive query to run db-side.
  • In to_dask_input_data (which takes inputs from potentially multiple collections and formats them to pass into the current collection), run consolidate_query_matches to combine data from potentially multiple scalar values, arrays, or arrays of embedded objects into a flat list. The next collection will run a query to see if the input fields match any of the consolidated data.
  • Updates filter_data_categories (which selects relevant data categories from each record to return to the user) to be able to select matching data from arrays. We were using json_normalize after we added nested objects, but now we need something a little more custom for arrays. I'm using a new method select_and_save_field to recurse through both objects and arrays to get the targeted data.
  • Some recursive functions were added to support the above, since with arrays, a target path can point to multiple values at unknown depths.
  • Lots of tests, reorganization of existing tests, and tests of the individual recursion pieces. tests/integration_tests/test_mongo_task.py::test_array_querying_mongo has the most comprehensive end-to-end test checking different array use cases.

Checklist

  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram)
  • Good unit test/integration test coverage

Ticket

Fixes #146
Fixes #147

…es we care about, add new method select_field_from_input_data that can use a FieldPath to select data we care about from input_data and add it to the "base" dictionary. This method takes into account that a FieldPath may point to data within arrays.
…in redis, to be used to filter privacy request data after the fact. For example, an email from one collection might have been used to find records in another collection where an email was located in an array. Only this matched email in the array may be the relevant data the user wants to see.
…ly contain matched values. Also modify get_collection_inputs to take a privacy_request_id string, instead of the entire object for easier testing.

Say we looked up records on a collection that has at least one array containing the passenger_id 112, so the input is "passenger_ids": [112]. This returned a row with passenger_ids: [111, 112, 113, 114]. Default behavior is to just return the matched field, "passenger_ids": [112].
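A minimal sketch of that default behavior on plain data (hypothetical values; the actual filtering is done by filter_element_match, shown further down):

```python
# Hypothetical data mirroring the example above: only array elements that
# appear in the incoming query values are kept in the returned row.
query_values = [112]
row = {"passenger_ids": [111, 112, 113, 114]}
row["passenger_ids"] = [v for v in row["passenger_ids"] if v in query_values]
assert row == {"passenger_ids": [112]}
```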
…from arrays of dicts into a single array to feed into subsequent collections (flatten_and_merge_matches/append).

Restore original behavior that if a FieldPath is not found in input data, we don't return an empty dict (add new strip_empty_dicts function).
…d tests on GraphTask.to_dask_input function around more complicated nested object and array structures to verify how data is being consolidated and passed into downstream collections.

- Make the customer_details.comments (array of strings) field structure more complicated: customer_details.comments.comment_id (array of objects), which points to the mongo_test.conversations.thread.comment field
…d incoming target paths to contain indices where applicable.
…ndices where incoming field paths do not match.
…d_array_paths" to remove unmatched embedded documents in arrays or array indices that did not match incoming field paths.

- Remove "only" param from select_field_from_input_data now that this concept has moved to
- Fix bug in remove_unmatched_array_paths to loop through arrays in reverse to remove elements.
…some methods out into their own python modules. Fix type annotations.
…-to-end using new mongo dataset and mongo initialized data.
… only matched embedded documents and array values are used to build subsequent queries. I was previously doing this at the end all at once, when filtering results before returning to the user, but in some cases, that would be too late. Embedded documents that didn't match, for example, would be used to locate records in subsequent tables, causing us to potentially over-select data that wasn't relevant to the given user. (This piece will be adjusted to be configurable too.)

This also means I don't have to cache the inputs to the collection, since I'm using the inputs directly after I run an access request on a node.
…sier to review - move consolidate_query_matches into its own module.
…ted object and array data and expand postman collection to include more of the mongo array edge cases that we use in the mongo_example_test_dataset.yml file.
…n, so erasures can be automatically performed. Change the default erasure to use [email protected] to parallel the access request.
# Conflicts:
#	docs/fidesops/docs/postman/Fidesops.postman_collection.json
Comment on lines 240 to 252

```python
for row in output:
    # In code, filter out non-matching sub-documents and array elements
    logger.info(
        f"Filtering row in {self.traversal_node.node.address} for matching array elements."
    )
    filter_element_match(
        row,
        {
            FieldPath.parse(field): inputs
            for field, inputs in formatted_input_data.items()
        },
    )
```
Contributor Author

@pattisdr pattisdr Feb 8, 2022


This is a primary change: I filter out non-matching sub-documents and array elements in code. This is also a function that we may make configurable (#193), although I think the default behavior makes sense, in that it views array data as nested documents, and only matched nested documents are used to find subsequent documents in other collections.
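As a rough, hypothetical illustration of the effect (made-up row data; FieldPath.parse is the same helper used in the snippet above):

```python
# Made-up row: filter_element_match mutates it in place so that only array
# elements (and embedded documents) matching the incoming query values remain.
row = {
    "emails": ["a@example.com", "b@example.com"],
    "seats": [{"flight_no": 101, "seat": "1A"}, {"flight_no": 202, "seat": "9C"}],
}
filter_element_match(
    row,
    {
        FieldPath.parse("emails"): ["a@example.com"],
        FieldPath.parse("seats.flight_no"): [101],
    },
)
assert row == {
    "emails": ["a@example.com"],
    "seats": [{"flight_no": 101, "seat": "1A"}],
}
```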

Comment on lines 420 to 453
```diff
-# Normalize nested data into a flat dataframe
-df: pd.DataFrame = pd.json_normalize(results, sep=".")
-# Only keep intersection of dataframe columns and target field paths
-df = df[
-    df.columns & [field_path.string_path for field_path in target_field_paths]
-]
-# Turn the filtered results back into a list of dictionaries
-filtered_flattened_results: List[Dict[str, Optional[Any]]] = df.to_dict(
-    orient="records"
-)
-for row in filtered_flattened_results:
-    # For each row, unflatten the dictionary
-    filtered_access_results[node_address].append(unflatten_dict(row))
+for row in results:
+    filtered_results: Dict[str, Any] = {}
+    for field_path in target_field_paths:
+        select_and_save_field(filtered_results, row, field_path)
+    remove_empty_objects(filtered_results)
+    filtered_access_results[node_address].append(filtered_results)
```
Contributor Author


This is another primary change: json_normalize doesn't do what we need for array data, so I added a recursive select_and_save_field method that incrementally builds a new result with just the matched data categories on scalar fields, dictionary fields, array fields, or arrays of objects.
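For reference, a minimal sketch of what such a recursive selector could look like. This is my own simplified paraphrase, not the PR's implementation: it takes the path as a plain list of strings where the real method takes a FieldPath.

```python
from typing import Any, Dict, List

def select_and_save_field(base: Dict[str, Any], row: Any, path: List[str]) -> None:
    """Copy the value(s) at `path` in `row` into `base`, preserving dict/array shape."""
    if not path or not isinstance(row, dict) or path[0] not in row:
        return
    key, rest = path[0], path[1:]
    value = row[key]
    if not rest:
        base[key] = value  # leaf: copy the scalar, dict, or whole array
    elif isinstance(value, list):
        # Array of objects: recurse element-by-element so indices stay aligned
        base.setdefault(key, [{} for _ in value])
        for elem, slot in zip(value, base[key]):
            select_and_save_field(slot, elem, rest)
    elif isinstance(value, dict):
        base.setdefault(key, {})
        select_and_save_field(base[key], value, rest)

row = {"a": {"b": [{"c": 1, "d": 2}, {"c": 3, "d": 4}]}}
out: Dict[str, Any] = {}
select_and_save_field(out, row, ["a", "b", "c"])
assert out == {"a": {"b": [{"c": 1}, {"c": 3}]}}
```

Running it once per target path against the same base dict merges the selected fields; the empty placeholder dicts left behind for non-matching branches are what the remove_empty_objects call in the diff above would then prune.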

Comment on lines 169 to +174
```diff
 for foreign_field_path, local_field_path in field_mappings:
-    new_value = foreign_field_path.retrieve_from(row)
-    if new_value:
-        append(output, local_field_path.string_path, new_value)
+    new_values: List = consolidate_query_matches(
+        row=row, target_path=foreign_field_path
+    )
+    if new_values:
+        append(output, local_field_path.string_path, new_values)
```
Contributor Author


This is the final primary change. Before, we were using foreign_field_path.retrieve_from(row) to get the data from the record to pass into the next collection. However, array data doesn't necessarily have a direct path, and array data can have multiple matched paths, so we get the multiple matches and consolidate them into a single list.
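A sketch of the consolidation idea, under the assumption that the target path is a plain list of strings (the real call passes a FieldPath object):

```python
from typing import Any, List

def consolidate_query_matches(row: Any, target_path: List[str]) -> List[Any]:
    """Flatten every value that `target_path` points to in `row` into one list."""
    matches: List[Any] = []

    def _collect(value: Any, path: List[str]) -> None:
        if isinstance(value, list):
            for elem in value:  # fan out across array elements
                _collect(elem, path)
        elif not path:
            matches.append(value)  # reached a leaf value
        elif isinstance(value, dict) and path[0] in value:
            _collect(value[path[0]], path[1:])

    _collect(row, target_path)
    return matches

# One target path, multiple matched values spread across an array of dicts:
row = {"customer": {"emails": [{"value": "a@test.com"}, {"value": "b@test.com"}]}}
assert consolidate_query_matches(row, ["customer", "emails", "value"]) == [
    "a@test.com",
    "b@test.com",
]
```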

Comment on lines +50 to +70
```python
def refine_target_path(
    row: Dict[str, Any], target_path: List[str], only: List[Any]
) -> DetailedPath:  # Can also return a list of DetailedPaths if there are multiple matches.
    """
    Recursively modify the target_path to be more detailed path(s) to the referenced data. Instead of just strings,
    the path will be expanded to include indices where applicable.

    :param row: Record retrieved from a dataset
    :param target_path: A list of strings representing the path to the desired field.
    :param only: A list of values that were used to build the query.
    :return: A list or a list of lists containing more detailed path(s) to the data in "only". If there
        was just one path, we return one list.

    :Example:
    In this example, path ["A", "B", "C"] points to two elements that match values "F" or "G". We update
    the path to insert the indices to locate the appropriate value.

    refine_target_path({"A": {"B": [{"C": "D"}, {"C": "F"}, {"C": "G"}]}}, ["A", "B", "C"], only=["F", "G"])

    [["A", "B", 1, "C"], ["A", "B", 2, "C"]]
    """
```
Contributor Author

@pattisdr pattisdr Feb 9, 2022


I use a lot of recursion in this PR to crawl through arrays and dictionaries of unknown depth. I generally try to avoid recursive solutions in Python: the language doesn't have tail call optimization, there's usually a way to implement the same thing iteratively, recursion is limited to 1000 stack frames by default, and the language has a strong preference for clear, readable solutions. However, once arrays are introduced, a target_path may point to multiple values at unknown levels.

To help with readability, I've tried to add a lot of detail to the docstrings with examples. There may be ways to simplify the recursion, but a lot of these pieces were added to cover different array use cases (think of arrays of objects of objects, arrays of arrays, etc.). We're not officially supporting arrays of arrays yet, but I wanted these internal functions to keep them in mind.
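To make the recursion concrete, here is a minimal sketch that reproduces the docstring example above. It's my own simplified version, not the PR's code, and it glosses over type coercion and FieldPath handling:

```python
from typing import Any, Dict, List, Union

DetailedPath = List[Union[str, int]]

def refine_target_path(
    row: Dict[str, Any], target_path: List[str], only: List[Any]
) -> Union[DetailedPath, List[DetailedPath]]:
    """Simplified sketch: expand target_path with array indices of matched values."""

    def _refine(value: Any, path: List[str]) -> List[DetailedPath]:
        if isinstance(value, list):
            # Arrays fan out: try every element, recording its index in the path
            return [
                [i] + sub for i, elem in enumerate(value) for sub in _refine(elem, path)
            ]
        if not path:
            # Leaf: keep this branch only if the value matched a query input
            return [[]] if value in only else []
        if isinstance(value, dict) and path[0] in value:
            return [[path[0]] + sub for sub in _refine(value[path[0]], path[1:])]
        return []

    matches = _refine(row, target_path)
    return matches[0] if len(matches) == 1 else matches

assert refine_target_path(
    {"A": {"B": [{"C": "D"}, {"C": "F"}, {"C": "G"}]}}, ["A", "B", "C"], only=["F", "G"]
) == [["A", "B", 1, "C"], ["A", "B", 2, "C"]]
```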

@pattisdr pattisdr changed the title Add support for Array Access Requests in MongoDB [#146] Add support for Array Access Requests in MongoDB [#146][#147] Feb 9, 2022
This is an example traversal created from our test `postgres_example` and `mongo_test` datasets.
Multiple collections point to or from complex objects and arrays. See the `mongo_example_test_dataset.yml` for more information.

![Postgres and Mongo Query Traversal](../img/mongo_and_postgres_complex.png)
Contributor Author


Used @galvana's handy methods from his saas connector PR to make this diagram!

…rect type before using them to filter out array data. Data may have been coerced from one type to another to query a collection. For example, integer values returned from one collection were used to find corresponding string values on another collection. To filter out unmatched string values in the results, we need to likewise convert the inputs to strings.

- Reuse existing method "QueryConfig.typed_filtered_values" which requires a TraversalNode - shift this method and query_field_paths to the TraversalNode itself.
- Address a few docstring issues.
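A tiny, made-up illustration of why that coercion matters when filtering array data:

```python
# Integer matches from one collection won't equal their string counterparts in
# another, so unmatched-element filtering would incorrectly drop everything.
upstream_matches = [112]                      # ints returned from one collection
downstream_array = ["111", "112", "113"]      # the same ids stored as strings elsewhere
assert [v for v in downstream_array if v in upstream_matches] == []
coerced = [str(v) for v in upstream_matches]  # cast inputs to the target field's type first
assert [v for v in downstream_array if v in coerced] == ["112"]
```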
…delete both empty arrays and empty dicts which can have a cascading effect.
@pattisdr
Contributor Author

Addressing failing tests ^

Contributor

@seanpreston seanpreston left a comment


Thanks for the hard work here @pattisdr — in particular shout out to the elegant recursive methods, and nicely organised tasks (and their corresponding tests)

@seanpreston seanpreston merged commit a321ff6 into main Feb 16, 2022
@seanpreston seanpreston deleted the fidesops_146_array_support branch February 16, 2022 18:16
@pattisdr
Contributor Author

thank you for reviewing @seanpreston

sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* Start expanding initial mongo data populated and the mongo example dataset with representative nested examples.

* Instead of using pandas json normalize to only retrieve data categories we care about,  add new method select_field_from_input_data that can use a FieldPath to select data we care about from input_data and add it to the "base" dictionary.  This method takes into account that a FieldPath may point to data within arrays.

* Cache the inputs that were used to locate records on each collection in redis, to be used to filter privacy request data after the fact. For example, an email from one collection might have been used to find records in another collection where an email was located in an array. Only this matched email in the array may be the relevant data the user wants to see.

* Use inputs into the collection to potentially filter array data to only contain matched values. Also modify get_collection_inputs to take a privacy_request_id string, instead of the entire object for easier testing.

Say we looked up records on a collection that has at least one array containing the passenger_id 112, so the input is "passenger_ids": [112]. This returned a row with passenger_ids: [111, 112, 113, 114]. Default behavior is to just return the matched field, "passenger_ids": [112].

* Update `to_dask_input_data` to consolidate array outputs and outputs from arrays of dicts into a single array to feed into subsequent collections (flatten_and_merge_matches/append).

Restore original behavior that if a FieldPath is not found in input data, we don't return an empty dict (add new strip_empty_dicts function).

* Don't delete empty dicts out of arrays -

* Uncomment more complex mongo dataset annotations and add more detailed tests on GraphTask.to_dask_input function around more complicated nested object and array structures to verify how data is being consolidated and passed into downstream collections.

- Make the customer_details.comments (array of strings) field structure more complicated: customer_details.comments.comment_id (array of objects), which points to the mongo_test.conversations.thread.comment field

* Add draft of build_incoming_refined_target_paths to recursively expand incoming target paths to contain indices where applicable.

* First draft of adding method to remove embedded documents and array indices where incoming field paths do not match.

* Before filtering results on data category, first run "remove_unmatched_array_paths" to remove unmatched embedded documents in arrays or array indices that did not match incoming field paths.

- Remove "only" param from select_field_from_input_data now that this concept has moved to
- Fix bug in remove_unmatched_array_paths to loop through arrays in reverse to remove elements.

* First cleanup round, reorganize/rename newly added methods, breaking some methods out into their own python modules.  Fix type annotations.

* Add more detailed tests on inner components of refine_target_path and filter_element_match.

* Add some more integration tests on accessing array data in mongo, end-to-end using new mongo dataset and mongo initialized data.

* Refactor so "filter_element_match" happens after each access request, so that only matched embedded documents and array values are used to build subsequent queries. I was previously doing this at the end all at once, when filtering results before returning to the user, but in some cases, that would be too late. Embedded documents that didn't match, for example, would be used to locate records in subsequent tables, causing us to potentially over-select data that wasn't relevant to the given user. (This piece will be adjusted to be configurable too.)

This also means I don't have to cache the inputs to the collection, since I'm using the inputs directly after I run an access request on a node.

* Move filter_data_categories back to "graph_task.py" so the diff is easier to review - move consolidate_query_matches into its own module.

* Update quickstart expected results from access request to include nested object and array data and expand postman collection to include more of the mongo array edge cases that we use in the mongo_example_test_dataset.yml file.

* Give the postgres and mongo connection configs write access in postman, so erasures can be automatically performed.   Change the default erasure to use [email protected] to parallel the access request.

* Add logging for debugging purposes.

* Add guides for working with complex data (move nested object docs, and add new array docs).

* Fix failing test - (CI is incorrectly showing green).

* Address bug related to type coercion. Cast incoming values to the correct type before using them to filter out array data. Data may have been coerced from one type to another to query a collection. For example, integer values returned from one collection were used to find corresponding string values on another collection. To filter out unmatched string values in the results, we need to likewise convert the inputs to strings.

- Reuse existing method "QueryConfig.typed_filtered_values" which requires a TraversalNode - shift this method and query_field_paths to the TraversalNode itself.
- Address a few docstring issues.

* Rename remove_empty_objects to remove_empty_containers and use it to delete both empty arrays and empty dicts which can have a cascading effect.

* Turn all customer ids into integers on mongo collections.

* Rephrase docstring.