
Add support for Array Access Requests in MongoDB [#146][#147] #194

Merged
seanpreston merged 26 commits into main from fidesops_146_array_support on Feb 16, 2022

Conversation

Contributor

@pattisdr pattisdr commented Feb 8, 2022

👉 A lot of the diff is new tests/rearranged tests/new dataset annotations/docs, etc.

Purpose

Adds array support for access requests in MongoDB. Note that erasures on arrays are not addressed here and will be worked on in #129. A lot of the array-specific changes were made in code, so they will be more easily used with other types of db connections. You should not attempt to run erasures on collections with arrays yet. See the Complex Field guides here for an introduction to arrays.

Changes

Three primary pieces are involved in adding better array support:

1) Array queries should only return the matched indices or sub-documents. (I actually do this in code, not as a db query.) I treat arrays as embedded documents here (there is interest in making this configurable #193).

2) Consolidating array results from one collection to feed into a query on downstream collections. A target path to an array field can really be a target path to multiple results. These multiple results need to be flattened so they can be used to make a query on another collection.

3) Filtering access request results for data categories on array data. For example, array results need to be returned, or results within arrays of objects need to be returned.

More specifics below:

  • Expands example mongo dataset and initialized data in MongoDB to include more complex object and array examples
  • On graph_task.access_request, after we've run an access request on a collection, I use a new filter_element_match method to keep only matched array indices and matched objects embedded in referenced array fields, where applicable. I then save this modified data in the cache, and also pass it on to the next collection for use in building more queries downstream. I'm doing this in code so it will be more broadly applicable to other datastores besides MongoDB; plus, this would be an expensive query to run db-side.
  • In to_dask_input_data (which takes inputs from potentially multiple collections and formats them to pass into the current collection), run consolidate_query_matches to combine data from potentially multiple scalar values, arrays, or arrays of embedded objects into a flat list. The next collection will run a query to see if the input fields match any of the consolidated data.
  • Updates filter_data_categories (which selects relevant data categories from each record to return to the user) to be able to select matching data from arrays. We were using json_normalize after we added nested objects, but now we need something a little more custom for arrays. I'm using a new method select_and_save_field to recurse through both objects and arrays to get the targeted data.
  • Some recursive functions were added to support the above, since with arrays, a target path can point to multiple values at unknown depths.
  • Lots of tests, reorganization of existing tests, and tests of the individual recursion pieces. tests/integration_tests/test_mongo_task.py::test_array_querying_mongo has the most comprehensive end-to-end test checking different array use cases.

Checklist

  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram)
  • Good unit test/integration test coverage

Ticket

Fixes #146
Fixes #147

…es we care about, add new method select_field_from_input_data that can use a FieldPath to select data we care about from input_data and add it to the "base" dictionary. This method takes into account that a FieldPath may point to data within arrays.
…in redis, to be used to filter privacy request data after the fact. For example, an email from one collection might have been used to find records in another collection where an email was located in an array. Only this matched email in the array may be the relevant data the user wants to see.
…ly contain matched values. Also modify get_collection_inputs to take a privacy_request_id string, instead of the entire object for easier testing.

Say we looked up records on a collection that has at least one array containing the passenger_id 112, so the input is "passenger_ids": [112]. This returned a row with passenger_ids: [111, 112, 113, 114]. Default behavior is to just return the matched field, "passenger_ids": [112].
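A minimal sketch of that default behavior on plain data (hypothetical values; the actual filtering is done by filter_element_match, shown further down):

```python
# Hypothetical data mirroring the example above: only array elements that
# appear in the incoming query values are kept in the returned row.
query_values = [112]
row = {"passenger_ids": [111, 112, 113, 114]}
row["passenger_ids"] = [v for v in row["passenger_ids"] if v in query_values]
assert row == {"passenger_ids": [112]}
```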
…from arrays of dicts into a single array to feed into subsequent collections (flatten_and_merge_matches/append).

Restore original behavior that if a FieldPath is not found in input data, we don't return an empty dict (add new strip_empty_dicts function).
…d tests on GraphTask.to_dask_input function around more complicated nested object and array structures to verify how data is being consolidated and passed into downstream collections.

- Make the customer_details.comments (array of strings) field structure more complicated: customer_details.comments.comment_id (array of objects), which points to the mongo_test.conversations.thread.comment field
…d incoming target paths to contain indices where applicable.
…ndices where incoming field paths do not match.
…d_array_paths" to remove unmatched embedded documents in arrays or array indices that did not match incoming field paths.

- Remove "only" param from select_field_from_input_data now that this concept has moved to
- Fix bug in remove_unmatched_array_paths to loop through arrays in reverse to remove elements.
…some methods out into their own python modules. Fix type annotations.
…-to-end using new mongo dataset and mongo initialized data.
… only matched embedded documents and array values are used to build subsequent queries. I was previously doing this at the end all at once, when filtering results before returning to the user, but in some cases, that would be too late. Embedded documents that didn't match, for example, would be used to locate records in subsequent tables, causing us to potentially over-select data that wasn't relevant to the given user. (This piece will be adjusted to be configurable too.)

This also means I don't have to cache the inputs to the collection, since I'm using the inputs directly after I run an access request on a node.
…sier to review - move consolidate_query_matches into its own module.
…ted object and array data and expand postman collection to include more of the mongo array edge cases that we use in the mongo_example_test_dataset.yml file.
…n, so erasures can be automatically performed. Change the default erasure to use [email protected] to parallel the access request.
# Conflicts:
#	docs/fidesops/docs/postman/Fidesops.postman_collection.json
Comment on lines 240 to 252

```python
for row in output:
    # In code, filter out non-matching sub-documents and array elements
    logger.info(
        f"Filtering row in {self.traversal_node.node.address} for matching array elements."
    )
    filter_element_match(
        row,
        {
            FieldPath.parse(field): inputs
            for field, inputs in formatted_input_data.items()
        },
    )
```
Contributor Author

@pattisdr pattisdr Feb 8, 2022


This is a primary change: I filter out non-matching sub-documents and array elements in code. This is also a function that we may make configurable (#193), although I think the default behavior makes sense, in that it views array data as nested documents, and only matched nested documents are used to find subsequent documents in other collections.
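As a rough, hypothetical illustration of the effect (made-up row data; FieldPath.parse is the same helper used in the snippet above):

```python
# Made-up row: filter_element_match mutates it in place so that only array
# elements (and embedded documents) matching the incoming query values remain.
row = {
    "emails": ["a@example.com", "b@example.com"],
    "seats": [{"flight_no": 101, "seat": "1A"}, {"flight_no": 202, "seat": "9C"}],
}
filter_element_match(
    row,
    {
        FieldPath.parse("emails"): ["a@example.com"],
        FieldPath.parse("seats.flight_no"): [101],
    },
)
assert row == {
    "emails": ["a@example.com"],
    "seats": [{"flight_no": 101, "seat": "1A"}],
}
```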

Comment on lines 420 to 453
```diff
-# Normalize nested data into a flat dataframe
-df: pd.DataFrame = pd.json_normalize(results, sep=".")
-# Only keep intersection of dataframe columns and target field paths
-df = df[
-    df.columns & [field_path.string_path for field_path in target_field_paths]
-]
-# Turn the filtered results back into a list of dictionaries
-filtered_flattened_results: List[Dict[str, Optional[Any]]] = df.to_dict(
-    orient="records"
-)
-for row in filtered_flattened_results:
-    # For each row, unflatten the dictionary
-    filtered_access_results[node_address].append(unflatten_dict(row))
+for row in results:
+    filtered_results: Dict[str, Any] = {}
+    for field_path in target_field_paths:
+        select_and_save_field(filtered_results, row, field_path)
+    remove_empty_objects(filtered_results)
+    filtered_access_results[node_address].append(filtered_results)
```
Contributor Author


This is another primary change: json_normalize doesn't do what we need for array data, so I added a recursive select_and_save_field method that incrementally builds a new result with just the matched data categories on scalar fields, dictionary fields, array fields, or arrays of objects.
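For reference, a minimal sketch of what such a recursive selector could look like. This is my own simplified paraphrase, not the PR's implementation: it takes the path as a plain list of strings where the real method takes a FieldPath.

```python
from typing import Any, Dict, List

def select_and_save_field(base: Dict[str, Any], row: Any, path: List[str]) -> None:
    """Copy the value(s) at `path` in `row` into `base`, preserving dict/array shape."""
    if not path or not isinstance(row, dict) or path[0] not in row:
        return
    key, rest = path[0], path[1:]
    value = row[key]
    if not rest:
        base[key] = value  # leaf: copy the scalar, dict, or whole array
    elif isinstance(value, list):
        # Array of objects: recurse element-by-element so indices stay aligned
        base.setdefault(key, [{} for _ in value])
        for elem, slot in zip(value, base[key]):
            select_and_save_field(slot, elem, rest)
    elif isinstance(value, dict):
        base.setdefault(key, {})
        select_and_save_field(base[key], value, rest)

row = {"a": {"b": [{"c": 1, "d": 2}, {"c": 3, "d": 4}]}}
out: Dict[str, Any] = {}
select_and_save_field(out, row, ["a", "b", "c"])
assert out == {"a": {"b": [{"c": 1}, {"c": 3}]}}
```

Running it once per target path against the same base dict merges the selected fields; the empty placeholder dicts left behind for non-matching branches are what the remove_empty_objects call in the diff above would then prune.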

Comment on lines 169 to +174
```diff
 for foreign_field_path, local_field_path in field_mappings:
-    new_value = foreign_field_path.retrieve_from(row)
-    if new_value:
-        append(output, local_field_path.string_path, new_value)
+    new_values: List = consolidate_query_matches(
+        row=row, target_path=foreign_field_path
+    )
+    if new_values:
+        append(output, local_field_path.string_path, new_values)
```
Contributor Author


This is the final primary change. Before, we were using foreign_field_path.retrieve_from(row) to get the data from the record to pass into the next collection. However, array data doesn't necessarily have a direct path, and array data can have multiple matched paths, so we get the multiple matches and consolidate them into a single list.
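A sketch of the consolidation idea, under the assumption that the target path is a plain list of strings (the real call passes a FieldPath object):

```python
from typing import Any, List

def consolidate_query_matches(row: Any, target_path: List[str]) -> List[Any]:
    """Flatten every value that `target_path` points to in `row` into one list."""
    matches: List[Any] = []

    def _collect(value: Any, path: List[str]) -> None:
        if isinstance(value, list):
            for elem in value:  # fan out across array elements
                _collect(elem, path)
        elif not path:
            matches.append(value)  # reached a leaf value
        elif isinstance(value, dict) and path[0] in value:
            _collect(value[path[0]], path[1:])

    _collect(row, target_path)
    return matches

# One target path, multiple matched values spread across an array of dicts:
row = {"customer": {"emails": [{"value": "a@test.com"}, {"value": "b@test.com"}]}}
assert consolidate_query_matches(row, ["customer", "emails", "value"]) == [
    "a@test.com",
    "b@test.com",
]
```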

Comment on lines +50 to +70
```python
def refine_target_path(
    row: Dict[str, Any], target_path: List[str], only: List[Any]
) -> DetailedPath:  # Can also return a list of DetailedPaths if there are multiple matches.
    """
    Recursively modify the target_path to be more detailed path(s) to the referenced data. Instead of just strings,
    the path will be expanded to include indices where applicable.

    :param row: Record retrieved from a dataset
    :param target_path: A list of strings representing the path to the desired field.
    :param only: A list of values that were used to build the query.
    :return: A list or a list of lists containing more detailed path(s) to the data in "only". If there
        was just one path, we return one list.

    :Example:
    In this example, path ["A", "B", "C"] points to two elements that match values "F" or "G". We update
    the path to insert the indices to locate the appropriate value.

    refine_target_path({"A": {"B": [{"C": "D"}, {"C": "F"}, {"C": "G"}]}}, ["A", "B", "C"], only=["F", "G"])

    [["A", "B", 1, "C"], ["A", "B", 2, "C"]]
    """
```
Contributor Author

@pattisdr pattisdr Feb 9, 2022


I use a lot of recursion in this PR to crawl through arrays and dictionaries of unknown depth. I generally try to avoid recursive solutions in Python: the language doesn't have tail call optimization, there's usually a way to implement the same thing iteratively, recursion is limited to 1000 stack frames by default, and the language has a strong preference for clear, readable solutions. However, once arrays are introduced, a target_path may point to multiple values at unknown levels.

To help with readability, I've tried to add a lot of detail to the docstrings with examples. There may be ways to simplify the recursion, but a lot of these pieces were added to cover different array use cases (think of arrays of objects of objects, arrays of arrays, etc.). We're not officially supporting arrays of arrays yet, but I wanted these internal functions to keep them in mind.
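To make the recursion concrete, here is a minimal sketch that reproduces the docstring example above. It's my own simplified version, not the PR's code, and it glosses over type coercion and FieldPath handling:

```python
from typing import Any, Dict, List, Union

DetailedPath = List[Union[str, int]]

def refine_target_path(
    row: Dict[str, Any], target_path: List[str], only: List[Any]
) -> Union[DetailedPath, List[DetailedPath]]:
    """Simplified sketch: expand target_path with array indices of matched values."""

    def _refine(value: Any, path: List[str]) -> List[DetailedPath]:
        if isinstance(value, list):
            # Arrays fan out: try every element, recording its index in the path
            return [
                [i] + sub for i, elem in enumerate(value) for sub in _refine(elem, path)
            ]
        if not path:
            # Leaf: keep this branch only if the value matched a query input
            return [[]] if value in only else []
        if isinstance(value, dict) and path[0] in value:
            return [[path[0]] + sub for sub in _refine(value[path[0]], path[1:])]
        return []

    matches = _refine(row, target_path)
    return matches[0] if len(matches) == 1 else matches

assert refine_target_path(
    {"A": {"B": [{"C": "D"}, {"C": "F"}, {"C": "G"}]}}, ["A", "B", "C"], only=["F", "G"]
) == [["A", "B", 1, "C"], ["A", "B", 2, "C"]]
```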

@pattisdr pattisdr changed the title Add support for Array Access Requests in MongoDB [#146] Add support for Array Access Requests in MongoDB [#146][#147] Feb 9, 2022
This is an example traversal created from our test `postgres_example` and `mongo_test` datasets.
Multiple collections point to or from complex objects and arrays. See the `mongo_example_test_dataset.yml` for more information.

![Postgres and Mongo Query Traversal](../img/mongo_and_postgres_complex.png)
Contributor Author


Used @galvana's handy methods from his saas connector PR to make this diagram!

…rect type before using them to filter out array data. Data may have been coerced from one type to another to query a collection. For example, integer values returned from one collection were used to find corresponding string values on another collection. To filter out unmatched string values in the results, we need to likewise convert the inputs to strings.

- Reuse existing method "QueryConfig.typed_filtered_values" which requires a TraversalNode - shift this method and query_field_paths to the TraversalNode itself.
- Address a few docstring issues.
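A tiny, made-up illustration of why that coercion matters when filtering array data:

```python
# Integer matches from one collection won't equal their string counterparts in
# another, so unmatched-element filtering would incorrectly drop everything.
upstream_matches = [112]                      # ints returned from one collection
downstream_array = ["111", "112", "113"]      # the same ids stored as strings elsewhere
assert [v for v in downstream_array if v in upstream_matches] == []
coerced = [str(v) for v in upstream_matches]  # cast inputs to the target field's type first
assert [v for v in downstream_array if v in coerced] == ["112"]
```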
…delete both empty arrays and empty dicts which can have a cascading effect.
@pattisdr
Contributor Author

Addressing failing tests ^

Contributor

@seanpreston seanpreston left a comment


Thanks for the hard work here @pattisdr — in particular shout out to the elegant recursive methods, and nicely organised tasks (and their corresponding tests)

@seanpreston seanpreston merged commit a321ff6 into main Feb 16, 2022
@seanpreston seanpreston deleted the fidesops_146_array_support branch February 16, 2022 18:16
@pattisdr
Contributor Author

thank you for reviewing @seanpreston

sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* Start expanding initial mongo data populated and the mongo example dataset with representative nested examples.

* Instead of using pandas json normalize to only retrieve data categories we care about,  add new method select_field_from_input_data that can use a FieldPath to select data we care about from input_data and add it to the "base" dictionary.  This method takes into account that a FieldPath may point to data within arrays.

* Cache the inputs that were used to locate records on each collection in redis, to be used to filter privacy request data after the fact. For example, an email from one collection might have been used to find records in another collection where an email was located in an array. Only this matched email in the array may be the relevant data the user wants to see.

* Use inputs into the collection to potentially filter array data to only contain matched values. Also modify get_collection_inputs to take a privacy_request_id string, instead of the entire object for easier testing.

Say we looked up records on a collection that has at least one array containing the passenger_id 112, so the input is "passenger_ids": [112]. This returned a row with passenger_ids: [111, 112, 113, 114]. Default behavior is to just return the matched field, "passenger_ids": [112].

* Update `to_dask_input_data` to consolidate array outputs and outputs from arrays of dicts into a single array to feed into subsequent collections (flatten_and_merge_matches/append).

Restore original behavior that if a FieldPath is not found in input data, we don't return an empty dict (add new strip_empty_dicts function).

* Don't delete empty dicts out of arrays -

* Uncomment more complex mongo dataset annotations and add more detailed tests on GraphTask.to_dask_input function around more complicated nested object and array structures to verify how data is being consolidated and passed into downstream collections.

- Make the customer_details.comments (array of strings) field structure more complicated: customer_details.comments.comment_id (array of objects), which points to the mongo_test.conversations.thread.comment field

* Add draft of build_incoming_refined_target_paths to recursively expand incoming target paths to contain indices where applicable.

* First draft of adding method to remove embedded documents and array indices where incoming field paths do not match.

* Before filtering results on data category, first run "remove_unmatched_array_paths" to remove unmatched embedded documents in arrays or array indices that did not match incoming field paths.

- Remove "only" param from select_field_from_input_data now that this concept has moved to
- Fix bug in remove_unmatched_array_paths to loop through arrays in reverse to remove elements.

* First cleanup round, reorganize/rename newly added methods, breaking some methods out into their own python modules.  Fix type annotations.

* Add more detailed tests on inner components of refine_target_path and filter_element_match.

* Add some more integration tests on accessing array data in mongo, end-to-end using new mongo dataset and mongo initialized data.

* Refactor so "filter_element_match" happens after each access request, so that only matched embedded documents and array values are used to build subsequent queries. I was previously doing this at the end all at once, when filtering results before returning to the user, but in some cases, that would be too late. Embedded documents that didn't match, for example, would be used to locate records in subsequent tables, causing us to potentially over-select data that wasn't relevant to the given user. (This piece will be adjusted to be configurable too.)

This also means I don't have to cache the inputs to the collection, since I'm using the inputs directly after I run an access request on a node.

* Move filter_data_categories back to "graph_task.py" so the diff is easier to review - move consolidate_query_matches into its own module.

* Update quickstart expected results from access request to include nested object and array data and expand postman collection to include more of the mongo array edge cases that we use in the mongo_example_test_dataset.yml file.

* Give the postgres and mongo connection configs write access in postman, so erasures can be automatically performed.   Change the default erasure to use [email protected] to parallel the access request.

* Add logging for debugging purposes.

* Add guides for working with complex data (move nested object docs, and add new array docs).

* Fix failing test - (CI is incorrectly showing green).

* Address bug related to type coercion. Cast incoming values to the correct type before using them to filter out array data. Data may have been coerced from one type to another to query a collection. For example, integer values returned from one collection were used to find corresponding string values on another collection. To filter out unmatched string values in the results, we need to likewise convert the inputs to strings.

- Reuse existing method "QueryConfig.typed_filtered_values" which requires a TraversalNode - shift this method and query_field_paths to the TraversalNode itself.
- Address a few docstring issues.

* Rename remove_empty_objects to remove_empty_containers and use it to delete both empty arrays and empty dicts which can have a cascading effect.

* Turn all customer ids into integers on mongo collections.

* Rephrase docstring.