-
Notifications
You must be signed in to change notification settings - Fork 16
Add Configuration Option for Entrypoint Array Querying [#193] #229
Conversation
if field.return_all_elements: | ||
# All data will be returned | ||
out[path] = None | ||
else: | ||
# Default behavior - we will filter values to match those in filtered | ||
cast_values = [ | ||
field.cast(v) for v in values | ||
] # Cast values to expected type where possible | ||
filtered = list(filter(lambda x: x is not None, cast_values)) | ||
if filtered: | ||
out[path] = filtered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Location of primary change - if return_all_elements
has been specified on an incoming array field (and/or propagated down to a sub_field), we specify no data to match, so data will not be filtered. Otherwise, we specify data to match, which is the default behavior, and only matched elements will be returned.
Will probably wait to get this up to date until #226 is merged because it's going to have to be rebased EDIT: Up-to-date! |
…ield should return all elements or just the matching elements. The default is to just return matched elements, but specifying "return_all_elements=true" will return every element. - Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data.
…wards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True.
f5f13de
to
9084c93
Compare
1. For example, a query on Collection A only matched indices 0 and 1 in an array. Only the data located at indices 0 and 1 are used to query data on dependent collection C. | ||
3. By default, if an array field is an entry point to a node, only matching indices in that array are considered, both for access and erasures, as well as for subsequent queries on dependent collections where applicable. | ||
1. For example, a query on Collection A only matched indices 0 and 1 in an array. Only the data located at indices 0 and 1 will be returned, and used to query data on dependent collection C. | ||
2. This can be overridden by specifying `return_all_elements=true` on an entrypoint array field, in which case, the query will return the entire array and mask the entire array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great description 👍 . Are we better off using yaml notation in the example we give, since that's consistent with the example datasets? ie. return_all_elements: true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah good point
src/fidesops/schemas/dataset.py
Outdated
if not meta_values: | ||
return meta_values | ||
|
||
is_array: bool = bool(meta_values.data_type and "[]" in meta_values.data_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit: I think meta_values.data_type.endswith("[]")
might be slightly more performant since it will start at the end of the string and not the start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great work @pattisdr, just one tweak
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I just tested it again, the yaml will be successfully transformed from false
to False
if that is explicitly specified, so does not cause the aforementioned issue.
thanks for the review @seanpreston, i'll get those comments addressed! |
Changes made @seanpreston! |
* Add a dataset config option to specify whether an array entry point field should return all elements or just the matching elements. The default is to just return matched elements, but specifying "return_all_elements=true" will return every element. - Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data. * Add return_all_elements=True config option to docs. * Adjust quickstart docs, now that example includes results from the rewards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True. * Switch out connection config - test * Fix missed items from rebase. * Use yaml formatting in docs and use endswith to save minor time.
Purpose
Add a configuration option for a user to specify they want all elements of an entrypoint array field to be returned and/or masked, instead of the default behavior, which is just to return matched array elements.
By "entrypoint" I mean that an array field is the referenced field, or the field that is used to locate records on the collection. If the array field is not a referenced field on the collection, this annotation is irrelevant.
Changes
return_all_elements=True
on an array entrypoint field, to assert that all array elements will be returned from queried documents. The existing default behavior is thatreturn_all_elements
is False, and only matched array elements are returned. (Suggestions welcome for better names thanreturn_all_elements
, this was the best I could come up with. Ideally it's a variable that means the same thing so the default is False).mongo_example_test_dataset.yml
wherereturn_all_elements=True
, one on an array of strings, and one on an array of objects. The sample dataset already has several examples of the default behaviorreturn_all_elements=False
.GraphTask.access_results_post_processing
method now that we have several things we need to do on returned data from a node.Examples
return_all_elements=True NEW
return_all_elements=True
may be appropriate for your collection if your data is organized such that all fields in an array field are relevant. In this example, I specified thereturn_all_elements=true
onemails
. Therefore, If we go to select records with emails that equal[email protected]
, the returned data will include the entire emails array, even though only one of the three elements matched:{"emails": ["[email protected]", "[email protected]", "[email protected]"], "employee_id": 1, "employee_username": "dawn1"}
. This is the data that is potentially filtered and /or masked from the users collection and is also the data we use to find records on other collections.Annotated Dataset
Sample data:
return_all_elements=False Existing Default Behavior
In this example, my data is arranged such that only matching sub-documents would be relevant. I can use the default behavior, which is
return_all_elements=False
. This does not need to be specified, and only returns matching elements. If I search the company collection for employees whose email is[email protected]
, only matching sub-documents are returned, not the entire employees array:Returned:
Only this subdocument is potentially used to locate records on other collections, and only information in this subdocument is potentially returned and masked.
Annotated Dataset
Sample data:
Checklist
Run Unsafe PR Checks
label has been applied, and checks have passed, if this PR touches any external servicesTicket
Fixes #193