This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

Add Configuration Option for Entrypoint Array Querying [#193] #229

Merged
merged 6 commits into from
Mar 1, 2022

Conversation

Contributor

@pattisdr pattisdr commented Feb 22, 2022

Purpose

Add a configuration option for a user to specify they want all elements of an entrypoint array field to be returned and/or masked, instead of the default behavior, which is just to return matched array elements.

By "entrypoint" I mean that an array field is the referenced field, or the field that is used to locate records on the collection. If the array field is not a referenced field on the collection, this annotation is irrelevant.

Changes

  • Add a dataset configuration option to specify return_all_elements=True on an array entrypoint field, so that all array elements are returned from queried documents. By default, return_all_elements is False and only matched array elements are returned. (Suggestions welcome for a better name than return_all_elements; this was the best I could come up with. Ideally any replacement keeps the same meaning, so that the default stays False.)
  • Adds validation so that this option can only be set on array fields. If the array field is not also a referenced field, the option is ignored.
  • Adds two examples to mongo_example_test_dataset.yml where return_all_elements=True, one on an array of strings, and one on an array of objects. The sample dataset already has several examples of the default behavior return_all_elements=False.
  • Adds new GraphTask.access_results_post_processing method now that we have several things we need to do on returned data from a node.
  • Adds the updated dataset to the postman collection
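
The validation rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: FieldMeta and validate_return_all_elements are hypothetical names, and the sketch assumes that an explicitly set value on a non-array field is rejected, whether it is True or False.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldMeta:
    """Hypothetical stand-in for a field's fidesops_meta block."""

    data_type: Optional[str] = None
    return_all_elements: Optional[bool] = None


def validate_return_all_elements(meta: FieldMeta) -> FieldMeta:
    """Reject return_all_elements on anything that is not an array field."""
    # Array data types in the sample datasets end in "[]" (string[], object[]).
    is_array = bool(meta.data_type and meta.data_type.endswith("[]"))
    # Assumption: any explicit value on a non-array field is treated as an error.
    if meta.return_all_elements is not None and not is_array:
        raise ValueError("return_all_elements can only be specified on array fields")
    return meta


# Accepted: the option is set on an array field.
array_meta = validate_return_all_elements(
    FieldMeta(data_type="string[]", return_all_elements=True)
)
```

Leaving the option unset on a scalar field passes validation unchanged, which matches the default of not having to specify it at all.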

Examples

return_all_elements=True NEW

return_all_elements=True may be appropriate for your collection if your data is organized such that all elements in an array field are relevant. In this example, I specified return_all_elements: true on emails. Therefore, if we select records with emails equal to [email protected], the returned data will include the entire emails array, even though only one of the three elements matched: {"emails": ["[email protected]", "[email protected]", "[email protected]"], "employee_id": 1, "employee_username": "dawn1"}. This is the data that is potentially filtered and/or masked from the users collection, and it is also the data we use to find records on other collections.

Annotated Dataset

dataset:
  - fides_key: mongo_return_all_elements_true_example
    name:  Mongo Collection
    description: Sample Mongo Dataset
    collections:
      - name: users
        fields:
          - name: _id
            data_categories: [system.operations]
            fidesops_meta:
              primary_key: True
          - name: emails
            data_categories: [ user.provided.identifiable.contact.email ]
            fidesops_meta:
              data_type: string[]
              identity: email
              return_all_elements: true
          - name: employee_id
            data_categories: [system.operations]
          - name: employee_username
            data_categories: [ user.provided.identifiable ]

Sample data:

[
   {
      "_id": "ffdlllooolds",
      "emails":[
         "[email protected]",
         "[email protected]",
         "[email protected]"
      ],
      "employee_id":1,
      "employee_username":"dawn1"
   },
   {
      "_id": "dapdowwkd",
      "emails":[
         "[email protected]",
         "[email protected]"
      ],
      "employee_id":2,
      "employee_username":"gracem"
   }
]
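
To make the difference concrete for an array of strings, here is a small sketch of the two behaviors. It does not reproduce the fidesops query path: select_array_elements is a hypothetical helper, and the addresses are placeholders rather than the values in the sample data.

```python
from typing import Any, Dict, List


def select_array_elements(
    document: Dict[str, Any],
    field: str,
    targets: List[Any],
    return_all_elements: bool = False,
) -> Dict[str, Any]:
    """Return a copy of the document with the array field narrowed as configured."""
    result = dict(document)
    if not return_all_elements:
        # Default: keep only the elements that matched the lookup values.
        result[field] = [value for value in document[field] if value in targets]
    return result


# Placeholder addresses, used only for illustration.
user = {
    "emails": ["first@example.com", "second@example.com", "third@example.com"],
    "employee_id": 1,
    "employee_username": "dawn1",
}

# return_all_elements=True: the whole emails array comes back.
everything = select_array_elements(
    user, "emails", ["first@example.com"], return_all_elements=True
)
# Default behavior: only the matching address comes back.
matched_only = select_array_elements(user, "emails", ["first@example.com"])
```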

return_all_elements=False Existing Default Behavior

In this example, my data is arranged such that only matching sub-documents would be relevant. I can use the default behavior, which is return_all_elements=False. This does not need to be specified, and only returns matching elements. If I search the company collection for employees whose email is [email protected], only matching sub-documents are returned, not the entire employees array:

Returned:

{
   "_id": "sect_1",
   "section":"engineering",
   "employees":[
      {
         "email":"[email protected]",
         "employee_id":1,
         "employee_username":"dawn1"
      }
   ]
}

Only this subdocument is potentially used to locate records on other collections, and only information in this subdocument is potentially returned and masked.
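
For an array of objects, the default behavior amounts to keeping only the sub-documents whose nested field matched, and then using just those sub-documents for any follow-up lookups. The sketch below illustrates that idea under stated assumptions: filter_subdocuments is a hypothetical helper, and the addresses are placeholders.

```python
from typing import Any, Dict, List


def filter_subdocuments(
    document: Dict[str, Any],
    array_field: str,
    key: str,
    targets: List[Any],
) -> Dict[str, Any]:
    """Keep only the sub-documents whose `key` value is one of the targets."""
    result = dict(document)
    result[array_field] = [
        sub for sub in document[array_field] if sub.get(key) in targets
    ]
    return result


# Placeholder addresses, used only for illustration.
section = {
    "_id": "sect_1",
    "section": "engineering",
    "employees": [
        {"email": "first@example.com", "employee_id": 1, "employee_username": "dawn1"},
        {"email": "second@example.com", "employee_id": 2, "employee_username": "gracem"},
    ],
}

returned = filter_subdocuments(section, "employees", "email", ["first@example.com"])

# Only values from the surviving sub-documents would feed queries on other collections.
downstream_ids = [sub["employee_id"] for sub in returned["employees"]]
```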

Annotated Dataset

dataset:
  - fides_key: mongo_return_all_elements_false_example
    name:  Mongo Collection
    description: Sample Mongo Dataset
    collections:
      - name: company
        fields:
          - name: _id
            data_categories: [system.operations]
            fidesops_meta:
              primary_key: True
          - name: section
          - name: employees
            fidesops_meta:
              data_type: object[]
            fields:
              - name: email
                data_categories: [ user.provided.identifiable.contact.email ]
                fidesops_meta:
                  identity: email
              - name: employee_id
                data_categories: [system.operations]
              - name: employee_username
                data_categories: [ user.provided.identifiable ]
         

Sample data:

[
   {
      "_id": "sect_1",
      "section":"engineering",
      "employees":[
         {
            "email":"[email protected]",
            "employee_id":1,
            "employee_username":"dawn1"
         },
         {
            "email":"[email protected]",
            "employee_id":2,
            "employee_username":"gracem"
         }
      ]
   },
   {
      "_id": "sect_2",
      "section":"sales",
      "employees":[
         {
            "email":"[email protected]",
            "employee_id":3,
            "employee_username":"jimmyd"
         },
         {
            "email":"[email protected]",
            "employee_id":4,
            "employee_username":"johnf"
         }
      ]
   }
]

Checklist

  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram)
  • Good unit test/integration test coverage
  • This PR contains a DB migration. If checked, the reviewer should confirm with the author that the down_revision correctly references the previous migration before merging
  • The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

Ticket

Fixes #193

@pattisdr pattisdr linked an issue Feb 22, 2022 that may be closed by this pull request
@pattisdr pattisdr added the run unsafe ci checks Triggers running of unsafe CI checks label Feb 22, 2022
@pattisdr pattisdr changed the title Configure Entrypoint Array Querying [#193] Add Configuration Option for Entrypoint Array Querying [#193] Feb 22, 2022
Comment on lines +260 to +270
if field.return_all_elements:
    # All data will be returned
    out[path] = None
else:
    # Default behavior - we will filter values to match those in filtered
    cast_values = [
        field.cast(v) for v in values
    ]  # Cast values to expected type where possible
    filtered = list(filter(lambda x: x is not None, cast_values))
    if filtered:
        out[path] = filtered
Contributor Author

@pattisdr pattisdr Feb 22, 2022

Location of primary change - if return_all_elements has been specified on an incoming array field (and/or propagated down to a sub_field), we specify no data to match, so data will not be filtered. Otherwise, we specify data to match, which is the default behavior, and only matched elements will be returned.
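
As a self-contained illustration of that idea, the sketch below builds a per-path "data to match" map where None means "do not filter", then applies it to a returned row. ArrayField, to_int_or_none, build_match_map and apply_match_map are hypothetical names for this example only; the real change lives inside GraphTask and is only excerpted above.

```python
from typing import Any, Callable, Dict, List, Optional


class ArrayField:
    """Hypothetical minimal field: a cast function plus the new flag."""

    def __init__(self, cast: Callable[[Any], Any], return_all_elements: bool = False):
        self.cast = cast
        self.return_all_elements = return_all_elements


def to_int_or_none(value: Any) -> Optional[int]:
    """Cast to int where possible, otherwise signal failure with None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def build_match_map(
    fields: Dict[str, ArrayField], incoming: Dict[str, List[Any]]
) -> Dict[str, Optional[List[Any]]]:
    """Map each path to the values to match, or None to keep every element."""
    out: Dict[str, Optional[List[Any]]] = {}
    for path, values in incoming.items():
        field = fields[path]
        if field.return_all_elements:
            out[path] = None  # no data to match, so nothing is filtered out
        else:
            cast_values = [field.cast(v) for v in values]
            filtered = [v for v in cast_values if v is not None]
            if filtered:
                out[path] = filtered
    return out


def apply_match_map(
    row: Dict[str, List[Any]], match_map: Dict[str, Optional[List[Any]]]
) -> Dict[str, List[Any]]:
    """Narrow array fields to matched elements unless the map entry is None."""
    result = dict(row)
    for path, targets in match_map.items():
        if targets is not None:
            result[path] = [v for v in row[path] if v in targets]
    return result


fields = {
    "ids": ArrayField(to_int_or_none),
    "codes": ArrayField(str, return_all_elements=True),
}
match_map = build_match_map(fields, {"ids": ["1", "x", 3], "codes": ["a"]})
narrowed = apply_match_map({"ids": [1, 2, 3], "codes": ["a", "b"]}, match_map)
```

Using None as the "no filter" marker keeps the default path untouched: a path with a list is filtered exactly as before, and only flagged fields skip the filtering step.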

@pattisdr
Contributor Author

pattisdr commented Feb 23, 2022

Will probably wait to get this up to date until #226 is merged because it's going to have to be rebased

EDIT: Up-to-date!

Base automatically changed from fidesops_129_mongodb_array_mask to main February 25, 2022 21:48
…ield should return all elements or just the matching elements. The default is to just return matched elements, but specifying "return_all_elements=true" will return every element.

- Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data.
…wards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True.
@pattisdr pattisdr force-pushed the fidesops_193_array_matching_config branch from f5f13de to 9084c93 Compare February 25, 2022 21:59
@pattisdr pattisdr added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Feb 25, 2022
@seanpreston seanpreston self-assigned this Feb 28, 2022
1. For example, a query on Collection A only matched indices 0 and 1 in an array. Only the data located at indices 0 and 1 are used to query data on dependent collection C.
3. By default, if an array field is an entry point to a node, only matching indices in that array are considered, both for access and erasures, as well as for subsequent queries on dependent collections where applicable.
1. For example, a query on Collection A only matched indices 0 and 1 in an array. Only the data located at indices 0 and 1 will be returned, and used to query data on dependent collection C.
2. This can be overridden by specifying `return_all_elements=true` on an entrypoint array field, in which case, the query will return the entire array and mask the entire array.
Contributor

This is a great description 👍 . Are we better off using yaml notation in the example we give, since that's consistent with the example datasets? i.e., return_all_elements: true

Contributor Author

ah good point

if not meta_values:
    return meta_values

is_array: bool = bool(meta_values.data_type and "[]" in meta_values.data_type)
Contributor

small nit: I think meta_values.data_type.endswith("[]") might be slightly more performant since it will start at the end of the string and not the start

Contributor Author

will change!
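
For reference, a quick sketch of the two checks discussed in this thread. Apart from any speed difference, endswith is also the stricter test, since a substring check accepts "[]" anywhere in the string. is_array_type is a hypothetical helper, and "[]string" is a made-up malformed value used only to show the difference.

```python
from typing import Optional


def is_array_type(data_type: Optional[str]) -> bool:
    """True when a data_type string such as "string[]" denotes an array."""
    return bool(data_type and data_type.endswith("[]"))


# Both checks agree on the data types used in the sample datasets.
agree_on_array = is_array_type("object[]") and ("[]" in "object[]")

# They disagree on a malformed value: the substring test accepts it.
substring_accepts_malformed = "[]" in "[]string"
endswith_accepts_malformed = is_array_type("[]string")
```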

Contributor

@seanpreston seanpreston left a comment

Really great work @pattisdr, just one tweak

Contributor

@seanpreston seanpreston left a comment

Update: I just tested it again. The yaml is successfully transformed from false to False if that is explicitly specified, so it does not cause the aforementioned issue.

@pattisdr
Contributor Author

thanks for the review @seanpreston, i'll get those comments addressed!

@pattisdr pattisdr added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Mar 1, 2022
@pattisdr
Contributor Author

pattisdr commented Mar 1, 2022

Changes made @seanpreston!

@seanpreston seanpreston merged commit a399abb into main Mar 1, 2022
@seanpreston seanpreston deleted the fidesops_193_array_matching_config branch March 1, 2022 03:41
sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* Add a dataset config option to specify whether an array entry point field should return all elements or just the matching elements.  The default is to just return matched elements, but specifying "return_all_elements=true" will return every element.

- Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data.

* Add return_all_elements=True config option to docs.

* Adjust quickstart docs, now that example includes results from the rewards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True.

* Switch out connection config - test

* Fix missed items from rebase.

* Use yaml formatting in docs and use endswith to save minor time.
Labels
run unsafe ci checks Triggers running of unsafe CI checks
Development

Successfully merging this pull request may close these issues.

[Array + JSON Handling] Make Element Matching Configurable for Arrays
2 participants