[Array + JSON Handling] Make Element Matching Configurable for Arrays #193

pattisdr · 2022-02-08T22:49:36Z

Is your feature request related to a specific problem?

This is being split off of #146, as that PR is getting large. This also might be worth postponing addressing until we do erasures on arrays.

In array use case discussions, it was brought up that the user might want to configure some array behavior. Specifically, introducing array support adds some ambiguity into how queries should be built. A query on an array field (+ field is a reference field or contains identity data) may match multiple indices in an array, or multiple subdocuments inside the array. The default behavior I've built into #146 is that only matched array indices are returned on a document, and only those matched values are used to make subsequent queries. This is a more conservative approach, in that it selects less/masks less, but may not cover all user's data structures.

(This is separate from the after-the-fact filtering we do on data categories. This is related to what pieces of array data we return and/or use for other queries).

Describe the solution you'd like

Allow the user to configure in the dataset yaml that all values in the discovered array field should be returned and/or used to make subsequent queries, if they don't want the default. Specifically there's a filter_element_match method that we run (we do this in code, not at the database level) that removes non matched elements from arrays, but we could be more targeted as to which fields we run this on.

Describe alternatives you've considered, if any

A description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

The text was updated successfully, but these errors were encountered:

pattisdr · 2022-02-09T20:23:03Z

For example, say for the "people" collection we run a query that says "return records where users.id = 1".

That query returns the first record, {"users": ["id": 1, "ccn": 12345, "name": "Dawn"}, {"id": 2, "ccn": 54321", "name": "Jane"}], "employees": []} because it contains the id:1 somewhere in there. Say we're supposed to use the ccn here to lookup records in the credit card collection. Because this is an array, I filter out the non-matched elements: {"users": ["id": 1, "ccn": 12345, "name": "Dawn"}], "employees": []} , save that as the result for the people collection, and then pass this data to the credit card collection to only search for credit cards where the id is 12345.

This is the current default behavior I've added to #194 and the ticket here is to potentially allow the user to say, no actually I want everything in that array, both ccn: 12345 and ccn:54321 to be used to lookup records in the credit card table and I want you to return the entire array as the result for the people collection too.

iamkelllly · 2022-02-09T20:37:14Z

Specifically there's a filter_element_match method that we run (we do this in code, not at the database level) that removes non matched elements from arrays, but we could be more targeted as to which fields we run this on.

I think it makes a lot of sense to have a config per field element to determine whether the adopter wants to filter out any non-applicable array elements for the next query vs keep them for using to query.

Would the config be something like this?
use_entire_array_to_query=true to use the full array to query the next location
use_entire_array_to_query=false to remove non-matched elements in the array when querying the next location
(obviously we need to select a different attribute name)

dataset:
  - fides_key: mongo_test
    name: Mongo Example Test Dataset
    description: Example of a Mongo dataset that contains 'details' about customers defined in the 'postgres_example_test_dataset'
    collections:
      - name: people_collection
        fields:
          - name: _id
            data_categories: [system.operations]
            fidesops_meta:
              primary_key: True
          - name: ccn
            data_categories: [user.provided.financial.accountnumber]
            fidesops_meta:
              data_type: string[]
              use_entire_array_to_query: true
              references:
                - dataset: mongo_test
                  field: credit_card_collection.id
                  direction: from
          - name: name
            data_categories: [user.provided.identifiable.name]
            fidesops_meta:
              data_type: string
      - name: credit_card_collection
        fields:
          - name: _id
            data_categories: [system.operations]
            fidesops_meta:
              primary_key: True
          - name: zip
          - name: balance
...

(forgive the broken/bad yaml)

pattisdr · 2022-02-09T21:40:15Z

OK great @iamkelllly, that's the main thing I needed, that this configuration option still makes sense in light of framing it as a "query concern" rather than a "data category" concern. One major clarification, this potentially affects both what data is returned from the current node, as well as what data is passed onto other nodes.

I think something like what you're spelling out will work. Importantly, this annotation is only applicable on array fields that are also referenced fields or identity fields, but perhaps this is something I just validate when it's created so it's clear. I can't really just stick it in references or just stick it with identity data or just stick it with array annotations since those are in three separate spots.

* Add a dataset config option to specify whether an array entry point field should return all elements or just the matching elements. The default is to just return matched elements, but specifying "return_all_elements=true" will return every element. - Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data. * Add return_all_elements=True config option to docs. * Adjust quickstart docs, now that example includes results from the rewards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True. * Switch out connection config - test * Fix missed items from rebase. * Use yaml formatting in docs and use endswith to save minor time.

pattisdr added the enhancement New feature or request label Feb 8, 2022

pattisdr self-assigned this Feb 8, 2022

pattisdr mentioned this issue Feb 8, 2022

Add support for Array Access Requests in MongoDB [#146][#147] #194

Merged

2 tasks

iamkelllly added this to the Fidesops 1.3.0 milestone Feb 10, 2022

pattisdr mentioned this issue Feb 22, 2022

Add Configuration Option for Entrypoint Array Querying [#193] #229

Merged

4 tasks

pattisdr linked a pull request Feb 22, 2022 that will close this issue

Add Configuration Option for Entrypoint Array Querying [#193] #229

Merged

4 tasks

seanpreston closed this as completed in #229 Mar 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Array + JSON Handling] Make Element Matching Configurable for Arrays #193

[Array + JSON Handling] Make Element Matching Configurable for Arrays #193

pattisdr commented Feb 8, 2022 •

edited

Loading

pattisdr commented Feb 9, 2022 •

edited

Loading

iamkelllly commented Feb 9, 2022

pattisdr commented Feb 9, 2022

[Array + JSON Handling] Make Element Matching Configurable for Arrays #193

[Array + JSON Handling] Make Element Matching Configurable for Arrays #193

Comments

pattisdr commented Feb 8, 2022 • edited Loading

Is your feature request related to a specific problem?

Describe the solution you'd like

Describe alternatives you've considered, if any

Additional context

pattisdr commented Feb 9, 2022 • edited Loading

iamkelllly commented Feb 9, 2022

pattisdr commented Feb 9, 2022

pattisdr commented Feb 8, 2022 •

edited

Loading

pattisdr commented Feb 9, 2022 •

edited

Loading