Skip to content
This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

[Array + JSON Handling] Make Element Matching Configurable for Arrays #193

Closed
pattisdr opened this issue Feb 8, 2022 · 3 comments · Fixed by #229
Closed

[Array + JSON Handling] Make Element Matching Configurable for Arrays #193

pattisdr opened this issue Feb 8, 2022 · 3 comments · Fixed by #229
Assignees
Labels
enhancement New feature or request

Comments

@pattisdr
Copy link
Contributor

pattisdr commented Feb 8, 2022

Is your feature request related to a specific problem?

This is being split off of #146, as that PR is getting large. This also might be worth postponing addressing until we do erasures on arrays.

In array use case discussions, it was brought up that the user might want to configure some array behavior. Specifically, introducing array support adds some ambiguity into how queries should be built. A query on an array field (+ field is a reference field or contains identity data) may match multiple indices in an array, or multiple subdocuments inside the array. The default behavior I've built into #146 is that only matched array indices are returned on a document, and only those matched values are used to make subsequent queries. This is a more conservative approach, in that it selects less/masks less, but may not cover all user's data structures.

(This is separate from the after-the-fact filtering we do on data categories. This is related to what pieces of array data we return and/or use for other queries).

Describe the solution you'd like

Allow the user to configure in the dataset yaml that all values in the discovered array field should be returned and/or used to make subsequent queries, if they don't want the default. Specifically there's a filter_element_match method that we run (we do this in code, not at the database level) that removes non matched elements from arrays, but we could be more targeted as to which fields we run this on.

Describe alternatives you've considered, if any

A description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

@pattisdr pattisdr added the enhancement New feature or request label Feb 8, 2022
@pattisdr pattisdr self-assigned this Feb 8, 2022
@pattisdr
Copy link
Contributor Author

pattisdr commented Feb 9, 2022

For example, say for the "people" collection we run a query that says "return records where users.id = 1".
untitled (1)
That query returns the first record, {"users": ["id": 1, "ccn": 12345, "name": "Dawn"}, {"id": 2, "ccn": 54321", "name": "Jane"}], "employees": []} because it contains the id:1 somewhere in there. Say we're supposed to use the ccn here to lookup records in the credit card collection. Because this is an array, I filter out the non-matched elements: {"users": ["id": 1, "ccn": 12345, "name": "Dawn"}], "employees": []} , save that as the result for the people collection, and then pass this data to the credit card collection to only search for credit cards where the id is 12345.

This is the current default behavior I've added to #194 and the ticket here is to potentially allow the user to say, no actually I want everything in that array, both ccn: 12345 and ccn:54321 to be used to lookup records in the credit card table and I want you to return the entire array as the result for the people collection too.

@iamkelllly
Copy link
Contributor

Specifically there's a filter_element_match method that we run (we do this in code, not at the database level) that removes non matched elements from arrays, but we could be more targeted as to which fields we run this on.

I think it makes a lot of sense to have a config per field element to determine whether the adopter wants to filter out any non-applicable array elements for the next query vs keep them for using to query.

Would the config be something like this?
use_entire_array_to_query=true to use the full array to query the next location
use_entire_array_to_query=false to remove non-matched elements in the array when querying the next location
(obviously we need to select a different attribute name)

dataset:
  - fides_key: mongo_test
    name: Mongo Example Test Dataset
    description: Example of a Mongo dataset that contains 'details' about customers defined in the 'postgres_example_test_dataset'
    collections:
      - name: people_collection
        fields:
          - name: _id
            data_categories: [system.operations]
            fidesops_meta:
              primary_key: True
          - name: ccn
            data_categories: [user.provided.financial.accountnumber]
            fidesops_meta:
              data_type: string[]
              use_entire_array_to_query: true
              references:
                - dataset: mongo_test
                  field: credit_card_collection.id
                  direction: from
          - name: name
            data_categories: [user.provided.identifiable.name]
            fidesops_meta:
              data_type: string
      - name: credit_card_collection
        fields:
          - name: _id
            data_categories: [system.operations]
            fidesops_meta:
              primary_key: True
          - name: zip
          - name: balance
...

(forgive the broken/bad yaml)

@pattisdr
Copy link
Contributor Author

pattisdr commented Feb 9, 2022

OK great @iamkelllly, that's the main thing I needed, that this configuration option still makes sense in light of framing it as a "query concern" rather than a "data category" concern. One major clarification, this potentially affects both what data is returned from the current node, as well as what data is passed onto other nodes.

I think something like what you're spelling out will work. Importantly, this annotation is only applicable on array fields that are also referenced fields or identity fields, but perhaps this is something I just validate when it's created so it's clear. I can't really just stick it in references or just stick it with identity data or just stick it with array annotations since those are in three separate spots.

@iamkelllly iamkelllly added this to the Fidesops 1.3.0 milestone Feb 10, 2022
@pattisdr pattisdr linked a pull request Feb 22, 2022 that will close this issue
4 tasks
seanpreston pushed a commit that referenced this issue Mar 1, 2022
* Add a dataset config option to specify whether an array entry point field should return all elements or just the matching elements.  The default is to just return matched elements, but specifying "return_all_elements=true" will return every element.

- Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data.

* Add return_all_elements=True config option to docs.

* Adjust quickstart docs, now that example includes results from the rewards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True.

* Switch out connection config - test

* Fix missed items from rebase.

* Use yaml formatting in docs and use endswith to save minor time.
sanders41 pushed a commit that referenced this issue Sep 22, 2022
* Add a dataset config option to specify whether an array entry point field should return all elements or just the matching elements.  The default is to just return matched elements, but specifying "return_all_elements=true" will return every element.

- Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data.

* Add return_all_elements=True config option to docs.

* Adjust quickstart docs, now that example includes results from the rewards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True.

* Switch out connection config - test

* Fix missed items from rebase.

* Use yaml formatting in docs and use endswith to save minor time.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants