-
Notifications
You must be signed in to change notification settings - Fork 16
[Array + JSON Handling] Make Element Matching Configurable for Arrays #193
Comments
For example, say for the "people" collection we run a query that says "return records where users.id = 1". This is the current default behavior I've added to #194 and the ticket here is to potentially allow the user to say, no actually I want everything in that array, both ccn: 12345 and ccn:54321 to be used to lookup records in the credit card table and I want you to return the entire array as the result for the people collection too. |
I think it makes a lot of sense to have a config per field element to determine whether the adopter wants to filter out any non-applicable array elements for the next query vs keep them for using to query. Would the config be something like this? dataset:
- fides_key: mongo_test
name: Mongo Example Test Dataset
description: Example of a Mongo dataset that contains 'details' about customers defined in the 'postgres_example_test_dataset'
collections:
- name: people_collection
fields:
- name: _id
data_categories: [system.operations]
fidesops_meta:
primary_key: True
- name: ccn
data_categories: [user.provided.financial.accountnumber]
fidesops_meta:
data_type: string[]
use_entire_array_to_query: true
references:
- dataset: mongo_test
field: credit_card_collection.id
direction: from
- name: name
data_categories: [user.provided.identifiable.name]
fidesops_meta:
data_type: string
- name: credit_card_collection
fields:
- name: _id
data_categories: [system.operations]
fidesops_meta:
primary_key: True
- name: zip
- name: balance
...
(forgive the broken/bad yaml) |
OK great @iamkelllly, that's the main thing I needed, that this configuration option still makes sense in light of framing it as a "query concern" rather than a "data category" concern. One major clarification, this potentially affects both what data is returned from the current node, as well as what data is passed onto other nodes. I think something like what you're spelling out will work. Importantly, this annotation is only applicable on array fields that are also referenced fields or identity fields, but perhaps this is something I just validate when it's created so it's clear. I can't really just stick it in references or just stick it with identity data or just stick it with array annotations since those are in three separate spots. |
* Add a dataset config option to specify whether an array entry point field should return all elements or just the matching elements. The default is to just return matched elements, but specifying "return_all_elements=true" will return every element. - Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data. * Add return_all_elements=True config option to docs. * Adjust quickstart docs, now that example includes results from the rewards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True. * Switch out connection config - test * Fix missed items from rebase. * Use yaml formatting in docs and use endswith to save minor time.
* Add a dataset config option to specify whether an array entry point field should return all elements or just the matching elements. The default is to just return matched elements, but specifying "return_all_elements=true" will return every element. - Rename GraphTask.to_dask_input_data to GraphTask.pre_process_input_data now that there are two rounds of processing on input data. * Add return_all_elements=True config option to docs. * Adjust quickstart docs, now that example includes results from the rewards collection and adjust postman collection to include new mongo collections, some with example return_all_elements set to True. * Switch out connection config - test * Fix missed items from rebase. * Use yaml formatting in docs and use endswith to save minor time.
Is your feature request related to a specific problem?
This is being split off of #146, as that PR is getting large. This also might be worth postponing addressing until we do erasures on arrays.
In array use case discussions, it was brought up that the user might want to configure some array behavior. Specifically, introducing array support adds some ambiguity into how queries should be built. A query on an array field (+ field is a
reference
field or contains identity data) may match multiple indices in an array, or multiple subdocuments inside the array. The default behavior I've built into #146 is that only matched array indices are returned on a document, and only those matched values are used to make subsequent queries. This is a more conservative approach, in that it selects less/masks less, but may not cover all user's data structures.(This is separate from the after-the-fact filtering we do on data categories. This is related to what pieces of array data we return and/or use for other queries).
Describe the solution you'd like
Allow the user to configure in the dataset yaml that all values in the discovered array field should be returned and/or used to make subsequent queries, if they don't want the default. Specifically there's a
filter_element_match
method that we run (we do this in code, not at the database level) that removes non matched elements from arrays, but we could be more targeted as to which fields we run this on.Describe alternatives you've considered, if any
A description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: