Conversation
return collections


def generate_fields(
We have a merge_datasets function, but we need to preserve the order of the fields, and we lose control of that as soon as we convert a simple dict to a FidesopsDataset - the fields are alphabetized due to default Pydantic functionality. pydantic/pydantic#593
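For illustration only, here is the kind of reordering we'd otherwise lose - a small helper that puts generated fields back into the order of the existing dataset's fields. This is not from the PR, and it assumes fields are plain dicts with a "name" key:

```python
from typing import Any, Dict, List


def order_fields_like(
    existing: List[Dict[str, Any]], generated: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Illustrative only: return the generated fields in the same order as the
    existing dataset's fields, appending any brand-new fields at the end."""
    existing_order = {field["name"]: i for i, field in enumerate(existing)}
    return sorted(
        generated, key=lambda f: existing_order.get(f["name"], len(existing_order))
    )
```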
        },
    ],
}
api_data = {}
would a better test here include an api_data object that has at least one of the existing collection elements placed in a different order than in existing_dataset?
@@ -0,0 +1,667 @@
from tests.test_helpers.dataset_utils import generate_collections, generate_dataset
nice job with these tests - they're a great starting point for understanding the expected functionality of the utilities.

one follow up - would it be feasible/useful to include more of an "integration" level test that actually tests and demonstrates how the utility can work on a real API response, i.e. using a saas provider we already integrate with as an example? that test would obviously need to be marked as an integration_saas test since it would require external communication, but just seems like it could be a nice-to-have. (if you think this is feasible, maybe we can split out a separate issue/PR for it, since i don't think it should necessarily block this PR)
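As a rough sketch of what that could look like - the example_saas_connector fixture, its run_access_request helper, and the generate_dataset call below are illustrative assumptions, not existing fidesops fixtures or signatures:

```python
import pytest

# assumed import path, based on the helpers added in this PR
from tests.test_helpers.dataset_utils import generate_dataset


@pytest.mark.integration_saas
def test_generate_dataset_from_live_response(example_saas_connector, existing_dataset):
    # the fixture and its run_access_request helper are hypothetical stand-ins for
    # however we'd execute a real access request against the provider
    api_data = example_saas_connector.run_access_request()

    # regenerate the dataset from the live response (signature assumed) and check
    # that the collections we maintain by hand are still represented
    generated = generate_dataset(existing_dataset, api_data)
    assert {c["name"] for c in generated["collections"]} == {
        c["name"] for c in existing_dataset["collections"]
    }
```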
# into generate_collections
generated_collections = generate_collections(
    {
        collection_name.replace(f"{existing_dataset['fides_key']}:", ""): collection
is this logic/processing covered in the unit tests? i know we may not be able to cover every corner of the dataset generation in unit tests, but it's some behavior that i think would be good to have covered both to ensure correct functionality and to clarify expected behavior
also, i think my suggestion about an "integration" level test may help in providing coverage for this piece of things? that would address my concern, no need to make it specifically a unit test.
maybe just another reason to look into having that sort of "integration" test with more realistic data
    the existing collections if no API data is available.
    """

    # convert FidesopsDataset to Dataset to be able to use the Collection helpers
just noting that i don't totally follow this, but probably due to my lack of understanding, so may be good to review together offline
tests/test_helpers/dataset_utils.py (Outdated)
return object_list


def get_data_type(value) -> Optional[str]:
to me, it feels like this function should live more centrally. what would you think of defining it somewhere within fidesops.graph.data_type.py? from what i can tell, there's nothing yet defined there that takes an arbitrary input value/variable and gives the corresponding proper Fidesops DataType, but I think that's what you want here.

If you add that utility method over there, then I think here you should be relying upon the proper DataType enum - passing around string literals here just feels a bit sloppy to me. Specifically, I think it's ripe for unnoticed regressions if/when any of the typing logic changes in the fidesops "taxonomy" (for lack of a better term), and it will be burdensome to maintain.
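For illustration, a minimal sketch of the kind of utility being suggested, using a stand-in enum rather than the real DataType defined in fidesops.graph.data_type (the member names below are assumptions):

```python
from enum import Enum
from typing import Any, Optional


class DataType(Enum):
    # stand-in for fidesops.graph.data_type.DataType; real member names may differ
    boolean = "boolean"
    integer = "integer"
    float = "float"
    string = "string"
    object = "object"


def to_data_type(value: Any) -> Optional[DataType]:
    """Map an arbitrary sample value to the closest DataType member."""
    # bool must be checked before int because bool is a subclass of int
    if isinstance(value, bool):
        return DataType.boolean
    if isinstance(value, int):
        return DataType.integer
    if isinstance(value, float):
        return DataType.float
    if isinstance(value, str):
        return DataType.string
    if isinstance(value, dict):
        return DataType.object
    return None
```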
    collection_map.keys(),
):
    if len(rows := api_data.get(collection_name, [])):
        fields = generate_fields(rows[0], collection_name, collection_map)
not necessarily a feature for now, but a potential enhancement could be to allow for a configurable "sampling size" here so that more than just the first row of response data is used. obviously we'd then need to build in some more advanced type-inference logic to take into account multiple values for each column, but just calling it out as an area for fine tuning.
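A rough sketch of what that sampling could look like, independent of the actual generate_fields implementation (the helper below is purely illustrative):

```python
from collections import Counter
from typing import Any, Dict, List


def infer_field_types(rows: List[Dict[str, Any]], sample_size: int = 5) -> Dict[str, str]:
    """Illustrative only: inspect up to `sample_size` rows and pick the most
    common non-null Python type name observed for each field."""
    observed: Dict[str, Counter] = {}
    for row in rows[:sample_size]:
        for field, value in row.items():
            if value is not None:
                observed.setdefault(field, Counter())[type(value).__name__] += 1
    return {field: counts.most_common(1)[0][0] for field, counts in observed.items()}
```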
it looks like there are some unit tests failing that i think are real failures? once that's addressed, i think i'm good to approve - there is one additional nitpick i threw on there for the
nice, this all looks great - thanks for making those updates to move the data type logic more centrally, i think it will help for maintainability.
excited to see this in action!
Purpose
Adds an update_dataset function which takes a ConnectionConfig, DatasetConfig, and API data to update the existing Dataset. This is meant to be used for SaaS connector development.

Functionality
Called in test_*_access_request_task after the run_access_request has returned a response
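As a rough usage sketch, with the argument order taken from the description above (the import path, fixture names, and exact signature are assumptions):

```python
# hypothetical wiring inside a SaaS connector access request test
from tests.test_helpers.dataset_utils import update_dataset  # assumed location


def test_example_saas_access_request_task(connection_config, dataset_config, run_access_request):
    # run the access request and capture the raw API data, keyed by collection
    api_data = run_access_request()
    # regenerate the saved dataset from the live response
    update_dataset(connection_config, dataset_config, api_data)
```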
Checklist
Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services