-
Notifications
You must be signed in to change notification settings - Fork 16
Conversation
@@ -219,6 +220,14 @@ def cast(self, value: Any) -> Optional[Any]: | |||
return value | |||
|
|||
|
|||
@dataclass | |||
class MaskingOverride: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, this is only being used in iteration in the query config as a temporary storage of (datatype, length). Maybe this class is not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a class, we have the benefit of type hints (vs using a namedtuple
), see https://stackoverflow.com/a/50038614/4957420
src/fidesops/graph/data_type.py
Outdated
"""Truncates value to given length""" | ||
logger.warning( | ||
"Length truncation not supported for ObjectId data_type. Using original masked value instead for update query." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all of these truncate methods, with the exception of string
, are identical, they should probably be implemented in the base class and just get overridden for string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
logger.warning( | ||
f"Unable to generate a query for field {field_name} due to: data_type of {masking_override.data_type} is not supported for the {strategy_config['strategy']} masking strategy" | ||
) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 4 separate "if not" conditions in this method, makes it a bit hard to follow. Is there a way to simplify this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for sure. Refactoring out some separate static methods to make this easier to read
|
||
value_map: Dict[str, Any] = {} | ||
for rule, field_names in rule_to_collection_fields.items(): | ||
strategy_config = rule.masking_strategy | ||
strategy = get_strategy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure this works in Python. I would do this: https://stackoverflow.com/a/6981771/4957420 but then I'd still have to iterate over the new filtered variable.
@@ -15,6 +15,7 @@ | |||
|
|||
|
|||
HASH = "hash" | |||
SUPPORTED_DATA_TYPES = ["string"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to have Supported Types
, it should probably be an abstract method or maybe a default implementation on the parent MaskingStrategy class. I don't think this should be a module level variable.
Also, since we're repeatedly calling "x in SUPPORTED_DATA_TYPES" , making SUPPORTED_DATA_TYPES a set (since we're only creating it once) will be faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using the abstract method data_type_supported
already, it's just the SUPPORTED_DATA_TYPES
const that's specific to each module. I think a data_type_supported
abstract method works better than a supported_data_types
abstract method bc it leaves the specific details up to the module. i.e. for null masking, we always support the data type, so the caller of a supported_data_types
abstract method would need to have a special case for null masking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That all makes sense, I do think, though, that SUPPORTED_DATA_TYPES should be inside the class and not accessible in any other way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'll update!
logger.warning( | ||
f"Length truncation not supported for {T} data_type. Using original masked value instead for update query." | ||
) | ||
return val | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I test this, I get e.g.
DataType.float.value.truncate(1, 1)
>> Length truncation not supported for ~T data_type. Using original masked value instead for update query.
The only way I've been able to access that in the superclass is via:
def truncate(self, length: int, val: T) -> T:
typename = self.__class__.__orig_bases__[0].__args__[0].__name__
# this will be "float", "int", ...
logger.warning(f"not supported ... {typename}....")
But from what I've read this is not terribly stable and version dependent. Maybe just a static log method that only takes the string type name as a parameter would be cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, thinking how I'd obtain type in the abstract base class here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it'd be a bit convoluted https://stackoverflow.com/questions/57706180/generict-base-class-how-to-get-type-of-t-from-within-instance so I'm going with diff wording for now in this message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @pattisdr suggested instead of typename the much simpler
def truncate(self, length: int, val: T) -> T:
typename = self.__class__.__name__
Which will give you e.g. "IntTypeConverter"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooo awesome idea thanks!
src/fidesops/graph/data_type.py
Outdated
@@ -49,6 +62,10 @@ def empty_value(self) -> int: | |||
"""Empty int value""" | |||
return 0 | |||
|
|||
def truncate(self, length: int, val: int) -> int: | |||
"""Truncates value to given length""" | |||
return int(str(val)[:length]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not the right way to truncate an int.
The reason for our truncation is so that we don't generate values that will fail on insert when we mask data because they won't fit in the target database. So, e.g. here are mysql types: https://dev.mysql.com/doc/refman/8.0/en/integer-types.html
These are truncated by bytes, so, e.g. if you had a mysql column defined as an INT and you tried to insert a value greater than 2147483647 it would fail.
I think this can be deferred ( for the moment at least - it was never supported in atlas AFAIK) but we shouldn't be truncating alphabetically, in any case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. Would byte truncation be diff for nosql vs sql?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know. The thing I posted was mysql specific. All of the nosql databases I've worked with sidestep this issue by basically allowing you to write anything - e.g. dynamo only uses BigDecimals in the java implementation .
I think it shoiuld only matter with defined schema types. I don't think it's hugely common to define things like TINYINT columns (probably mattered more when people were very concerned with field sizes) , but it could happen. Don't know if this has every been an issue with Atlas
* adding some types and questions for tech discussion * adding some types and questions for tech discussion * removes hmac, aes, and hash from supported strategies for now * adds supported data types to masking strategies * adds logger warning if query cannot be generated to do data type configs * adds support for truncating specific data types * gets current tests passing * adds test for length truncation * adds docs * cr * moving supported_data_types inside class * remove type var * remove int truncation * use class name to identify type Co-authored-by: catherinesmith <[email protected]>
Purpose
length
anddata_type
params onField
[ ] HMAC
[ ] Hash
[ ] AES Encryption
[ x ] Null
[ x ] Random string rewrite
[ x ] Default string rewrite
Changes
data_type
s to each masking strategydata_type
onField
for masking strategies that are notnull_rewrite
DataTypeConverter
to truncate val to specifiedlength
if provided (we only truncateinteger
andstring
data types)Checklist
Ticket
Fixes #70