Skip to content
This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

70 feat adds policy driven masking #108

Merged
merged 15 commits into from
Dec 3, 2021

Conversation

eastandwestwind
Copy link
Contributor

@eastandwestwind eastandwestwind commented Nov 30, 2021

Purpose

  • Adds handling for new length and data_type params on Field
  • Restricts supported masking strategies to the following:
    [ ] HMAC
    [ ] Hash
    [ ] AES Encryption
    [ x ] Null
    [ x ] Random string rewrite
    [ x ] Default string rewrite

Changes

  • Adds supported data_types to each masking strategy
  • Requires a data_type on Field for masking strategies that are not null_rewrite
  • Adds methods to each DataTypeConverter to truncate val to specified length if provided (we only truncate integer and string data types)

Checklist

  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo)
  • Good unit test/integration test coverage

Ticket

Fixes #70

@@ -219,6 +220,14 @@ def cast(self, value: Any) -> Optional[Any]:
return value


@dataclass
class MaskingOverride:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, this is only being used in iteration in the query config as a temporary storage of (datatype, length). Maybe this class is not needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a class, we have the benefit of type hints (vs using a namedtuple), see https://stackoverflow.com/a/50038614/4957420

"""Truncates value to given length"""
logger.warning(
"Length truncation not supported for ObjectId data_type. Using original masked value instead for update query."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all of these truncate methods, with the exception of string, are identical, they should probably be implemented in the base class and just get overridden for string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

logger.warning(
f"Unable to generate a query for field {field_name} due to: data_type of {masking_override.data_type} is not supported for the {strategy_config['strategy']} masking strategy"
)
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 4 separate "if not" conditions in this method, makes it a bit hard to follow. Is there a way to simplify this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sure. Refactoring out some separate static methods to make this easier to read


value_map: Dict[str, Any] = {}
for rule, field_names in rule_to_collection_fields.items():
strategy_config = rule.masking_strategy
strategy = get_strategy(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about on #139: for rule, field_names in rule_to_collection_fields.items() if rule.masking_strategy: to get rid of the extra conditional on #141?

Copy link
Contributor Author

@eastandwestwind eastandwestwind Dec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure this works in Python. I would do this: https://stackoverflow.com/a/6981771/4957420 but then I'd still have to iterate over the new filtered variable.

@@ -15,6 +15,7 @@


HASH = "hash"
SUPPORTED_DATA_TYPES = ["string"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to have Supported Types, it should probably be an abstract method or maybe a default implementation on the parent MaskingStrategy class. I don't think this should be a module level variable.

Also, since we're repeatedly calling "x in SUPPORTED_DATA_TYPES" , making SUPPORTED_DATA_TYPES a set (since we're only creating it once) will be faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using the abstract method data_type_supported already, it's just the SUPPORTED_DATA_TYPES const that's specific to each module. I think a data_type_supported abstract method works better than a supported_data_types abstract method bc it leaves the specific details up to the module. i.e. for null masking, we always support the data type, so the caller of a supported_data_types abstract method would need to have a special case for null masking

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all makes sense, I do think, though, that SUPPORTED_DATA_TYPES should be inside the class and not accessible in any other way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll update!

logger.warning(
f"Length truncation not supported for {T} data_type. Using original masked value instead for update query."
)
return val

Copy link
Contributor

@stevenbenjamin stevenbenjamin Dec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I test this, I get e.g.

DataType.float.value.truncate(1, 1)
>> Length truncation not supported for ~T data_type. Using original masked value instead for update query.

The only way I've been able to access that in the superclass is via:

def truncate(self, length: int, val: T) -> T:
  typename = self.__class__.__orig_bases__[0].__args__[0].__name__ 
  # this will be "float", "int", ... 
  logger.warning(f"not supported ...  {typename}....")

But from what I've read this is not terribly stable and version dependent. Maybe just a static log method that only takes the string type name as a parameter would be cleaner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, thinking how I'd obtain type in the abstract base class here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it'd be a bit convoluted https://stackoverflow.com/questions/57706180/generict-base-class-how-to-get-type-of-t-from-within-instance so I'm going with diff wording for now in this message

Copy link
Contributor

@stevenbenjamin stevenbenjamin Dec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @pattisdr suggested instead of typename the much simpler

def truncate(self, length: int, val: T) -> T:
  typename = self.__class__.__name__ 

Which will give you e.g. "IntTypeConverter"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooo awesome idea thanks!

@@ -49,6 +62,10 @@ def empty_value(self) -> int:
"""Empty int value"""
return 0

def truncate(self, length: int, val: int) -> int:
"""Truncates value to given length"""
return int(str(val)[:length])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not the right way to truncate an int.
The reason for our truncation is so that we don't generate values that will fail on insert when we mask data because they won't fit in the target database. So, e.g. here are mysql types: https://dev.mysql.com/doc/refman/8.0/en/integer-types.html
These are truncated by bytes, so, e.g. if you had a mysql column defined as an INT and you tried to insert a value greater than 2147483647 it would fail.

I think this can be deferred ( for the moment at least - it was never supported in atlas AFAIK) but we shouldn't be truncating alphabetically, in any case.

Copy link
Contributor Author

@eastandwestwind eastandwestwind Dec 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. Would byte truncation be diff for nosql vs sql?

Copy link
Contributor

@stevenbenjamin stevenbenjamin Dec 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really know. The thing I posted was mysql specific. All of the nosql databases I've worked with sidestep this issue by basically allowing you to write anything - e.g. dynamo only uses BigDecimals in the java implementation .
I think it shoiuld only matter with defined schema types. I don't think it's hugely common to define things like TINYINT columns (probably mattered more when people were very concerned with field sizes) , but it could happen. Don't know if this has every been an issue with Atlas

@stevenbenjamin stevenbenjamin merged commit bef54bc into main Dec 3, 2021
@pattisdr pattisdr deleted the 70-feat-adds-policy-driven-masking branch February 9, 2022 19:08
sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* adding some types and questions for tech discussion

* adding some types and questions for tech discussion

* removes hmac, aes, and hash from supported strategies for now

* adds supported data types to masking strategies

* adds logger warning if query cannot be generated to do data type configs

* adds support for truncating specific data types

* gets current tests passing

* adds test for length truncation

* adds docs

* cr

* moving supported_data_types inside class

* remove type var

* remove int truncation

* use class name to identify type

Co-authored-by: catherinesmith <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Policy-driven null, random string and preset default string masking types
2 participants