RFC: Sensitive data masking utility #1858
Comments
Thanks for opening your first issue here! We'll come back to you as soon as we can. |
Prioritising for review tomorrow - missed our typical response time due to team offsite. We'll revert with comments by tomorrow COB |
Hi @seshubaws thank you for your great work on the RFC. I especially liked the way you already pre-tested some solutions to find some bottlenecks. I have two concerns:
Lastly, one thing I would love to see is augmenting objects that contain PII with the corresponding encrypted version. Example:
If we are interested in masking emails, this could turn into
This way, anyone with access to the decryption keys would be able to just augment the object with the original data, instead of having to find it on external systems. |
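The examples in the comment above did not survive, so here is a minimal sketch of what the augmentation idea might look like. The `_encrypted` key suffix, the `augment_with_encrypted` helper, and the `toy_reversible` function are all hypothetical names chosen for illustration; a real setup would pass in an encrypt function from an actual provider.

```python
from typing import Callable, Dict, Iterable

MASK = "*****"


def augment_with_encrypted(
    record: Dict[str, object],
    fields: Iterable[str],
    encrypt: Callable[[str], str],
) -> Dict[str, object]:
    """Return a copy where each listed field is masked and an encrypted twin is added."""
    augmented = dict(record)  # shallow copy; the caller's dict is left untouched
    for field in fields:
        if field not in augmented:
            continue  # nothing to mask for this record
        original = str(augmented[field])
        augmented[field] = MASK
        # Hypothetical naming convention: "<field>_encrypted"
        augmented[f"{field}_encrypted"] = encrypt(original)
    return augmented


def toy_reversible(value: str) -> str:
    """Placeholder only, NOT encryption: reverses the string so the demo is reversible."""
    return value[::-1]


record = {"id": 1, "email": "jane@example.com"}
augmented = augment_with_encrypted(record, ["email"], toy_reversible)
```

Anyone holding the real decryption key could then restore the value from the `_encrypted` twin without consulting an external system, which is the point the comment makes.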
This is great, I'm so excited we're moving along with this now. A few immediate comments:
On UX:
Super looking forward to this. Thank you for all the hard work you already put into this. |
I'm excited about all the possibilities we can uncover with this feature on the security front! Few things that come to my mind about it:
I'm looking forward to diving deep into this topic and joining the team bringing this to life! |
Hey, really liking this idea! |
@rubenfonseca Thank you for your comments, let me know if I haven't answered all your questions!
Yes, that makes sense to me! I played around a little with CloudWatch's (CW) new data masking tool and it seems they might be using regex pattern matching for the categories users want to mask, but I'm not sure (let me know if you know how they implemented this). I’m not very familiar with Pydantic, but if it will help users identify more generic PII fields, we should definitely use it.
Yes, I think this goes along with what Heitor mentioned about designing a bring-your-own-masking-provider type interface. As both the AWS Encryption SDK and ItsDangerous have their tradeoffs, we should leave it up to the customer to decide which would fit best with their specific use case. I’ll work on coming up with a UX draft for this.
We could definitely do this! I am a little confused on what use case would require a field to be masked and also encrypted though, as I am under the impression that if it’s encrypted, then it’s just as good as being masked with the added benefit of only users with access can decrypt it. |
Had an offline discussion with @heitorlessa (thanks for the thorough review and comments Heitor!) and thought I'd add it here for the public:
Seshu: So do we want to launch first with customers choosing categories of things to mask like what CW is offering right now? ie categories like
Heitor: Categories will prolly have false positives, so I’d wait until customers actually request it so we can brainstorm a bit more. Right now, masking any input that can be unmasked with appropriate permissions later is the way forward. For irreversible masking (****), I’m still torn on whether we do this now with AST, or simply recommend Pydantic SecretStr for now until other languages have bandwidth to port this utility. Basic UX of this:
This means we can use the same mechanic to provide an actual Obfuscator that can encrypt data at any node in the tree, and customers can deobfuscate by using the same method in reverse (Obfuscator.unmark, etc.). Using the same example but with SecretStr:
|
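The SecretStr example referenced above did not survive. As a rough stdlib-only illustration of the behavior under discussion (a value that prints as asterisks unless it is explicitly unwrapped), the hypothetical `MaskedStr` class below mimics the idea. It is a stand-in written for this note, not Pydantic's implementation; only the `get_secret_value` method name is borrowed from Pydantic.

```python
class MaskedStr:
    """Hypothetical stand-in for a SecretStr-like wrapper (not Pydantic's class)."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        """Explicit opt-in to read the real value."""
        return self._value

    def __repr__(self) -> str:
        # Never include the wrapped value in debug output
        return "MaskedStr('*****')"

    def __str__(self) -> str:
        # Formatting or logging the object only ever shows asterisks
        return "*****"


email = MaskedStr("jane@example.com")
log_line = f"user email: {email}"
```

Note that this kind of masking only hides the value from string output; the original is still held in memory, which is different from the irreversible (****) masking also discussed in this thread.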
@heitorlessa Responding to the rest of your questions here!
I am using only the stdlib ast module, but note that Python 3.9 or later is needed for what we’re using from it, though I don’t think that should be a problem.
I updated the original proposal above to include this, but had a few questions about the UX I wanted to get clarification on, since I was basing this draft on the one Heitor had proposed earlier in this comment:
I suppose we could use |
Great catch. We will have to support all the currently supported Python Lambda runtimes, which at this moment means we have to support Python 3.7. Do you know if there's a 3rd party library that we could use so we can cover all Python versions?
One idea was to have common use cases implemented by us in Powertools, so we make it as easy as possible for customers to adapt.
If we can't find a good 3rd party library to replace stdlib ast, Pydantic could be a good option since we already use it elsewhere.
I don't have any strong opinion on this. |
astunparse is a library that also offers the same capabilities as the stdlib ast for all Python versions, and it has a BSD license. |
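A minimal sketch of how the version gap could be bridged, assuming the 3.9-only piece in question is `ast.unparse` (the thread does not say exactly which part of `ast` is used). On older interpreters it falls back to the third-party `astunparse` package, which would have to be installed separately.

```python
import ast
import sys

if sys.version_info >= (3, 9):
    unparse = ast.unparse  # stdlib, available from Python 3.9
else:
    import astunparse  # third-party package; assumed to be installed on older runtimes

    unparse = astunparse.unparse

tree = ast.parse("total = price * quantity")
# The two implementations format output slightly differently (whitespace,
# parentheses), so strip and compare by re-parsing rather than by exact text.
source = unparse(tree).strip()
```

Because the textual output is not identical across the two libraries, anything that depends on the exact generated source should be compared structurally, for example by re-parsing it.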
@seshubaws and I had a 1:1 sync as there were some confusing areas on scope (now vs later). I'd like to take a moment to recap on what we're trying to do and when. There are two macro use cases:
For now, we should focus on the first macro use case - Mask/Unmask primitive data types. Why? Our primary focus is making sure this feature can use the AWS Encryption SDK and that it is easily swappable for another implementation (e.g., ItsDangerous, bring-your-own). The second use case also depends on the first one. For example, if a customer uses Pydantic (the most popular data validation library), they can start addressing the second use case by combining this feature with a validator. Pydantic will do the heavy lifting of traversing the model (Node visitor pattern) and call this feature to mask/unmask data during the data validation phase. This gives us time to focus on the UX to solve the actual problem (Mask/Unmask data), gives us a window for customer feedback to get extensibility right (batteries included but swappable), and only then take on the second use case for those not using Pydantic. This could mean a launch with the first use case completed, followed by the second use case in a separate release - we don't have to do it all at once; it's a complex domain and UX is of utmost importance. Please let me know if I can make any of these parts any clearer. |
Please let us know if you have any updates @seshubaws in the coming weeks. |
Reopening this issue to update the future changes and the developer experience. |
Quick summary of the experience that's been merged. Take the following sample data:
{
    "id": 1,
    "name": "John Doe",
    "age": 30,
    "email": "[email protected]",
    "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "state": "CA",
        "zip": "12345"
    }
}
We can now encrypt selected fields:
import os

from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider.kms.aws_encryption_sdk import AwsEncryptionSdkProvider

encryption_keys = os.getenv("ENCRYPTION_KEYS")
encryption_provider = AwsEncryptionSdkProvider(keys=[encryption_keys])
data_masker = DataMasking(provider=encryption_provider)

def lambda_handler(event: dict, context):
    # Only the listed fields are encrypted; nested fields use dotted paths
    encrypted: dict = data_masker.encrypt(data=event, fields=["email", "address.street"])
    # {
    #     "id": 1,
    #     "name": "John Doe",
    #     "age": 30,
    #     "email": "InRoaXMgaXMgYSBzdHJpbmciHsLZGx2na-XzP_TB5Bf2LNU1bLc",
    #     "address": {
    #         "street": "XMgYSB_KDddaDJYMb-JpbmGnagTklwQ-msdaDLP",
    #         "city": "Anytown",
    #         "state": "CA",
    #         "zip": "12345"
    #     }
    # }
Conversely, the decrypt operation could work in a separate function:
import os

from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider.kms.aws_encryption_sdk import AwsEncryptionSdkProvider

encryption_keys = os.getenv("ENCRYPTION_KEYS")
encryption_provider = AwsEncryptionSdkProvider(keys=[encryption_keys])
data_masker = DataMasking(provider=encryption_provider)

def lambda_handler(event, context):
    # The same field list restores the original values
    decrypted = data_masker.decrypt(data=event, fields=["email", "address.street"])
    # {
    #     "id": 1,
    #     "name": "John Doe",
    #     "age": 30,
    #     "email": "[email protected]",
    #     "address": {
    #         "street": "123 Main St",
    #         "city": "Anytown",
    #         "state": "CA",
    #         "zip": "12345"
    #     }
    # }
|
This is now released in version 2.26.0! |
Reopening |
Closing as the code is merged; docs are in progress. That should better reflect what's pending on the board. |
Is this related to an existing feature request or issue?
#1173
Which AWS Lambda Powertools utility does this relate to?
Other
Summary
Customers would like to obfuscate incoming data for known fields that contain PII, so that they're not passed downstream or accidentally logged. With the increase of batch processing utilities and GDPR, this is one of the hardest tasks for customers today, especially when considering multi-account users.
AWS Encryption SDK is a good starting point but it can be too complex for the average developer, data engineer, or DevOps persona to use. As such, it is highly requested that the Powertools library have a utility to easily mask and/or encrypt sensitive data.
Use case
The use case for this utility would be for developers who want to mask or encrypt sensitive data such as names, addresses, SSNs, etc., so that it is not logged in CloudWatch and compromised, and so that downstream systems like S3, DynamoDB, RDBMS, etc. need no additional work to handle PII data.
Additionally, developers should be able to recover encrypted sensitive data to its original form so that they can handle sensitive requests around that data on an as-needed basis.
Proposal
The data masking utility should allow users to mask data, or encrypt and decrypt it. If they would like to encrypt their data, customers should be able to decide for themselves which encryption provider they want to use, though we will provide an out-of-the-box integration with the AWS Encryption SDK. The below code snippet is a rudimentary look at how this utility can be used and how it will function.
Usage
AWS Encryption SDK
The AWS Encryption SDK is a client-side encryption library that makes it easier to encrypt and decrypt data of any type in your application. The Encryption SDK is available in all the languages that Powertools supports. You can use it with customer master keys in AWS Key Management Service (AWS KMS), though the library does not require any AWS service. When you encrypt data, the SDK returns a single, portable encrypted message that includes the encrypted data and encrypted data keys. This object is designed to work in many different types of applications. You can specify many of the encryption options, including selecting an encryption and signing algorithm.
Latencies
Latencies for using this utility with the AWS Encryption SDK in Lambda functions configured with 128MB, 1024MB, and 1769MB, respectively.
Custom encryption
If a customer would like to use another encryption provider, or define their own encrypt and decrypt functions, we will define an interface that the customer can implement and pass in to the DataMaskingUtility class.
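A minimal sketch of what such an interface could look like, under stated assumptions: `BaseProvider`, its method signatures, and the `ReversingProvider` demo class are hypothetical and not the final API. Only the `DataMaskingUtility` name comes from this proposal. The sketch also shows one possible answer to the open question about masking: a `mask` method that is always available, independent of any encryption.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseProvider(ABC):
    """Hypothetical provider interface: subclasses supply encrypt and decrypt."""

    MASK = "*****"

    def mask(self, data: Any) -> str:
        # Irreversible masking needs no key material, so it has a default
        return self.MASK

    @abstractmethod
    def encrypt(self, data: str) -> str:
        ...

    @abstractmethod
    def decrypt(self, data: str) -> str:
        ...


class DataMaskingUtility:
    """Thin wrapper that delegates every operation to the injected provider."""

    def __init__(self, provider: BaseProvider) -> None:
        self.provider = provider

    def mask(self, data: Any) -> str:
        return self.provider.mask(data)

    def encrypt(self, data: str) -> str:
        return self.provider.encrypt(data)

    def decrypt(self, data: str) -> str:
        return self.provider.decrypt(data)


class ReversingProvider(BaseProvider):
    """Toy provider for demonstration only; reversing a string is NOT encryption."""

    def encrypt(self, data: str) -> str:
        return data[::-1]

    def decrypt(self, data: str) -> str:
        return data[::-1]


utility = DataMaskingUtility(provider=ReversingProvider())
round_trip = utility.decrypt(utility.encrypt("123 Main St"))
```

Passing the provider to the constructor keeps the utility itself free of any encryption dependency, so swapping the AWS Encryption SDK for another implementation would only mean handing in a different object.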
Out of scope
Traversing an arbitrary dictionary will be out of scope for the initial launch of this tool. Instead, the utility will receive explicit instructions about where in the given dictionary it should mask or unmask data.
We still need to determine the most efficient method of taking an input JSON path and masking or encrypting the value at that path. JMESPath or JSONPath can be considered for simple use cases, but we need to find the fastest method.
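To make the "explicit instructions" idea concrete, here is a minimal sketch that handles plain dotted keys only (for example `address.street`). It is not JMESPath or JSONPath, it does not handle lists or wildcards, and it makes no claim about being the fastest method. The `apply_at_paths` helper is a hypothetical name used for illustration.

```python
from copy import deepcopy
from typing import Any, Callable, Dict, Iterable


def apply_at_paths(
    data: Dict[str, Any],
    paths: Iterable[str],
    transform: Callable[[Any], Any],
) -> Dict[str, Any]:
    """Return a deep copy of data with transform applied at each dotted path."""
    result = deepcopy(data)  # leave the caller's dictionary untouched
    for path in paths:
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node[key]  # raises KeyError if the path does not exist
        node[leaf] = transform(node[leaf])
    return result


sample = {
    "id": 1,
    "email": "jane@example.com",
    "address": {"street": "123 Main St", "city": "Anytown"},
}
masked = apply_at_paths(sample, ["email", "address.street"], lambda _: "*****")
```

The same helper could apply an encrypt or decrypt callable instead of the masking lambda, since the traversal does not care what the transform does.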
Potential challenges
Should the interface require encrypt and decrypt even in the case where the customer only masks and won't be able to decrypt later? Should we have a mask method in the interface that is always accessible so that users can irretrievably mask some data and also encrypt some data?
Dependencies and Integrations
Integration with the AWS Encryption SDK.
Alternative solutions
No response
Acknowledgment