feat: simple JSON Schema validator utility #153

Merged
merged 20 commits into develop from feat/validator-utility
Sep 18, 2020

heitorlessa
Contributor

@heitorlessa heitorlessa commented Aug 31, 2020

Issue #, if available: #95, and related to #147

Description of changes:

Checklist

  • Meet tenets criteria
  • Update tests
  • Update docs
  • PR title follows conventional commit semantics
  • validator decorator for both inbound/outbound schema
  • validate standalone function
  • support JMESPath for flexible envelope expressions (a.k.a data selectors)
  • make jmespath an optional dependency (pip install aws-lambda-powertools[jmespath])
  • Docstrings
  • Double check whether we want to throw data into exception as it might contain sensitive data
    • Trade-off on having helpful exception
  • Experiment w/ closure/cache envelope expr
    • maybe premature optimization given it's quite fast (need benchmark)
  • Bring envelopes for popular event sources to retrieve data
    • API Gateway REST API
    • API Gateway HTTP API
    • SQS
    • SNS
    • EventBridge
    • Kinesis
    • CloudWatch Events Scheduled Event
    • CloudWatch Logs
  • Custom JMESPath function for JSON deserialization
  • Custom JMESPath function for Base64 decoding
  • Custom JMESPath function for Base64 and gzip decompression (cloudwatch logs)

Open questions

None

UX

Decorator

from aws_lambda_powertools.utilities.validation import validate, validator

hello_schema = {
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "http://example.com/example.json",
    "type": "object",
    "title": "The root schema",
    "description": "The root schema comprises the entire JSON document.",
    "examples": [
        {
            "message": "hello world",
            "username": "lessa"
        }
    ],
    "required": [
        "message",
        "username"
    ],
    "properties": {
        "message": {
            "$id": "#/properties/message",
            "type": "string",
            "title": "The message schema",
            "description": "Message field",
        },
        "username": {
            "$id": "#/properties/username",
            "type": "string",
            "title": "The username schema",
            "description": "Username field",
        }
    },
    "additionalProperties": True
}

sample_wrapped_event = {
    "data": {
        "payload": {
            "message": "hello hello",
            "username": "blah blah"
        }
    }
}

# Pass an inbound schema, an outbound schema, or both.
# The envelope is a JMESPath expression that selects
# only the data that should be validated against the schema.
@validator(inbound_schema=hello_schema, envelope="data.payload")
def handler(evt, ctx):
    ...
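
To make the mechanics concrete, here is a minimal stdlib-only sketch of what a decorator like this does conceptually: select the enveloped data, validate it, then call the handler. It is not this PR's implementation. `extract`, `check_required`, and `simple_validator` are hypothetical names; `extract` only understands dotted paths (not real JMESPath), and `check_required` only checks the schema's "required" keys (not real JSON Schema).

```python
import functools


def extract(event, envelope):
    """Hypothetical stand-in for JMESPath: walks a dotted path such as "data.payload"."""
    for key in envelope.split("."):
        event = event[key]
    return event


def check_required(data, schema):
    """Hypothetical stand-in for JSON Schema validation: only checks "required" keys."""
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")


def simple_validator(inbound_schema, envelope):
    """Validate the enveloped part of the event before the handler runs."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            check_required(extract(event, envelope), inbound_schema)
            return handler(event, context)
        return wrapper
    return decorator


schema = {"required": ["message", "username"]}


@simple_validator(inbound_schema=schema, envelope="data.payload")
def handler(evt, ctx):
    return "ok"


event = {"data": {"payload": {"message": "hello hello", "username": "blah blah"}}}
print(handler(event, None))
```

The handler body only runs when the selected data passes the check; otherwise the exception propagates and the invocation fails.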

Deserializing JSON Strings before validation with powertools_json

We'll provide a JMESPath function, powertools_json, to deserialize JSON strings. These are common in API Gateway events but can appear in any payload. The function deserializes whatever value you pass as its argument.

import json

from aws_lambda_powertools.utilities.validation import validate, validator

sample_wrapped_event_json_string = {
    "data": json.dumps({"payload": {"message": "hello hello", "username": "blah blah"}})
}

@validator(inbound_schema=hello_schema, envelope="powertools_json(data).payload")
def handler(evt, ctx):
    ...
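
As a rough illustration, the envelope "powertools_json(data).payload" should select the same data as the plain-Python steps below. This is a stdlib-only sketch of the intended result, not the JMESPath function itself.

```python
import json

event = {
    "data": json.dumps({"payload": {"message": "hello hello", "username": "blah blah"}})
}

# Rough equivalent of envelope="powertools_json(data).payload":
# 1. take the "data" key, which holds a JSON string
# 2. deserialize that string into a dict
# 3. select the "payload" key from the result
payload = json.loads(event["data"])["payload"]
print(payload)
```

Only `payload` would then be validated against the schema; the rest of the event is ignored.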

Deserializing Base64 JSON Strings before validation with powertools_base64 and powertools_json

We'll provide a function to decode base64 data that can be used in tandem with our function to deserialize JSON strings. This is typically needed for events coming from Kinesis records, CloudWatch Logs, etc.

Here is an example of a raw JMESPath expression that decodes and deserializes JSON strings sent to Kinesis before applying validation. For Kinesis, the schema has to account for an array of data.

from aws_lambda_powertools.utilities.validation import envelopes, validate, validator

sample_wrapped_event_base64_json_string = {
    "data": "eyJtZXNzYWdlIjogImhlbGxvIGhlbGxvIiwgInVzZXJuYW1lIjogImJsYWggYmxhaCJ9="
}

# Pick one of the two decorators below; stacking both would run both validations.
# Option 1: the event is a batch of Kinesis records (array)
# @validator(inbound_schema=hello_schema, envelope=envelopes.KINESIS_DATA_STREAM)
# Option 2: the sample event above
@validator(inbound_schema=hello_schema, envelope="powertools_base64(data) | powertools_json(@)")
def handler(evt, ctx):
    ...
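
For reference, the two steps in "powertools_base64(data) | powertools_json(@)" correspond roughly to the stdlib calls below. This is a sketch of the intended behaviour only; the sample event is built in the code so the round trip can be checked.

```python
import base64
import json

record = {"message": "hello hello", "username": "blah blah"}

# Build a base64-encoded JSON string, similar in shape to the sample event above
encoded = base64.b64encode(json.dumps(record).encode("utf-8")).decode("utf-8")
event = {"data": encoded}

# Step 1 (powertools_base64): decode the base64 text back into a JSON string
decoded = base64.b64decode(event["data"]).decode("utf-8")

# Step 2 (powertools_json): deserialize the JSON string into a dict
payload = json.loads(decoded)
print(payload)
```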

Decompressing and deserializing GZIP Base64 JSON Strings before validation with powertools_base64_gzip

AWS services like CloudWatch Logs compress log events with gzip and base64-encode the binary content. Similar to the base64 and JSON deserialization above, we'll provide a powertools_base64_gzip function.

An example of a CloudWatch Logs event already compressed to demonstrate this function:

from aws_lambda_powertools.utilities.validation import envelopes, validate, validator

sample_wrapped_event_base64_json_string = {
    "awslogs": {
        "data": "H4sIACZAXl8C/52PzUrEMBhFX2UILpX8tPbHXWHqIOiq3Q1F0ubrWEiakqTWofTdTYYB0YWL2d5zvnuTFellBIOedoiyKH5M0iwnlKH7HZL6dDB6ngLDfLFYctUKjie9gHFaS/sAX1xNEq525QxwFXRGGMEkx4Th491rUZdV3YiIZ6Ljfd+lfSyAtZloacQgAkqSJCGhxM6t7cwwuUGPz4N0YKyvO6I9WDeMPMSo8Z4Ca/kJ6vMEYW5f1MX7W1lVxaG8vqX8hNFdjlc0iCBBSF4ERT/3Pl7RbMGMXF2KZMh/C+gDpNS7RRsp0OaRGzx0/t8e0jgmcczyLCWEePhni/23JWalzjdu0a3ZvgEaNLXeugEAAA=="
    }
}

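# Pick one of the two decorators below; stacking both would run both validations.
# Option 1: the event is a CloudWatch Logs event, using the built-in envelope
# @validator(inbound_schema=hello_schema, envelope=envelopes.CLOUDWATCH_LOGS)
# Option 2: the sample event above, using a raw JMESPath expression
@validator(inbound_schema=hello_schema, envelope="awslogs.powertools_base64_gzip(data) | powertools_json(@).logEvents[*]")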
def handler(evt, ctx):
    ...
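
The expression above chains three steps: base64 decoding, gzip decompression, and JSON deserialization, followed by selecting logEvents. A stdlib-only sketch of the same pipeline is shown below. The log group name and log events are made-up sample values, and the code illustrates the intended result rather than the JMESPath function itself.

```python
import base64
import gzip
import json

# Made-up sample data shaped like a decoded CloudWatch Logs payload
logs = {
    "logGroup": "/aws/lambda/example",
    "logEvents": [
        {"id": "1", "message": "first line"},
        {"id": "2", "message": "second line"},
    ],
}

# What the service side does: serialize, gzip, then base64-encode
compressed = gzip.compress(json.dumps(logs).encode("utf-8"))
event = {"awslogs": {"data": base64.b64encode(compressed).decode("utf-8")}}

# What the envelope has to undo before validation can run
raw = base64.b64decode(event["awslogs"]["data"])   # base64 -> gzip bytes
text = gzip.decompress(raw).decode("utf-8")        # gzip bytes -> JSON string
log_events = json.loads(text)["logEvents"]         # JSON string -> list of events
print(log_events)
```

Because the result is a list, the schema used for validation has to describe an array of log events.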

Standalone function

import json

from aws_lambda_powertools.utilities.validation import validate, validator

sample_wrapped_event_json_string = {
    "data": json.dumps({"payload": {"message": "hello hello", "username": "blah blah"}})
}

def handler(evt, ctx):
    validate(event=evt, schema=hello_schema, envelope="powertools_json(data).payload")

Built-in envelopes

import json

from aws_lambda_powertools.utilities.validation import validator, envelopes

sample_wrapped_event_json_string = {
    "data": json.dumps({"message": "hello hello", "username": "blah blah"})
}

@validator(inbound_schema=hello_schema, envelope=envelopes.API_GATEWAY_REST)
def handler(evt, ctx):
    return True

Breaking change checklist

RFC issue #:

  • Migration process documented
  • Implement warnings (if it can live side by side)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@heitorlessa heitorlessa added area/utilities feature New feature or functionality labels Aug 31, 2020
@codecov-commenter

codecov-commenter commented Aug 31, 2020

Codecov Report

Merging #153 into develop will decrease coverage by 0.20%.
The diff coverage is 97.43%.


@@             Coverage Diff             @@
##           develop     #153      +/-   ##
===========================================
- Coverage   100.00%   99.79%   -0.21%     
===========================================
  Files           33       39       +6     
  Lines          918      996      +78     
  Branches        77       83       +6     
===========================================
+ Hits           918      994      +76     
- Misses           0        1       +1     
- Partials         0        1       +1     
Impacted Files Coverage Δ
...ambda_powertools/utilities/validation/validator.py 90.47% <90.47%> (ø)
...lambda_powertools/utilities/validation/__init__.py 100.00% <100.00%> (ø)
aws_lambda_powertools/utilities/validation/base.py 100.00% <100.00%> (ø)
...ambda_powertools/utilities/validation/envelopes.py 100.00% <100.00%> (ø)
...mbda_powertools/utilities/validation/exceptions.py 100.00% <100.00%> (ø)
...ertools/utilities/validation/jmespath_functions.py 100.00% <100.00%> (ø)
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 063d8cd...f7f3c99. Read the comment docs.

@heitorlessa heitorlessa marked this pull request as draft August 31, 2020 20:57
@jplock

jplock commented Sep 1, 2020

I like this a lot 🎊

@nmoutschen
Contributor

nmoutschen commented Sep 1, 2020

Idea on how to handle API Gateway events and other events that might contain serialized data for the envelope:

We could add a function to JMESPath named powertools_json that takes a string as input and deserializes it. This would allow using these expressions as envelopes:

"powertools_json(body)" -> Deserialize the body and return it
"body" -> Keep the body as-is (e.g. string)
"powertools_json(body).payload" -> Deserialize the body and return the value for the payload key

# We could have nested JSON bodies
"powertools_json(body).payload | powertools_json(@)"

Why prefix with powertools? The documentation for JMESPath states that "If you have a custom function that you’ve found useful, consider submitting it to jmespath.site and propose that it be added to the JMESPath language", therefore new functions could be added over time. If JMESPath adds a "json()" function and this function behaves in a slightly different way, there should be no conflict with the Powertools implementation.

@heitorlessa
Contributor Author

@nmoutschen @cakepietoast could you have a pass at the UX and help me answer the questions I put at the description, please?

:)

@nmoutschen
Contributor

Overall the UX looks good. Quick questions:

  • Will this also support outbound validation through the same decorator?
  • (Nitpick) Should the envelope for API_GATEWAY_REST be v1? HTTP API supports both v1 and v2.

@heitorlessa
Contributor Author

heitorlessa commented Sep 2, 2020 via email

@heitorlessa
Contributor Author

heitorlessa commented Sep 3, 2020

Update based on discussions we had on Slack

JMESPath as a project dependency

It initially started as an optional dependency, but we'll need a standard across Powertools, as customers are requesting a utility to extract data from their events as well as validate them against JSON Schema.

JMESPath is only 24KB, as it is essentially a lexer and an AST, so adding it as a dependency has no meaningful impact on package size or at runtime. It also gives customers the flexibility to use their own JMESPath expressions to work with their events, the same way they're used with the AWS CLI and other related projects, which makes it a standard.

Custom function to deserialize base64 data

Kinesis is a popular event source, and customers will not be able to validate JSON payloads without decoding them first. The same problem occurs when working with the event itself, where customers need to write a for loop and decode each record before working with it.

We'll create a custom JMESPath function named powertools_base64 to help address this while making it easier to maintain and remove complexity from customers. They'll be able to either use envelope=envelopes.KINESIS_DATA_STREAM or use the function for any base64 payload via envelope="powertools_base64(key)"

@nmoutschen @cakepietoast need your public sign-off to this for historical reasons :)

@nmoutschen
Contributor

If you meant envelope="powertools_base64(key)" then I'm OK with it :)

@to-mc
Contributor

to-mc commented Sep 3, 2020

@nmoutschen @cakepietoast need your public sign-off to this for historical reasons :)

You have it!

@brysontyrrell

This is shockingly similar to code we wrote for our APIs to handle schema validations.

Question on use as a decorator: it looks like there's nothing to handle validator exceptions so a 4xx response could be returned instead of a 5xx. Is the intended use case here to use middleware to wrap the handler so those exceptions can be handled?

@nmoutschen
Contributor

nmoutschen commented Sep 4, 2020

@brysontyrrell This utility should support both sync and async workflows. The error handling would be different between the two. And for sync, it could be AppSync, ALB, API Gateway, etc. where the return values might vary.

I think the use-case where someone would like to use the decorator over the handler is for cases where the result should be the Lambda execution failing. In async cases, this would trigger the OnFailure/DLQ/etc. flow.

For API Gateway cases, I think this could be done like this:

def handler(evt, ctx):
    try:
        validate(event=evt, schema=hello_schema, envelope="powertools_json(data).payload")
    except ValidationException:
        return {
            "statusCode": 400,
            "body": "Invalid payload"
        }

The ValidationException would come from validation errors, while internal errors would result in an invocation failure.
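
A fuller, self-contained sketch of that pattern follows. `validate` and `ValidationException` here are hypothetical stand-ins defined inline so the example runs on its own; they are not the utility's implementation, and the stand-in only checks that the body is a JSON object with the required keys.

```python
import json


class ValidationException(Exception):
    """Stand-in for the exception name used in the snippet above."""


def validate(event, required):
    """Hypothetical stand-in for validate(): checks required keys in the JSON body."""
    try:
        body = json.loads(event["body"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValidationException("body is not valid JSON") from exc
    if not isinstance(body, dict):
        raise ValidationException("body must be a JSON object")
    missing = [key for key in required if key not in body]
    if missing:
        raise ValidationException(f"missing keys: {missing}")
    return body


def handler(evt, ctx):
    try:
        body = validate(evt, required=["message", "username"])
    except ValidationException as exc:
        # Client error: respond with a 4xx instead of failing the invocation
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    return {"statusCode": 200, "body": json.dumps({"received": body["message"]})}


good = {"body": json.dumps({"message": "hello", "username": "lessa"})}
bad = {"body": json.dumps({"message": "hello"})}
print(handler(good, None)["statusCode"], handler(bad, None)["statusCode"])
```

Any exception other than ValidationException is deliberately left uncaught, so internal errors still surface as invocation failures.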

@heitorlessa
Contributor Author

@brysontyrrell exactly what Nicolas said. It's also a common problem, so solutions aren't going to be that much different.

I've just got JSON Schema validation working seamlessly against Kinesis Data Stream records. The UX will look like this:

@validator(inbound_schema=hello_schema, envelope=envelopes.KINESIS_DATA_STREAM)
def handler_base64_event(evt, ctx):
    ...

What's happening here: the KINESIS_DATA_STREAM envelope is a JMESPath expression that uses our custom functions to decode base64-encoded records and deserialize the resulting JSON strings.

  • The expression looks like this: KINESIS_DATA_STREAM = "Records[*].kinesis.powertools_base64(data) | powertools_json(@)"
  • powertools_base64 is a custom function we'll provide to do the heavy lifting of decoding any record customers want; powertools_json then deserializes the decoded records, which are typically JSON strings, into actual dicts

From a JSON Schema perspective, it'll have to take into account an array of data, other than that it's pretty seamless IMHO.
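
To illustrate the array point, here is a stdlib-only sketch of the result the envelope is meant to produce for a batch of records, plus one way a schema could account for the array by wrapping the per-record schema in "items". The sample records, the `encode` helper, and the shortened `item_schema` are assumptions made for the example.

```python
import base64
import json


def encode(record):
    """Hypothetical helper: serialize a record the way it would arrive in a Kinesis event."""
    return base64.b64encode(json.dumps(record).encode("utf-8")).decode("utf-8")


messages = [
    {"message": "hello hello", "username": "blah blah"},
    {"message": "hi again", "username": "lessa"},
]
event = {"Records": [{"kinesis": {"data": encode(m)}} for m in messages]}

# Intended result of the envelope: every record decoded and deserialized
records = [
    json.loads(base64.b64decode(record["kinesis"]["data"]))
    for record in event["Records"]
]

# Shortened per-record schema, and an array schema that wraps it
item_schema = {
    "type": "object",
    "required": ["message", "username"],
    "properties": {"message": {"type": "string"}, "username": {"type": "string"}},
}
array_schema = {"type": "array", "items": item_schema}
print(records)
```

With this shape, `array_schema` rather than `item_schema` would be the one passed as the inbound schema.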

What do you think @jplock @nmoutschen @michaelbrewer @cakepietoast

* develop: (57 commits)
  chore: bump version to 1.5.0 (#158)
  chore(batch): Housekeeping for recent changes (#157)
  docs: address readability feedbacks
  chore: add debug logging for sqs batch processing
  chore: remove middlewares module, moving decorator functionality to base and sqs
  docs: add detail to batch processing
  fix: throw exception by default if messages processing fails
  chore: add test for sqs_batch_processor interface
  fix: add sqs_batch_processor as its own method
  docs: simplify documentation more SQS specific focus Update for sqs_batch_processor interface
  chore: add sqs_batch_processor decorator to simplify interface
  feat(parameters): transform = "auto" (#133)
  chore: fix typos, docstrings and type hints (#154)
  chore: tiny changes for readability
  fix: ensure debug log event has latest ctx
  docs: rephrase the wording to make it more clear
  docs: refactor example; improve docs about creating your own processor
  refactor: remove references to BaseProcessor. Left BasePartialProcessor
  docs: add newly created Slack Channel
  fix: update image with correct sample
  ...
@heitorlessa heitorlessa marked this pull request as ready for review September 16, 2020 17:34
@to-mc to-mc self-requested a review September 18, 2020 10:50
@heitorlessa heitorlessa merged commit 5ee14b1 into aws-powertools:develop Sep 18, 2020
@heitorlessa heitorlessa deleted the feat/validator-utility branch September 18, 2020 11:35

6 participants