
[Serverless] Fix flaky integration tests and make them more easily maintainable. #14783

Merged
merged 31 commits on Dec 20, 2022
31 commits
cb84b9d
Move log normalization to python script.
purple4reina Dec 9, 2022
68be51b
Write normalizers for proper parsing.
purple4reina Dec 15, 2022
b3aeac5
Sort logs by message string.
purple4reina Dec 15, 2022
93eb732
Update log-csharp.
purple4reina Dec 15, 2022
9d93d31
Update log-go.
purple4reina Dec 15, 2022
b13427d
Update log-java.
purple4reina Dec 15, 2022
f9e6fe4
Update log-python.
purple4reina Dec 15, 2022
9337b15
Update trace-go.
purple4reina Dec 15, 2022
8d135a0
Update trace-java.
purple4reina Dec 15, 2022
31a97c0
Update trace-python.
purple4reina Dec 15, 2022
1143474
Capture errors and print.
purple4reina Dec 16, 2022
d36d54d
Reenable some tests.
purple4reina Dec 16, 2022
a480832
Update rest of snapshots, so no tests need to be skipped.
purple4reina Dec 16, 2022
a646b27
Always return a json parsable string.
purple4reina Dec 16, 2022
c1f7cff
Improved error handling.
purple4reina Dec 16, 2022
75aeef7
Update account number.
purple4reina Dec 16, 2022
79af904
Remove sed backup file.
purple4reina Dec 16, 2022
f136b32
Changing XXXXXX to STAGE for easier merge.
purple4reina Dec 16, 2022
1552796
Merge branch 'serverless/main' of github.com:DataDog/datadog-agent in…
purple4reina Dec 16, 2022
60418f2
Change back to XXXXXX.
purple4reina Dec 16, 2022
b4c4f02
Remove extra stuff.
purple4reina Dec 16, 2022
66afe4a
Fix flake8 issues.
purple4reina Dec 17, 2022
68a7936
Make linter happy.
purple4reina Dec 17, 2022
98b8d5c
Re-sort metrics.
purple4reina Dec 17, 2022
637a814
Sort metrics by name, then cold start.
purple4reina Dec 17, 2022
99a8b01
Update sort of metrics.
purple4reina Dec 17, 2022
d3ae4e8
Keep errors consistent to avoid flaky failure tests.
purple4reina Dec 17, 2022
b45662b
Dict instead of string.
purple4reina Dec 17, 2022
1f3ac48
Change error message.
purple4reina Dec 19, 2022
9fa309d
Remove unused var.
purple4reina Dec 19, 2022
91e6c05
Linting error.
purple4reina Dec 19, 2022
157 changes: 157 additions & 0 deletions test/integration/serverless/log_normalize.py
@@ -0,0 +1,157 @@
import argparse
import json
import re


def normalize_metrics(stage):
return [
replace(r'raise Exception', r'\n'),
require(r'BEGINMETRIC.*ENDMETRIC'),
exclude(r'BEGINMETRIC'),
exclude(r'ENDMETRIC'),
replace(r'(ts":)[0-9]{10}', r'\1XXX'),
replace(r'(min":)[0-9\.e\-]{1,30}', r'\1XXX'),
replace(r'(max":)[0-9\.e\-]{1,30}', r'\1XXX'),
replace(r'(cnt":)[0-9\.e\-]{1,30}', r'\1XXX'),
replace(r'(avg":)[0-9\.e\-]{1,30}', r'\1XXX'),
replace(r'(sum":)[0-9\.e\-]{1,30}', r'\1XXX'),
replace(r'(k":\[)[0-9\.e\-]{1,30}', r'\1XXX'),
replace(r'(datadog-nodev)[0-9]+\.[0-9]+\.[0-9]+', r'\1X.X.X'),
replace(r'(datadog_lambda:v)[0-9]+\.[0-9]+\.[0-9]+', r'\1X.X.X'),
replace(r'dd_lambda_layer:datadog-go[0-9.]{1,}', r'dd_lambda_layer:datadog-gox.x.x'),
replace(r'(dd_lambda_layer:datadog-python)[0-9_]+\.[0-9]+\.[0-9]+', r'\1X.X.X'),
replace(r'(serverless.lambda-extension.integration-test.count)[0-9\.]+', r'\1'),
replace(r'(architecture:)(x86_64|arm64)', r'\1XXX'),
replace(stage, 'XXXXXX'),
exclude(r'[ ]$'),
sort_by(lambda log: (log["metric"], "cold_start:true" in log["tags"])),
]
Contributor Author comment:
For the three different telemetry types, you'll see I've created a list of "normalizers". Each normalizer makes a single small change to the log data that is fed in. This makes the normalizer lists easy for us to edit, and we can add more complicated normalizers in the future.
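A minimal sketch of how such a normalizer list is applied: each normalizer is a `str -> str` function, and the log text is threaded through them in order (mirroring the `normalize()` function defined later in the file; the two sample normalizers here are made-up illustrations).

```python
# Each normalizer is a str -> str function; the pipeline threads the log
# text through them one at a time.
normalizers = [
    lambda log: log.replace("1670000000", "XXX"),  # hypothetical timestamp scrub
    lambda log: log.rstrip(),                      # hypothetical trailing-space strip
]


def run_pipeline(log, normalizers):
    # Apply each normalizer in sequence, feeding the output of one
    # into the next.
    for normalizer in normalizers:
        log = normalizer(log)
    return log


print(run_pipeline('{"ts":1670000000} ', normalizers))  # → {"ts":XXX}
```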



def normalize_logs(stage):
return [
require(r'BEGINLOG.*ENDLOG'),
exclude(r'BEGINLOG'),
exclude(r'ENDLOG'),
replace(r'("timestamp":\s*?)\d{13}', r'\1"XXX"'),
replace(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}', 'TIMESTAMP'),
replace(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z', 'TIMESTAMP'),
replace(r'\d{4}\/\d{2}\/\d{2}\s\d{2}:\d{2}:\d{2}', 'TIMESTAMP'),
replace(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}', 'TIMESTAMP'),
replace(r'([a-zA-Z0-9]{8}-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{4}-[a-zA-Z0-9]{12})', r'XXX'),
replace(stage, 'XXXXXX'),
replace(r'(architecture:)(x86_64|arm64)', r'\1XXX'),
sort_by(lambda log: log["message"]["message"]),
# ignore a Lambda error that occurs sporadically for log-csharp see here for more info:
# https://repost.aws/questions/QUq2OfIFUNTCyCKsChfJLr5w/lambda-function-working-locally-but-crashing-on-aws
# TODO
# perl -n -e "print unless /LAMBDA_RUNTIME Failed to get next invocation. No Response from endpoint/ or \
# /An error occurred while attempting to execute your code.: LambdaException/ or \
# /terminate called after throwing an instance of 'std::logic_error'/ or \
# /basic_string::_M_construct null not valid/" |
]
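The commented-out perl filter above could itself be expressed as a normalizer in this framework. A sketch of what that might look like (this `drop_lines` helper is hypothetical, not part of the actual script; the regex fragment is taken from the perl command above):

```python
import re


def drop_lines(pattern):
    """Return a normalizer that removes whole lines matching the pattern.

    Hypothetical sketch of how the commented-out perl filter above could
    become a normalizer.
    """
    comp = re.compile(pattern)

    def _drop(log):
        # Keep only lines that do NOT match the pattern.
        return '\n'.join(
            line for line in log.splitlines() if not comp.search(line)
        )

    return _drop


# Usage: filter out the sporadic Lambda runtime crash line.
flaky = drop_lines(r'LAMBDA_RUNTIME Failed to get next invocation')
print(flaky(
    'ok line\n'
    'LAMBDA_RUNTIME Failed to get next invocation. No Response from endpoint\n'
    'another ok line'
))  # → ok line\nanother ok line
```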


def normalize_traces(stage):
return [
require(r'BEGINTRACE.*ENDTRACE'),
exclude(r'BEGINTRACE'),
exclude(r'ENDTRACE'),
replace(r'(ts":)[0-9]{10}', r'\1XXX'),
replace(r'((startTime|endTime|traceID|trace_id|span_id|parent_id|start|system.pid)":)[0-9]+', r'\1null'),
replace(r'((tracer_version|language_version)":)["a-zA-Z0-9~\-\.\_]+', r'\1null'),
replace(r'(duration":)[0-9]+', r'\1null'),
replace(r'((datadog_lambda|dd_trace)":")[0-9]+\.[0-9]+\.[0-9]+', r'\1X.X.X'),
replace(r'(,"request_id":")[a-zA-Z0-9\-,]+"', r'\1null"'),
replace(r'(,"runtime-id":")[a-zA-Z0-9\-,]+"', r'\1null"'),
replace(r'(,"system.pid":")[a-zA-Z0-9\-,]+"', r'\1null"'),
replace(r'("_dd.no_p_sr":)[0-9\.]+', r'\1null'),
replace(r'("architecture":)"(x86_64|arm64)"', r'\1"XXX"'),
replace(r'("process_id":)[0-9]+', r'\1null'),
replace(stage, 'XXXXXX'),
exclude(r'[ ]$'),
]


#####################
# BEGIN NORMALIZERS #
#####################
Contributor Author comment:
Each normalizer is a function that returns a function. The returned function takes a single argument, a log string, and returns a mutated log string.
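For example, composing two of these normalizers (the `replace` and `exclude` definitions are copied from the file below; the sample log line is made up):

```python
import re


def replace(pattern, repl):
    # Copied from log_normalize.py: returns a str -> str normalizer
    # that substitutes every match of pattern with repl.
    comp = re.compile(pattern, flags=re.DOTALL)

    def _replace(log):
        return comp.sub(repl, log)

    return _replace


def exclude(pattern):
    # Excluding is just replacing with the empty string.
    return replace(pattern, '')


scrub_ts = replace(r'(ts":)[0-9]{10}', r'\1XXX')
strip_marker = exclude(r'BEGINMETRIC')

log = 'BEGINMETRIC{"ts":1670000000}'
print(strip_marker(scrub_ts(log)))  # → {"ts":XXX}
```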



def replace(pattern, repl):
comp = re.compile(pattern, flags=re.DOTALL)

def _replace(log):
return comp.sub(repl, log)

return _replace


def exclude(pattern):
return replace(pattern, '')


def require(pattern):
comp = re.compile(pattern, flags=re.DOTALL)

def _require(log):
match = comp.search(log)
if not match:
return ''
return match.group(0)

return _require


def sort_by(key):
def _sort(log):
log_json = json.loads(log, strict=False)
log_sorted = sorted(log_json, key=key)
return json.dumps(log_sorted)

return _sort
Contributor Author comment:
This normalizer is where the magic is.

You'll remember that most of the failures we see in our integration tests are just logs coming in out of order. To address this, I've created a sort normalizer which sorts the logs by their messages instead of their timestamps.

I believe this change in sort order is safe. The backend will always sort logs into their correct order based on their timestamps, so we do not need to worry about sort order here.
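A small demonstration of the `sort_by` normalizer above on out-of-order logs (the definition is copied from the file; the two-entry log array is a made-up example):

```python
import json


def sort_by(key):
    # Copied from log_normalize.py: parse the JSON array, sort its
    # elements by the given key function, and re-serialize.
    def _sort(log):
        log_json = json.loads(log, strict=False)
        log_sorted = sorted(log_json, key=key)
        return json.dumps(log_sorted)

    return _sort


# Hypothetical out-of-order logs, sorted by message instead of arrival order.
logs = '[{"message": "b"}, {"message": "a"}]'
by_message = sort_by(lambda log: log["message"])
print(by_message(logs))  # → [{"message": "a"}, {"message": "b"}]
```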



###################
# END NORMALIZERS #
###################


def normalize(log, typ, stage):
for normalizer in get_normalizers(typ, stage):
log = normalizer(log)
return format_json(log)


def get_normalizers(typ, stage):
if typ == 'metrics':
return normalize_metrics(stage)
elif typ == 'logs':
return normalize_logs(stage)
elif typ == 'traces':
return normalize_traces(stage)
else:
raise ValueError(f'invalid type "{typ}"')


def format_json(log):
return json.dumps(json.loads(log, strict=False), indent=2)


def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--type', required=True)
parser.add_argument('--logs', required=True)
parser.add_argument('--stage', required=True)
return parser.parse_args()


if __name__ == '__main__':
try:
args = parse_args()
print(normalize(args.logs, args.type, args.stage))
except Exception:
err = {"error": "normalization raised exception"}
err_json = json.dumps(err, indent=2)
print(err_json)
exit(1)
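The `except` branch above guarantees the script always prints JSON-parsable output, even when normalization fails (see the "Always return a json parsable string" commit). A sketch of that contract, using a stand-in `safe_normalize` helper that is not part of the actual script:

```python
import json


def safe_normalize(raw):
    """Sketch of the error contract above: always return a JSON-parsable
    string, even when normalization blows up.

    Illustrative helper; the real script calls normalize() here.
    """
    try:
        # Stand-in for normalize(): fails on anything that isn't JSON.
        return json.dumps(json.loads(raw, strict=False), indent=2)
    except Exception:
        return json.dumps({"error": "normalization raised exception"}, indent=2)


# Both paths yield valid JSON for the shell caller to diff:
print(json.loads(safe_normalize('{"ok": true}')))
print(json.loads(safe_normalize('not json')))
```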
17 changes: 0 additions & 17 deletions test/integration/serverless/parse-json.js

This file was deleted.

102 changes: 8 additions & 94 deletions test/integration/serverless/run.sh
@@ -154,20 +154,7 @@ all_functions=("${metric_functions[@]}" "${log_functions[@]}" "${trace_functions[@]}"

# Add a function to this list to skip checking its results
# This should only be used temporarily while we investigate and fix the test
functions_to_skip=(
# Tagging behavior after a timeout is currently known to be flaky
"timeout-node"
"timeout-python"
"timeout-java"
"timeout-go"
"timeout-csharp"
"timeout-proxy"
"trace-csharp" # Will be reactivated when the new dotnet layer will be released
"trace-proxy" # Will be reactivated when sampling with proxy will be implemented
"error-proxy"
"log-proxy"
"metric-proxy"
)
functions_to_skip=()

echo "Invoking functions for the first time..."
set +e # Don't exit this script if an invocation fails or there's a diff
@@ -215,95 +202,22 @@ for function_name in "${all_functions[@]}"; do

# Replace invocation-specific data like timestamps and IDs with XXX to normalize across executions
if [[ " ${metric_functions[*]} " =~ " ${function_name} " ]]; then
# Normalize metrics
logs=$(
echo "$raw_logs" |
perl -p -e "s/raise Exception/\n/g" |
grep -v "BEGINLOG.*" |
grep -v "BEGINTRACE.*" |
grep "BEGINMETRIC.*" |
perl -p -e "s/BEGINMETRIC/\1/g" |
perl -p -e "s/ENDMETRIC/\1/g" |
perl -p -e "s/(ts\":)[0-9]{10}/\1XXX/g" |
perl -p -e "s/(min\":)[0-9\.e\-]{1,30}/\1XXX/g" |
perl -p -e "s/(max\":)[0-9\.e\-]{1,30}/\1XXX/g" |
perl -p -e "s/(cnt\":)[0-9\.e\-]{1,30}/\1XXX/g" |
perl -p -e "s/(avg\":)[0-9\.e\-]{1,30}/\1XXX/g" |
perl -p -e "s/(sum\":)[0-9\.e\-]{1,30}/\1XXX/g" |
perl -p -e "s/(k\":\[)[0-9\.e\-]{1,30}/\1XXX/g" |
perl -p -e "s/(datadog-nodev)[0-9]+\.[0-9]+\.[0-9]+/\1X\.X\.X/g" |
perl -p -e "s/(datadog_lambda:v)[0-9]+\.[0-9]+\.[0-9]+/\1X\.X\.X/g" |
perl -p -e "s/dd_lambda_layer:datadog-go[0-9.]{1,}/dd_lambda_layer:datadog-gox.x.x/g" |
perl -p -e "s/(dd_lambda_layer:datadog-python)[0-9_]+\.[0-9]+\.[0-9]+/\1X\.X\.X/g" |
perl -p -e "s/(serverless.lambda-extension.integration-test.count)[0-9\.]+/\1/g" |
perl -p -e "s/(architecture:)(x86_64|arm64)/\1XXX/g" |
perl -p -e "s/$stage/XXXXXX/g" |
perl -p -e "s/[ ]$//g" |
node parse-json.js
)
norm_type=metrics
elif [[ " ${log_functions[*]} " =~ " ${function_name} " ]]; then
# Normalize logs
logs=$(
echo "$raw_logs" |
grep "BEGINLOG" |
grep -v "BEGINMETRIC.*" |
grep -v "BEGINTRACE.*" |
perl -p -e "s/BEGINLOG/\1/g" |
perl -p -e "s/ENDLOG/\1/g" |
perl -p -e "s/(\"timestamp\": )\d{13}/\1\"XXX\"/g" |
perl -p -e "s/(\"timestamp\": )\d{13}/\1\"XXX\"/g" |
perl -p -e "s/\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}/TIMESTAMP/g" |
perl -p -e "s/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z/TIMESTAMP/g" |
perl -p -e "s/\d{4}\/\d{2}\/\d{2}\s\d{2}:\d{2}:\d{2}/TIMESTAMP/g" |
perl -p -e "s/\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}/TIMESTAMP/g" |
perl -p -e "s/\"timestamp\":\d{13},/\1/g" |
perl -p -e "s/([a-zA-Z0-9]{8}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{4}\-[a-zA-Z0-9]{12})/\0\XXX/g" |
perl -p -e "s/$stage/STAGE/g" |
perl -p -e "s/(architecture:)(x86_64|arm64)/\1XXX/g" |
# ignore a Lambda error that occurs sporadically for log-csharp
# see here for more info: https://repost.aws/questions/QUq2OfIFUNTCyCKsChfJLr5w/lambda-function-working-locally-but-crashing-on-aws
perl -n -e "print unless /LAMBDA_RUNTIME Failed to get next invocation. No Response from endpoint/ or \
/An error occurred while attempting to execute your code.: LambdaException/ or \
/terminate called after throwing an instance of 'std::logic_error'/ or \
/basic_string::_M_construct null not valid/" |
node parse-json.js
)
norm_type=logs
else
# Normalize traces
logs=$(
echo "$raw_logs" |
grep "BEGINTRACE" |
grep -v "BEGINMETRIC.*" |
grep -v "BEGINLOG.*" |
perl -p -e "s/BEGINTRACE/\1/g" |
perl -p -e "s/ENDTRACE/\1/g" |
perl -p -e "s/(ts\":)[0-9]{10}/\1XXX/g" |
perl -p -e "s/((startTime|endTime|traceID|trace_id|span_id|parent_id|start|system.pid)\":)[0-9]+/\1null/g" |
perl -p -e "s/((tracer_version|language_version)\":)[\"a-zA-Z0-9~\-\.\_]+/\1null/g" |
perl -p -e "s/(duration\":)[0-9]+/\1null/g" |
perl -p -e "s/((datadog_lambda|dd_trace)\":\")[0-9]+\.[0-9]+\.[0-9]+/\1X\.X\.X/g" |
perl -p -e "s/(,\"request_id\":\")[a-zA-Z0-9\-,]+\"/\1null\"/g" |
perl -p -e "s/(,\"runtime-id\":\")[a-zA-Z0-9\-,]+\"/\1null\"/g" |
perl -p -e "s/(,\"system.pid\":\")[a-zA-Z0-9\-,]+\"/\1null\"/g" |
perl -p -e "s/(\"_dd.no_p_sr\":)[0-9\.]+/\1null/g" |
perl -p -e "s/(\"architecture\":)\"(x86_64|arm64)\"/\1\"XXX\"/g" |
perl -p -e "s/(\"process_id\":)[0-9]+/\1null/g" |
perl -p -e "s/$stage/XXXXXX/g" |
perl -p -e "s/[ ]$//g" |
node parse-json.js
)
norm_type=traces
fi
logs=$(python3 log_normalize.py --type $norm_type --logs "$raw_logs" --stage $stage)

function_snapshot_path="./snapshots/${function_name}"

jsonLogs="$(echo $logs | node parse-json.js)"

if [ ! -f "$function_snapshot_path" ]; then
printf "${MAGENTA} CREATE ${END_COLOR} $function_name\n"
echo "$jsonLogs" >"$function_snapshot_path"
echo "$logs" >"$function_snapshot_path"
elif [ "$UPDATE_SNAPSHOTS" == "true" ]; then
printf "${MAGENTA} UPDATE ${END_COLOR} $function_name\n"
echo "$jsonLogs" > "$function_snapshot_path"
echo "$logs" > "$function_snapshot_path"
else
if [[ " ${functions_to_skip[*]} " =~ " ${function_name} " ]]; then
printf "${YELLOW} SKIP ${END_COLOR} $function_name\n"
@@ -314,7 +228,7 @@ for function_name in "${all_functions[@]}"; do
printf "${YELLOW} SKIP ${END_COLOR} $function_name, no .NET support on arm64\n"
continue
fi
diff_output=$(echo "$jsonLogs" | diff - "$function_snapshot_path")
diff_output=$(echo "$logs" | diff - "$function_snapshot_path")
if [ $? -eq 1 ]; then
failed_functions+=("$function_name")
