flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> master #4346
Previously, given a byte sequence such as
0xef 0xbf 0x00 ...
Fluent Bit would blindly trust the leading byte 0xef to describe
how many trailing bytes to copy.
If a trailing byte was invalid, such as 0x00 (the null character),
the utility still copied it into the escaped string.
This commit adds checks for UTF-8 compliance of both leading and trailing bytes.
If a character is ill-formed, its bytes are individually mapped to the
private use area [U+E000 to U+E0FF], preserving the ill-formed character data
in a compact, safe, UTF-8 friendly format.
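
The mapping described above can be illustrated with a small standalone sketch. This is not the code from this commit: `utf8_announced_len`, `encode_pua`, and `sanitize_utf8` are hypothetical names, and the sketch assumes that every byte the leading byte announces is remapped when the character is ill-formed. It only checks that trailing bytes have the form 10xxxxxx and does not reject overlong or surrogate encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sequence length announced by a leading byte,
 * or 0 if the byte cannot start a UTF-8 sequence. */
static size_t utf8_announced_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0; /* stray continuation byte or invalid leader */
}

/* Encode code point U+E000 + b as a 3-byte UTF-8 sequence. */
static size_t encode_pua(unsigned char b, unsigned char *out)
{
    uint32_t cp = 0xE000u + b;
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Copy well-formed sequences unchanged; remap each byte of an
 * ill-formed character to U+E000..U+E0FF. Returns bytes written.
 * Assumption: the caller provides out_cap >= 3 * in_len. */
size_t sanitize_utf8(const unsigned char *in, size_t in_len,
                     unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    assert(out_cap >= 3 * in_len);
    while (i < in_len) {
        size_t len = utf8_announced_len(in[i]);
        int ok = (len > 0 && len <= in_len - i);

        /* Every trailing byte must look like 10xxxxxx. */
        for (size_t k = 1; ok && k < len; k++) {
            ok = ((in[i + k] & 0xC0) == 0x80);
        }
        if (ok) {
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
            continue;
        }

        /* Ill-formed: remap the announced bytes, clamped to the input. */
        size_t bad = (len == 0) ? 1 : len;
        if (bad > in_len - i) bad = in_len - i;
        for (size_t k = 0; k < bad; k++) {
            o += encode_pua(in[i + k], out + o);
        }
        i += bad;
    }
    return o;
}
```

Under these assumptions, the sequence 0xef 0xbf 0x00 becomes U+E0EF, U+E0BF, U+E000 (nine bytes of well-formed UTF-8), so the original byte values can still be recovered from the low byte of each code point.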
Signed-off-by: Matthew Fala [email protected]
The final version was tested end to end as follows:
Test code is added just before flb_utils_write_str is called in the cloudwatch plugin. It adds a null character at the start of the first character (strictly the second character, because the first character in our test is a double quote).
Send a log starting with the smiley face emoji via the HTTP input plugin.
Check cloudwatch to see whether the logs made it through (though corrupted).
The code successfully sanitizes the corrupt unicode and maps the invalid smiley face bytes to the safe private use area UTF-8 region.
Data flow
For discussion on this commit, please see PR: #4297
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.