flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> master #4346

Merged

@matthewfala matthewfala commented Nov 18, 2021

Previously with unicode byte sequences such as

0xef 0xbf 0x00 ...

Fluent Bit would trust the leading byte 0xef to describe how many
trailing bytes to copy, without validating those bytes.
If a trailing byte was invalid, such as 0x00 (the null character),
the utility copied it to the escaped string unchanged.
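
To make the failure mode concrete, here is a minimal sketch of the kind of structural check described above. The helper names (`utf8_expected_len`, `utf8_is_continuation`, `utf8_checked_len`) are hypothetical and are not taken from the Fluent Bit source; the sketch only checks bit patterns and does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many bytes a UTF-8 leading byte announces.
 * Returns 0 if the byte cannot start a sequence. */
static int utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;              /* 0xxxxxxx: ASCII          */
    if ((lead & 0xE0) == 0xC0) return 2;    /* 110xxxxx                 */
    if ((lead & 0xF0) == 0xE0) return 3;    /* 1110xxxx                 */
    if ((lead & 0xF8) == 0xF0) return 4;    /* 11110xxx                 */
    return 0;                               /* 10xxxxxx or 11111xxx     */
}

/* A trailing (continuation) byte must look like 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Returns the sequence length if the bytes at s form a structurally
 * valid sequence within avail bytes, otherwise 0. */
static int utf8_checked_len(const unsigned char *s, size_t avail)
{
    int len, i;

    assert(s != NULL);
    if (avail == 0) return 0;

    len = utf8_expected_len(s[0]);
    if (len == 0 || (size_t)len > avail) return 0;

    /* The step the old code skipped: verify every trailing byte. */
    for (i = 1; i < len; i++) {
        if (!utf8_is_continuation(s[i])) return 0;
    }
    return len;
}
```

With the sequence from the commit message, `utf8_expected_len(0xef)` announces three bytes, but `utf8_checked_len` rejects `0xef 0xbf 0x00` because `0x00` is not a continuation byte.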

This commit adds UTF-8 compliance checks for leading and trailing bytes.
If a check fails, each byte of the ill-formed character is mapped to the
private use area [U+E000 to U+E0FF], preserving the ill-formed character
data in a compact, safe, UTF-8-friendly format.

Signed-off-by: Matthew Fala [email protected]
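
As an illustration of the private use area mapping, the sketch below encodes one raw byte `b` as the code point U+E000 + `b` in three UTF-8 bytes, and decodes it back. The function names `pua_encode_byte` and `pua_decode_byte` are hypothetical; the merged patch may structure this differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode raw byte b as U+E000 + b (U+E000..U+E0FF)
 * using the 3-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx. */
static size_t pua_encode_byte(unsigned char b, unsigned char out[3])
{
    unsigned int cp = 0xE000u + b;

    assert(out != NULL);
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical inverse: recover the raw byte from a 3-byte sequence.
 * Returns 1 on success, 0 if the input is not in U+E000..U+E0FF. */
static int pua_decode_byte(const unsigned char in[3], unsigned char *b)
{
    unsigned int cp;

    assert(in != NULL && b != NULL);
    if ((in[0] & 0xF0) != 0xE0 ||
        (in[1] & 0xC0) != 0x80 ||
        (in[2] & 0xC0) != 0x80) {
        return 0;
    }
    cp = ((in[0] & 0x0Fu) << 12) | ((in[1] & 0x3Fu) << 6) | (in[2] & 0x3Fu);
    if (cp < 0xE000u || cp > 0xE0FFu) return 0;

    *b = (unsigned char)(cp - 0xE000u);
    return 1;
}
```

Because every byte value 0x00 to 0xFF has its own code point in the range, the mapping is reversible, which is what lets the original ill-formed data be recovered downstream if needed.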

The final version was tested end to end as follows:

This code is added just before flb_utils_write_str is called in the cloudwatch plugin. It overwrites the second byte of the buffer with a null byte. The first byte in our test is a double quote, so the second byte is the leading byte of the first character in the log.

        // Corrupt the log's UTF-8: tmp_buf_ptr[0] is the opening double quote,
        // so tmp_buf_ptr[1] is the leading byte of the first character.
        unsigned char* c2 = tmp_buf_ptr + 1;
        *c2 = 0;

Send a log starting with the smiley face emoji via the HTTP input plugin:

[
    "😀this is a small regular \u0000 log."
]

Check cloudwatch to see whether the logs have made their way through (though corrupted):

"�this is a small regular \u0000 log."

The code successfully sanitizes the corrupt unicode and maps the invalid smiley face bytes to a safe private use area utf-8 region.

Data flow

😀 = [0xf0 0x9f 0x98 0x80] -> [0x00 0x9f 0x98 0x80] -> � = [ \u0000, 0xee 0x82 0x9f, 0xee 0x82 0x98, 0xee 0x82 0x80] -> cloudwatch api
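
The data flow above can be reproduced with a small standalone sketch. This is not the merged Fluent Bit code: `seq_len` and `sanitize` are hypothetical names, the escaping is reduced to control bytes only (an assumption made here for brevity), and ill-formed bytes are handled one at a time rather than per character, which may differ from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: length of a structurally valid multi-byte sequence, else 0. */
static int seq_len(const unsigned char *s, size_t avail)
{
    int len, i;

    if ((s[0] & 0xE0) == 0xC0) len = 2;
    else if ((s[0] & 0xF0) == 0xE0) len = 3;
    else if ((s[0] & 0xF8) == 0xF0) len = 4;
    else return 0;                      /* stray continuation or invalid lead */

    if ((size_t)len > avail) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
    }
    return len;
}

/* Hypothetical sanitizer. Returns bytes written, or 0 if out is too small
 * (or the input is empty). */
static size_t sanitize(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < in_len) {
        unsigned char b = in[i];
        int len;

        /* Worst case per step: six characters plus snprintf's terminator. */
        if (out_cap - o < 7) return 0;

        if (b < 0x20) {
            /* Control byte: emit a \uXXXX escape (simplified). */
            o += (size_t)snprintf((char *)out + o, 7, "\\u%04x", (unsigned int)b);
            i++;
        } else if (b < 0x80) {
            out[o++] = b;               /* printable ASCII, copied as-is */
            i++;
        } else if ((len = seq_len(in + i, in_len - i)) > 0) {
            memcpy(out + o, in + i, (size_t)len);   /* well-formed sequence */
            o += (size_t)len;
            i += (size_t)len;
        } else {
            /* Ill-formed byte: map to U+E000 + b, encoded as 3 UTF-8 bytes. */
            unsigned int cp = 0xE000u + b;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
            i++;
        }
    }
    return o;
}
```

Feeding the corrupted bytes `0x00 0x9f 0x98 0x80` through `sanitize` yields the text `\u0000` followed by `0xee 0x82 0x9f`, `0xee 0x82 0x98`, `0xee 0x82 0x80`, matching the flow shown above, while the intact emoji `0xf0 0x9f 0x98 0x80` passes through unchanged.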

For discussion on this commit, please see PR: #4297


Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@matthewfala

Please see PR #4297 for inclusion of this commit into 1.8.

@leonardo-albertovich leonardo-albertovich left a comment

This is the master version of a patch that was already verified. It looks good to me, so as long as all tests pass it's got my approval.

@edsiper edsiper merged commit 861af37 into fluent:master Nov 29, 2021

edsiper commented Nov 29, 2021

thanks!

note: please prefix commits only with utils: .... (instead of the function being modified, just the interface name without the flb_ prefix)

0Delta pushed a commit to 0Delta/fluent-bit that referenced this pull request Jan 20, 2022