flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> master #4346

matthewfala · 2021-11-18T20:29:18Z

Previously with unicode byte sequences such as

0xef 0xbf 0x00 ...

Fluent Bit would blindly trust the leading byte 0xef to describe
how many trailing bytes to copy, without validating them.
If a trailing byte was invalid, such as 0x00 (the null character),
the utility copied it into the escaped string unchecked.

This commit adds checks that leading and trailing bytes are valid UTF-8.
If a character is ill-formed, each of its bytes is individually mapped to
the private use area [U+E000 to U+E0FF], preserving the ill-formed data
in a compact, safe, UTF-8-friendly format.

Signed-off-by: Matthew Fala [email protected]
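
To make the private use area mapping in the commit message concrete, here is a minimal sketch of how a single ill-formed byte could be encoded as U+E000 plus the byte value, and how the original byte could be recovered later. The names `pua_encode` and `pua_decode` are hypothetical and are not taken from the Fluent Bit source; the actual patch may structure this differently.

```c
#include <stddef.h>

/* Hypothetical helper (not a Fluent Bit function): encode the code point
 * U+E000 + byte as a 3-byte UTF-8 sequence. Returns the bytes written. */
static size_t pua_encode(unsigned char byte, unsigned char out[3])
{
    unsigned int cp = 0xE000u + byte;                      /* U+E000..U+E0FF */

    out[0] = (unsigned char) (0xE0 | (cp >> 12));          /* 1110xxxx */
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx */
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));         /* 10xxxxxx */
    return 3;
}

/* Hypothetical inverse: returns 1 and stores the original byte when seq
 * encodes a code point in U+E000..U+E0FF, otherwise returns 0. */
static int pua_decode(const unsigned char seq[3], unsigned char *byte)
{
    unsigned int cp;

    if (seq[0] != 0xEE || (seq[1] & 0xC0) != 0x80 || (seq[2] & 0xC0) != 0x80) {
        return 0;
    }
    cp = 0xE000u | ((unsigned int) (seq[1] & 0x3F) << 6) | (seq[2] & 0x3F);
    if (cp > 0xE0FFu) {
        return 0;
    }
    *byte = (unsigned char) (cp - 0xE000u);
    return 1;
}
```

Because every byte value 0x00 to 0xff gets its own code point, the mapping is reversible, which is what "preserving ill-formed character data" relies on in this sketch.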

The final version was tested end to end as follows:

The following code is added just before flb_utils_write_str is called in the cloudwatch plugin. It overwrites the leading byte of the first character with a null byte (the byte at offset 1, because the byte at offset 0 in our test is a double quote).

        // Corrupt the log's utf-8: offset 0 is the opening double quote,
        // so offset 1 is the leading byte of the first character.
        unsigned char* c2 = tmp_buf_ptr + 1;
        *c2 = 0;

Send a log starting with the smiley face emoji via the HTTP input plugin:

[
    "😀this is a small regular \u0000 log."
]

Check cloudwatch to confirm that the log made it through (though corrupted):

"�this is a small regular \u0000 log."

The code successfully sanitizes the corrupt unicode and maps the invalid smiley face bytes to the safe utf-8 private use area.

Data flow

😀 = [0xf0 0x9f 0x98 0x80] -> [0x00 0x9f 0x98 0x80] -> � = [ \u0000, 0xee 0x82 0x9f, 0xee 0x82 0x98, 0xee 0x82 0x80] -> cloudwatch api
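
The data flow above can be reproduced with a small sketch of a sanitizing pass. This is not the code from the patch: `utf8_expected_len` and `sanitize_utf8` are hypothetical names, JSON escaping (which turns 0x00 into `\u0000`) is left out, and the check covers only the leading byte range and the `10xxxxxx` pattern of trailing bytes, not overlong or surrogate encodings. It also assumes that when a sequence is ill-formed, only the leading byte is mapped and scanning resumes at the next byte.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a leading byte,
 * or 0 if the byte cannot start a UTF-8 sequence. */
static size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;  /* stray trailing byte or invalid leading byte */
}

/* Hypothetical sanitizer: copies well-formed sequences unchanged and maps
 * each ill-formed byte to U+E000 + byte (3 UTF-8 bytes).
 * out must have room for at least 3 * len bytes. Returns bytes written. */
static size_t sanitize_utf8(const unsigned char *in, size_t len,
                            unsigned char *out)
{
    size_t i = 0;
    size_t o = 0;

    while (i < len) {
        size_t n = utf8_expected_len(in[i]);
        size_t k;
        int ok = (n > 0 && i + n <= len);

        /* Every trailing byte must look like 10xxxxxx. */
        for (k = 1; ok && k < n; k++) {
            ok = ((in[i + k] & 0xC0) == 0x80);
        }

        if (ok) {
            for (k = 0; k < n; k++) {
                out[o++] = in[i + k];
            }
            i += n;
        }
        else {
            unsigned int cp = 0xE000u + in[i];
            out[o++] = (unsigned char) (0xE0 | (cp >> 12));
            out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
            i += 1;
        }
    }
    return o;
}
```

Under these assumptions, the corrupted input [0x00 0x9f 0x98 0x80] yields [0x00, 0xee 0x82 0x9f, 0xee 0x82 0x98, 0xee 0x82 0x80]: the null byte is valid ASCII and is copied, while the three orphaned trailing bytes cannot start a sequence and are each mapped to the private use area.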

For discussion on this commit, please see PR: #4297

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

matthewfala · 2021-11-18T20:34:26Z

Please see PR #4297 for inclusion of this commit into 1.8.

leonardo-albertovich

This is the master version of a patch that was already verified. It looks good to me, so as long as all tests pass it's got my approval.

edsiper · 2021-11-29T22:40:01Z

thanks!

note: please prefix commits only with utils: .... (instead of the function being modified, just the interface name without the flb_ prefix)

matthewfala mentioned this pull request Nov 18, 2021

flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 #4297

Merged

2 tasks

github-actions bot added the docs-required label Nov 18, 2021

leonardo-albertovich approved these changes Nov 19, 2021

edsiper merged commit 861af37 into fluent:master Nov 29, 2021
