-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 #4297
flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 #4297
Conversation
I can make another pull request into master when changes are finalized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7b27898
to
d588bee
Compare
From what I could see there is nothing dangerous about this change, however, I'm not familiar enough with that piece of code so I wrote two small tests (mostly to see if it crashed) :
The first test returned a positive result (no crashes), however, with the second test I noticed something strange (that could be my mistake) which is that it didn't seem to detect and skip a wrong sequence. I got the test sequence It seems to me that the problem is not related to what you were trying to address with your patch but I thought I would mention it just in case it helps. These are the test cases I created : brute force test case Additionally, even though I understand the processing method implemented in the code you modified I'm not familiar with the previous block and I'm not fond of the function as a whole but that's just my opinion. To summarize : I didn't find anything wrong with your patch in terms of what you intended to do nor any security issues but I think I found a preexisting issue there but I want to be clear about my lack of familiarity and acknowledge that I could be wrong about it. Don't hesitate to contact me if there's anything I can do to help! |
Thank you @leonardo-albertovich. I worked on handling this in a fix today and added some test cases. Just one important question. Someone on the team brought up that logs might be in UTF16 or other formats. To confirm, Fluent Bit only expect to deal properly with UTF8/ascii as suggested by functions like I don't think that this code needs to handle the UTF16 or other text formats, right? |
I could be wrong but I'm fairly certain that fluent-bit only handles ASCII / UTF-8. |
Great. That would explain why functions like flb_utf8_len exist. The current version of this patch now includes the invalid sequence you brought up, and now uses a replacement character to mark invalid bytes (non-utf8) instead of getting rid of the bytes. However, I realized that there is another case which is not taken care of, where the leading byte does not begin with bits 11xx xxxx. I'll add tests and make another patch for this case hopefully today. |
0634391
to
184a49b
Compare
@leonardo-albertovich I also added some test cases which should be helpful in comparing input to output. Please let me know if you have any other concerns, comments or questions. |
src/flb_utils.c
Outdated
@@ -744,23 +746,38 @@ int flb_utils_write_str(char *buf, int *off, size_t size, | |||
|
|||
is_valid = FLB_TRUE; | |||
for (b = 0; b < hex_bytes; b++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick : would it be possible to use more expressive names such as utf_sequence_number or utf_sequence_length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I like utf_sequence_length, utf_sequence_number. Maybe i
could also be changed to buf_byte_index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/flb_utils.c
Outdated
|
||
if (is_valid) { | ||
encoded_to_buf(p, tmp, hex_bytes); | ||
p += hex_bytes; | ||
} | ||
i += (hex_bytes - 1); | ||
else { | ||
encoded_to_buf(p, flb_utils_replacement_character, 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
annoying improvement suggestion : wouldn't it be nice if instead of replacing the broken sequence we escaped those borked code units?
"\xe3\x81\x01abc"
would become "\\xe3\\x81\\x01abc"
instead of "�\\u0001abc"
which in a way would be more representative.
We could add the � to mark the broken sequence too if we document it, it just seems less lossy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider this but \xe3 is an invalid JSON sequence, and this function is already used in several places (msgpack and cloudwatch api) to sanitize text for JSON. [ \\xe3 could overlap with unescaped text and should also be avoided ]
I also considered:
- escaping \xe3\x81\x01abc to something like \u00e3\u0081\u0001
however I'm pretty sure that the codepoints such as \u0081 will overlap with ascii codepoints see: https://en.wikipedia.org/wiki/UTF-8#Encoding and thus should be avoided - just writing the hex e38101, but that would overlap with generic text
- leaving the invalid fragments, but that would potentially mean the string is not safe/escaped
- deleting the invalid fragments, which is fine, but adding the replacement character seems like an option that gives more information.
After meeting with some people on the team, we determined that the replacement character would most likely be the best way to deal with the invalid fragments. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that \x## could break JSON blocks we could use something like that ? sign as a delimiter and put the list of hex digits between them.
Something like this : � e3 81 01 �abc
What do you think of that? The idea would be not losing data that might help the person affected by the issue figure it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
� e3 81 01 �abc is actually valid Unescaped Unicode so a user could actually have a log with � e3 81 01 �abc. It would be impossible to distinguish this from the invalid fragment once the text is escaped. I think ambiguous data is worse than lost data.
If you want this solution, I can work on it. Just please confirm that you are considering that the output data is ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in the same sense a single � would be valid. Maybe then we could add a disclaimer such as � corrupted utf-8 sequence : e3 81 01 �abc
Now that's quite the footprint for an error but if that's possible then I'd rather go with that (or a shorter variation) than discarding data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. I'll work on a fix and post it to this pr hopefully by today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
tests/internal/utils.c
Outdated
/* Invalid unicode (one bad trailing byte) */ | ||
{ | ||
"\xe3\x81\x01""abc", 6, /* note that 0x01 is an invalid byte */ | ||
"�""\\u0001abc" /* replace invalid unicode */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I'm not sure if there's anything in the coding style guide about unicode in the source code but I personally would prefer is there was no unicode in the source code, not because of the latest trend but because I think it pollutes the files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this can be changed into escaped version of the replacement character.
I modified the test cases and code to match the format you prescribed. I'll make sure to squash my commits before the PR is merged. I just want to keep record of what was changed. I changed the variable names here: a006b7b After changing to the new variable names, some of the lines got a lot longer and might be harder to read now. Please let me know if you want me to change them back, because that wouldn't be a problem. |
It seems like there may be a factor here that we have not yet considered: When this function is called:
And elsewhere, it appears that caller is responsible for making a buffer that is large enough that it cannot be overrun. That is because this function has no input identifying output buffer size. The cloudwatch api chooses a buffer size of 6x the unescaped string because previously each byte could potentially be hex and get escaped from 1 byte to 6 via xx -> \u00xx (1 -> 6 characters) If we are going to accept code that transforms |
src/flb_utils.c
Outdated
#define FLB_UTILS_REPLACE_FRAGMENT_END_LEN 4 | ||
|
||
static char *flb_utils_replace_fragment_start | ||
= "\xEF\xBF\xBD"" corrupted utf-8 sequence : "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make this a single constant character string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean:
- adding the const keyword to this cstring declaration/definition?
That sounds good. - changing this to a Macro definition
If changed to a macro definition, we'll need to instantiate an actual string later before copying bytes over 1 by one: https://github.com/fluent/fluent-bit/pull/4297/files#diff-d7e2a365ad467f9c9eb8283ff56105fa3a08e7fb55690a61b7137c7704a95263R803-R804
This isn't a problem since I can use a static char inside the function (but it's just adding more code for style?).
src/flb_utils.c
Outdated
} | ||
else { | ||
encoded_to_buf(p, flb_utils_replacement_character, 3); | ||
p += 3; | ||
encoded_to_buf(p, flb_utils_replace_fragment_start, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you calculate the length of the whole string you're injecting and verify if there's enough space available before doing anything in this path? Unless I'm wrong there are 34 characters not including the hex encoded bytes and spaces between them so it's substantially higher than what's ensured before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this function does not take out buffer size as a parameter, and so there is no way to verify the expanded size will fit in the out buffer. Please see the comment here: #4297 (comment)
How about converting the invalid fragment: 0xff 0x81 to I hate making up our own convention, so I spent some time today looking for existing conventions that exist for dealing with invalid unicode char points. It seems like by convention, the replacement character was created for this very purpose of replacing invalid sequences. It may be that to prevent loss, we need to come up with our own convention, but it could get confusing to use the replacement character since it exists for the the purpose of my original PR (see the following link) http://www.unicode.org/L2/L2017/17344r2-clarify-utf8.pdf While there isn’t a set convention for what we want (printing out escaped hex) There actually is a range of unicode characters which are reserved for private interpretation. I suggest we use these to represent our escaped hex, more specifically U+E000 to U+E0FF which renders as to https://en.wikipedia.org/wiki/Private_Use_Areas This is compact, preserves data, and would probably not be found in normal log since these characters are found in the private use area (though like the replacement character there is no guarantee). It also matches the hex escape going from 1 invalid byte to 6 expanded bytes (If we decide to represent as an escaped code point \uE0XX or 3 bytes if we decide to represent as an actual unicode character) |
Hi Matt, that's much better than what we previously discussed, it'd take the overhead of the long sequence down immensely so please go ahead with it, thanks for going the extra mile, it really makes a difference! |
f097f40
to
6e0fe88
Compare
@leonardo-albertovich I think the test cases are complete now. I added out buffer overrun tests + code to handle. If you can think of any other test cases, please let me know. |
The code looks good to me, could you fix the DCO issue? then we'll be pretty much ready to close this up. |
Previously with unicode byte sequences such as 0xef 0xbf 0x00 ... Fluent Bit would blindly trust the first unicode byte 0xef to describe how many valid trailing unicode bytes to copy. If a trailing unicode byte is invalid, such as 0x00, the null character, the utility blindly copied this to the escaped string. This commit adds checks for leading and trailing byte utf-8 compliance. If invalid, the ill-formed character's bytes are individually mapped to private use area [U+E000 to U+E0FF] preserving ill-formed character data in a compact and safe utf-8 friendly format. Signed-off-by: Matthew Fala <[email protected]>
da01c17
to
8eb5c8c
Compare
I squashed the commits and signed off. |
This is |
Please see PR #4346 for inclusion of this commit into master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, as long as it passes the tests it's got my approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn’t leave a comment on the line because it was not changed in this commit, but the cast on line 659/670 looks a bit iffy
c = (uint32_t) str[i];
I think it should be c = (uint32_t) (unsigned char) str[i];
I’ll test and confirm later.
Did you test with invalid 2 byte sequences? I think you may see more issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c = (uint32_t) (unsigned char) str[i];
Please refer to response to question 2.
Unfortunately, I do not have the time or bandwidth to run tests. Since you have already decided on these changes, go ahead with your merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really have anything useful to add here except that its probably worthwhile to add something in the docs about how Fluent Bit handles invalid utf-8 for users.
And possibly (if time allows), may be some deeper details in the dev guide if you think that might help future devs (both folks modifying this code, and folks writing plugins that handle utf 8): https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md
thanks! note: please prefix commits only with utils: .... (instead of the function being modified, just the interface name without the flb_ prefix) |
Previously with unicode byte sequences such as
0xef 0xbf 0x00 ...
Fluent Bit would blindly trust the first unicode byte 0xef to describe
how many valid trailing unicode bytes to copy.
If a trailing unicode byte is invalid, such as 0x00, the null character,
the utility blindly copied this to the escaped string.
This commit adds checks for leading and trailing byte utf-8 compliance.
If invalid, the ill-formed character's bytes are individually mapped to
private use area [U+E000 to U+E0FF] preserving ill-formed character data
in a compact and safe utf-8 friendly format.
Signed-off-by: Matthew Fala [email protected]
Version 1 (See comments for documentation on revisions)
I faked some invalid unicode (not sure how it would get to this part of the cloudwatch plugin, but apparently it is)
Fluent bit fails just as customer fluent bit fails.
Add some new code to invalidate unicode when trailing characters do not adhere to the UTF-8 specs
New code:
Now fluent bit skips the invalid unicode bytes and the serialization exception no longer ouccurs
The problem with invalid unicode is this: Fluent bit would previously blindly copy over unicode bytes as specified by the unicode character, not paying attention to if those bytes are up to unicode specs. This usually isn't a problem, but if a null character somehow sneeks in (as shown above) the c string containing the cloudwatch payload gets terminated. This causes a serialization exception.
This also potentially affected other plugins such as the output file plugin.
Config file
Enter
[N/A]
in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
Documentation
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.