Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 #4297

Merged
merged 1 commit into from
Nov 29, 2021

Conversation

matthewfala
Copy link
Contributor

@matthewfala matthewfala commented Nov 9, 2021

Previously with unicode byte sequences such as

0xef 0xbf 0x00 ...

Fluent Bit would blindly trust the first unicode byte 0xef to describe
how many valid trailing unicode bytes to copy.
If a trailing unicode byte is invalid, such as 0x00, the null character,
the utility blindly copied this to the escaped string.

This commit adds checks for leading and trailing byte utf-8 compliance.
If invalid, the ill-formed character's bytes are individually mapped to
private use area [U+E000 to U+E0FF] preserving ill-formed character data
in a compact and safe utf-8 friendly format.

Signed-off-by: Matthew Fala [email protected]

Version 1 (See comments for documentation on revisions)

I faked some invalid unicode (not sure how it would get to this part of the cloudwatch plugin, but apparently it is)

            char* c1 = str + i+0;
            char* c2 = str + i+1;
            char* c3 = str + i+2;
            *c3 = 0;
            is_valid = FLB_TRUE;
            for (b = 0; b < hex_bytes; b++) {
                tmp[b] = str[i+b];
            }

Fluent bit fails just as customer fluent bit fails.

[2021/11/09 01:56:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Log Stream batch3-2_app.log already exists
[2021/11/09 01:56:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sending 1 events
[2021/11/09 01:56:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sending log events to log stream batch3-2_app.log
[2021/11/09 01:56:57] [debug] [http_client] not using http_proxy for header
[2021/11/09 01:56:57] [debug] [aws_credentials] Requesting credentials from the EC2 provider..
[2021/11/09 01:56:57] [debug] [http_client] server logs.us-west-2.amazonaws.com:443 will close connection #33
[2021/11/09 01:56:57] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=0, HTTP Status: 400
[2021/11/09 01:56:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents http status=400
[2021/11/09 01:56:58] [error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='SerializationException'
[2021/11/09 01:57:00] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/11/09 01:57:00] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/11/09 01:57:00] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events

Add some new code to invalidate unicode when trailing characters do not adhere to the UTF-8 specs

                /* Trailing characters must start with bit 1 */
                if (b > 0 && !(str[i+b] & 0x80)) {
                    /* Invalid unicode character. skip */
                    flb_warn("[pack] invalid UTF-8 bytes found, skipping bytes");
                    is_valid = FLB_FALSE;
                    break;
                }

New code:

            char* c1 = str + i+0;
            char* c2 = str + i+1;
            char* c3 = str + i+2;
            *c3 = 0;
            is_valid = FLB_TRUE;
            for (b = 0; b < hex_bytes; b++) {
                /* Trailing characters must start with bit 1 */
                if (b > 0 && !(str[i+b] & 0x80)) {
                    /* Invalid unicode character. skip */
                    flb_warn("[pack] invalid UTF-8 bytes found, skipping bytes");
                    is_valid = FLB_FALSE;
                    break;
                }

                tmp[b] = str[i+b];
            }

Now fluent bit skips the invalid unicode bytes and the serialization exception no longer ouccurs

[2021/11/09 02:01:20] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents http status=200
[2021/11/09 02:01:20] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sent events to batch3-2_app.log
[2021/11/09 02:01:20] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sent 1 events to CloudWatch
[2021/11/09 02:01:22] [ warn] [pack] invalid UTF-8 bytes found, skipping bytes
[2021/11/09 02:01:32] [debug] [out coro] cb_destroy coro_id=0

The problem with invalid unicode is this: Fluent bit would previously blindly copy over unicode bytes as specified by the unicode character, not paying attention to if those bytes are up to unicode specs. This usually isn't a problem, but if a null character somehow sneeks in (as shown above) the c string containing the cloudwatch payload gets terminated. This causes a serialization exception.

This also potentially affected other plugins such as the output file plugin.

Config file

[SERVICE]
     Grace 30
     Log_Level debug

# Provide entry point for logs
[INPUT]
     Name http
     host 0.0.0.0
     port 8888

[INPUT]
     Name        forward
     Listen      0.0.0.0
     Port        24224

[OUTPUT]
     Name stdout
     Match *
     
[OUTPUT]
     Name cloudwatch_logs
     Match *
     log_stream_prefix batch3-2_
     log_group_name xx
     auto_create_group true
     region us-west-2

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • [N/A] Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • [N/A] Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@matthewfala
Copy link
Contributor Author

I can make another pull request into master when changes are finalized.

src/flb_utils.c Outdated Show resolved Hide resolved
Copy link
Contributor

@PettitWesley PettitWesley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need @edsiper Or @nokute78 Or some other core maintainer to approve this before we should merge it.

@matthewfala matthewfala force-pushed the cloudwatch-serialization-exception branch from 7b27898 to d588bee Compare November 10, 2021 02:03
@leonardo-albertovich
Copy link
Collaborator

From what I could see there is nothing dangerous about this change, however, I'm not familiar enough with that piece of code so I wrote two small tests (mostly to see if it crashed) :

  1. Reads a random amount of bytes from /dev/urandom and passes it through flb_utils_write_str
  2. Passes an arbitrary UTF-8 sequence through flb_utils_write_str and writes the result to stdout

The first test returned a positive result (no crashes), however, with the second test I noticed something strange (that could be my mistake) which is that it didn't seem to detect and skip a wrong sequence.

I got the test sequence \xE6\xBC\xA2\xE5\xAD\x97 from this web tool, passed it through the system and it worked. Then I modified it to extend the first sequence by toggling one additional bit in the first byte \xF6\xBC\xA2\xE5\xAD\x97 which resulted in the web tool detecting it as a bad sequence but the test case not doing so.

It seems to me that the problem is not related to what you were trying to address with your patch but I thought I would mention it just in case it helps.

These are the test cases I created :

brute force test case
specific test case

Additionally, even though I understand the processing method implemented in the code you modified I'm not familiar with the previous block and I'm not fond of the function as a whole but that's just my opinion.

To summarize : I didn't find anything wrong with your patch in terms of what you intended to do nor any security issues but I think I found a preexisting issue there but I want to be clear about my lack of familiarity and acknowledge that I could be wrong about it.

Don't hesitate to contact me if there's anything I can do to help!

@matthewfala
Copy link
Contributor Author

Thank you @leonardo-albertovich. I worked on handling this in a fix today and added some test cases.
I'll push them tomorrow.

Just one important question. Someone on the team brought up that logs might be in UTF16 or other formats. To confirm, Fluent Bit only expect to deal properly with UTF8/ascii as suggested by functions like flb_utf8_len?

I don't think that this code needs to handle the UTF16 or other text formats, right?

@leonardo-albertovich
Copy link
Collaborator

Thank you @leonardo-albertovich. I worked on handling this in a fix today and added some test cases. I'll push them tomorrow.

Just one important question. Someone on the team brought up that logs might be in UTF16 or other formats. To confirm, Fluent Bit only expect to deal properly with UTF8/ascii as suggested by functions like flb_utf8_len?

I don't think that this code needs to handle the UTF16 or other text formats, right?

I could be wrong but I'm fairly certain that fluent-bit only handles ASCII / UTF-8.

@matthewfala
Copy link
Contributor Author

Great. That would explain why functions like flb_utf8_len exist. The current version of this patch now includes the invalid sequence you brought up, and now uses a replacement character to mark invalid bytes (non-utf8) instead of getting rid of the bytes.

However, I realized that there is another case which is not taken care of, where the leading byte does not begin with bits 11xx xxxx. I'll add tests and make another patch for this case hopefully today.

@matthewfala matthewfala force-pushed the cloudwatch-serialization-exception branch 4 times, most recently from 0634391 to 184a49b Compare November 12, 2021 02:08
@matthewfala
Copy link
Contributor Author

@leonardo-albertovich
Could you please take another look? I am ready for your re-review.
Since you last checked, I added handling of the invalid sequence type you described, as well as another.

I also added some test cases which should be helpful in comparing input to output.

Please let me know if you have any other concerns, comments or questions.

src/flb_utils.c Outdated
@@ -744,23 +746,38 @@ int flb_utils_write_str(char *buf, int *off, size_t size,

is_valid = FLB_TRUE;
for (b = 0; b < hex_bytes; b++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick : would it be possible to use more expressive names such as utf_sequence_number or utf_sequence_length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I like utf_sequence_length, utf_sequence_number. Maybe i could also be changed to buf_byte_index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

src/flb_utils.c Outdated

if (is_valid) {
encoded_to_buf(p, tmp, hex_bytes);
p += hex_bytes;
}
i += (hex_bytes - 1);
else {
encoded_to_buf(p, flb_utils_replacement_character, 3);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

annoying improvement suggestion : wouldn't it be nice if instead of replacing the broken sequence we escaped those borked code units?

"\xe3\x81\x01abc" would become "\\xe3\\x81\\x01abc" instead of "�\\u0001abc" which in a way would be more representative.

We could add the � to mark the broken sequence too if we document it, it just seems less lossy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider this but \xe3 is an invalid JSON sequence, and this function is already used in several places (msgpack and cloudwatch api) to sanitize text for JSON. [ \\xe3 could overlap with unescaped text and should also be avoided ]

I also considered:

  • escaping \xe3\x81\x01abc to something like \u00e3\u0081\u0001
    however I'm pretty sure that the codepoints such as \u0081 will overlap with ascii codepoints see: https://en.wikipedia.org/wiki/UTF-8#Encoding and thus should be avoided
  • just writing the hex e38101, but that would overlap with generic text
  • leaving the invalid fragments, but that would potentially mean the string is not safe/escaped
  • deleting the invalid fragments, which is fine, but adding the replacement character seems like an option that gives more information.

After meeting with some people on the team, we determined that the replacement character would most likely be the best way to deal with the invalid fragments. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that \x## could break JSON blocks we could use something like that ? sign as a delimiter and put the list of hex digits between them.

Something like this : � e3 81 01 �abc

What do you think of that? The idea would be not losing data that might help the person affected by the issue figure it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

� e3 81 01 �abc is actually valid Unescaped Unicode so a user could actually have a log with � e3 81 01 �abc. It would be impossible to distinguish this from the invalid fragment once the text is escaped. I think ambiguous data is worse than lost data.

If you want this solution, I can work on it. Just please confirm that you are considering that the output data is ambiguous.

Copy link
Collaborator

@leonardo-albertovich leonardo-albertovich Nov 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in the same sense a single � would be valid. Maybe then we could add a disclaimer such as � corrupted utf-8 sequence : e3 81 01 �abc

Now that's quite the footprint for an error but if that's possible then I'd rather go with that (or a shorter variation) than discarding data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. I'll work on a fix and post it to this pr hopefully by today.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

/* Invalid unicode (one bad trailing byte) */
{
"\xe3\x81\x01""abc", 6, /* note that 0x01 is an invalid byte */
"�""\\u0001abc" /* replace invalid unicode */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I'm not sure if there's anything in the coding style guide about unicode in the source code but I personally would prefer is there was no unicode in the source code, not because of the latest trend but because I think it pollutes the files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this can be changed into escaped version of the replacement character.

@matthewfala
Copy link
Contributor Author

matthewfala commented Nov 16, 2021

I modified the test cases and code to match the format you prescribed.

I'll make sure to squash my commits before the PR is merged. I just want to keep record of what was changed.

I changed the variable names here: a006b7b

After changing to the new variable names, some of the lines got a lot longer and might be harder to read now. Please let me know if you want me to change them back, because that wouldn't be a problem.

@matthewfala
Copy link
Contributor Author

It seems like there may be a factor here that we have not yet considered:

When this function is called:

* If all chars need to be hex encoded (impossible), 6x space would be

And elsewhere, it appears that caller is responsible for making a buffer that is large enough that it cannot be overrun. That is because this function has no input identifying output buffer size.

The cloudwatch api chooses a buffer size of 6x the unescaped string because previously each byte could potentially be hex and get escaped from 1 byte to 6 via xx -> \u00xx (1 -> 6 characters)

If we are going to accept code that transforms xx to � xx � or � corrupted utf-8 sequence : xx �
we are changing this inflation from 6 to 10 or ~40 respectively. This may not be acceptable. What do you think @leonardo-albertovich

src/flb_utils.c Outdated
#define FLB_UTILS_REPLACE_FRAGMENT_END_LEN 4

static char *flb_utils_replace_fragment_start
= "\xEF\xBF\xBD"" corrupted utf-8 sequence : ";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this a single constant character string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean:

  1. adding the const keyword to this cstring declaration/definition?
    That sounds good.
  2. changing this to a Macro definition
    If changed to a macro definition, we'll need to instantiate an actual string later before copying bytes over 1 by one: https://github.com/fluent/fluent-bit/pull/4297/files#diff-d7e2a365ad467f9c9eb8283ff56105fa3a08e7fb55690a61b7137c7704a95263R803-R804
    This isn't a problem since I can use a static char inside the function (but it's just adding more code for style?).

src/flb_utils.c Outdated
}
else {
encoded_to_buf(p, flb_utils_replacement_character, 3);
p += 3;
encoded_to_buf(p, flb_utils_replace_fragment_start,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't you calculate the length of the whole string you're injecting and verify if there's enough space available before doing anything in this path? Unless I'm wrong there are 34 characters not including the hex encoded bytes and spaces between them so it's substantially higher than what's ensured before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this function does not take out buffer size as a parameter, and so there is no way to verify the expanded size will fit in the out buffer. Please see the comment here: #4297 (comment)

@matthewfala
Copy link
Contributor Author

How about converting the invalid fragment: 0xff 0x81 to 

I hate making up our own convention, so I spent some time today looking for existing conventions that exist for dealing with invalid unicode char points.

It seems like by convention, the replacement character was created for this very purpose of replacing invalid sequences. It may be that to prevent loss, we need to come up with our own convention, but it could get confusing to use the replacement character since it exists for the the purpose of my original PR (see the following link)

http://www.unicode.org/L2/L2017/17344r2-clarify-utf8.pdf
http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf (see page 129 on converting Ill-Formed Sequences)

While there isn’t a set convention for what we want (printing out escaped hex) There actually is a range of unicode characters which are reserved for private interpretation. I suggest we use these to represent our escaped hex, more specifically U+E000 to U+E0FF which renders as  to 

https://en.wikipedia.org/wiki/Private_Use_Areas

This is compact, preserves data, and would probably not be found in normal log since these characters are found in the private use area (though like the replacement character there is no guarantee).

It also matches the hex escape going from 1 invalid byte to 6 expanded bytes (If we decide to represent as an escaped code point \uE0XX or 3 bytes if we decide to represent as an actual unicode character)

@leonardo-albertovich
Copy link
Collaborator

Hi Matt, that's much better than what we previously discussed, it'd take the overhead of the long sequence down immensely so please go ahead with it, thanks for going the extra mile, it really makes a difference!

@matthewfala matthewfala force-pushed the cloudwatch-serialization-exception branch 2 times, most recently from f097f40 to 6e0fe88 Compare November 17, 2021 22:49
@matthewfala
Copy link
Contributor Author

@leonardo-albertovich
Hi Leonardo. I implemented using the hex mapping to unicode private use area characters U+E000 to U+E0FF. Would you please take a look when you get the chance?

I think the test cases are complete now. I added out buffer overrun tests + code to handle. If you can think of any other test cases, please let me know.

@leonardo-albertovich
Copy link
Collaborator

The code looks good to me, could you fix the DCO issue? then we'll be pretty much ready to close this up.

Previously with unicode byte sequences such as

   0xef 0xbf 0x00 ...

Fluent Bit would blindly trust the first unicode byte 0xef to describe
how many valid trailing unicode bytes to copy.
If a trailing unicode byte is invalid, such as 0x00, the null character,
the utility blindly copied this to the escaped string.

This commit adds checks for leading and trailing byte utf-8 compliance.
If invalid, the ill-formed character's bytes are individually mapped to
private use area [U+E000 to U+E0FF] preserving ill-formed character data
in a compact and safe utf-8 friendly format.

Signed-off-by: Matthew Fala <[email protected]>
@matthewfala matthewfala force-pushed the cloudwatch-serialization-exception branch from da01c17 to 8eb5c8c Compare November 18, 2021 19:37
@matthewfala
Copy link
Contributor Author

I squashed the commits and signed off.

@matthewfala
Copy link
Contributor Author

matthewfala commented Nov 18, 2021

This is merged merging into 1.8. I'll make another PR to master since you think this PR is complete.

@matthewfala matthewfala changed the title flb_utils_write_str: omit unicode with malform trailing bytes flb_utils_write_str: detect and replace ill-formed utf-8 bytes Nov 18, 2021
@matthewfala matthewfala changed the title flb_utils_write_str: detect and replace ill-formed utf-8 bytes flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 Nov 18, 2021
@matthewfala
Copy link
Contributor Author

matthewfala commented Nov 18, 2021

Please see PR #4346 for inclusion of this commit into master.

Copy link
Collaborator

@leonardo-albertovich leonardo-albertovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, as long as it passes the tests it's got my approval.

Copy link
Contributor

@krispraws krispraws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn’t leave a comment on the line because it was not changed in this commit, but the cast on line 659/670 looks a bit iffy
c = (uint32_t) str[i];
I think it should be c = (uint32_t) (unsigned char) str[i];
I’ll test and confirm later.
Did you test with invalid 2 byte sequences? I think you may see more issues.

Copy link
Contributor Author

@matthewfala matthewfala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c = (uint32_t) (unsigned char) str[i]; Please refer to response to question 2.

src/flb_utils.c Show resolved Hide resolved
src/flb_utils.c Show resolved Hide resolved
@krispraws
Copy link
Contributor

Unfortunately, I do not have the time or bandwidth to run tests. Since you have already decided on these changes, go ahead with your merge.

Copy link
Contributor

@PettitWesley PettitWesley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have anything useful to add here except that its probably worthwhile to add something in the docs about how Fluent Bit handles invalid utf-8 for users.

And possibly (if time allows), may be some deeper details in the dev guide if you think that might help future devs (both folks modifying this code, and folks writing plugins that handle utf 8): https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md

@edsiper edsiper merged commit b5538fc into fluent:1.8 Nov 29, 2021
@edsiper
Copy link
Member

edsiper commented Nov 29, 2021

thanks!

note: please prefix commits only with utils: .... (instead of the function being modified, just the interface name without the flb_ prefix)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants