flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 #4297

matthewfala · 2021-11-09T18:20:33Z

Previously with unicode byte sequences such as

0xef 0xbf 0x00 ...

Fluent Bit would blindly trust the first unicode byte 0xef to describe
how many valid trailing unicode bytes to copy.
If a trailing unicode byte is invalid, such as 0x00, the null character,
the utility blindly copied this to the escaped string.

This commit adds checks for leading and trailing byte utf-8 compliance.
If invalid, the ill-formed character's bytes are individually mapped to
private use area [U+E000 to U+E0FF] preserving ill-formed character data
in a compact and safe utf-8 friendly format.

Signed-off-by: Matthew Fala [email protected]

Version 1 (See comments for documentation on revisions)

I faked some invalid unicode (not sure how it would get to this part of the cloudwatch plugin, but apparently it is)

            char* c1 = str + i+0;
            char* c2 = str + i+1;
            char* c3 = str + i+2;
            *c3 = 0;
            is_valid = FLB_TRUE;
            for (b = 0; b < hex_bytes; b++) {
                tmp[b] = str[i+b];
            }

Fluent bit fails just as customer fluent bit fails.

[2021/11/09 01:56:45] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Log Stream batch3-2_app.log already exists
[2021/11/09 01:56:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sending 1 events
[2021/11/09 01:56:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sending log events to log stream batch3-2_app.log
[2021/11/09 01:56:57] [debug] [http_client] not using http_proxy for header
[2021/11/09 01:56:57] [debug] [aws_credentials] Requesting credentials from the EC2 provider..
[2021/11/09 01:56:57] [debug] [http_client] server logs.us-west-2.amazonaws.com:443 will close connection #33
[2021/11/09 01:56:57] [debug] [aws_client] logs.us-west-2.amazonaws.com: http_do=0, HTTP Status: 400
[2021/11/09 01:56:57] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents http status=400
[2021/11/09 01:56:58] [error] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents API responded with error='SerializationException'
[2021/11/09 01:57:00] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/11/09 01:57:00] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send log events
[2021/11/09 01:57:00] [error] [output:cloudwatch_logs:cloudwatch_logs.1] Failed to send events

Add some new code to invalidate unicode when trailing characters do not adhere to the UTF-8 specs

                /* Trailing characters must start with bit 1 */
                if (b > 0 && !(str[i+b] & 0x80)) {
                    /* Invalid unicode character. skip */
                    flb_warn("[pack] invalid UTF-8 bytes found, skipping bytes");
                    is_valid = FLB_FALSE;
                    break;
                }

New code:

            char* c1 = str + i+0;
            char* c2 = str + i+1;
            char* c3 = str + i+2;
            *c3 = 0;
            is_valid = FLB_TRUE;
            for (b = 0; b < hex_bytes; b++) {
                /* Trailing characters must start with bit 1 */
                if (b > 0 && !(str[i+b] & 0x80)) {
                    /* Invalid unicode character. skip */
                    flb_warn("[pack] invalid UTF-8 bytes found, skipping bytes");
                    is_valid = FLB_FALSE;
                    break;
                }

                tmp[b] = str[i+b];
            }

Now fluent bit skips the invalid unicode bytes and the serialization exception no longer ouccurs

[2021/11/09 02:01:20] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] PutLogEvents http status=200
[2021/11/09 02:01:20] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sent events to batch3-2_app.log
[2021/11/09 02:01:20] [debug] [output:cloudwatch_logs:cloudwatch_logs.1] Sent 1 events to CloudWatch
[2021/11/09 02:01:22] [ warn] [pack] invalid UTF-8 bytes found, skipping bytes
[2021/11/09 02:01:32] [debug] [out coro] cb_destroy coro_id=0

The problem with invalid unicode is this: Fluent bit would previously blindly copy over unicode bytes as specified by the unicode character, not paying attention to if those bytes are up to unicode specs. This usually isn't a problem, but if a null character somehow sneeks in (as shown above) the c string containing the cloudwatch payload gets terminated. This causes a serialization exception.

This also potentially affected other plugins such as the output file plugin.

Config file

[SERVICE]
     Grace 30
     Log_Level debug

# Provide entry point for logs
[INPUT]
     Name http
     host 0.0.0.0
     port 8888

[INPUT]
     Name        forward
     Listen      0.0.0.0
     Port        24224

[OUTPUT]
     Name stdout
     Match *
     
[OUTPUT]
     Name cloudwatch_logs
     Match *
     log_stream_prefix batch3-2_
     log_group_name xx
     auto_create_group true
     region us-west-2

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

Example configuration file for the change
Debug log output from testing the change

[N/A] Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

[N/A] Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

matthewfala · 2021-11-09T18:22:07Z

I can make another pull request into master when changes are finalized.

src/flb_utils.c

PettitWesley

I think we need @edsiper Or @nokute78 Or some other core maintainer to approve this before we should merge it.

leonardo-albertovich · 2021-11-10T12:59:36Z

From what I could see there is nothing dangerous about this change, however, I'm not familiar enough with that piece of code so I wrote two small tests (mostly to see if it crashed) :

Reads a random amount of bytes from /dev/urandom and passes it through flb_utils_write_str
Passes an arbitrary UTF-8 sequence through flb_utils_write_str and writes the result to stdout

The first test returned a positive result (no crashes), however, with the second test I noticed something strange (that could be my mistake) which is that it didn't seem to detect and skip a wrong sequence.

I got the test sequence \xE6\xBC\xA2\xE5\xAD\x97 from this web tool, passed it through the system and it worked. Then I modified it to extend the first sequence by toggling one additional bit in the first byte \xF6\xBC\xA2\xE5\xAD\x97 which resulted in the web tool detecting it as a bad sequence but the test case not doing so.

It seems to me that the problem is not related to what you were trying to address with your patch but I thought I would mention it just in case it helps.

These are the test cases I created :

brute force test case
specific test case

Additionally, even though I understand the processing method implemented in the code you modified I'm not familiar with the previous block and I'm not fond of the function as a whole but that's just my opinion.

To summarize : I didn't find anything wrong with your patch in terms of what you intended to do nor any security issues but I think I found a preexisting issue there but I want to be clear about my lack of familiarity and acknowledge that I could be wrong about it.

Don't hesitate to contact me if there's anything I can do to help!

matthewfala · 2021-11-11T02:27:08Z

Thank you @leonardo-albertovich. I worked on handling this in a fix today and added some test cases.
I'll push them tomorrow.

Just one important question. Someone on the team brought up that logs might be in UTF16 or other formats. To confirm, Fluent Bit only expect to deal properly with UTF8/ascii as suggested by functions like flb_utf8_len?

I don't think that this code needs to handle the UTF16 or other text formats, right?

leonardo-albertovich · 2021-11-11T10:16:59Z

Thank you @leonardo-albertovich. I worked on handling this in a fix today and added some test cases. I'll push them tomorrow.

Just one important question. Someone on the team brought up that logs might be in UTF16 or other formats. To confirm, Fluent Bit only expect to deal properly with UTF8/ascii as suggested by functions like flb_utf8_len?

I don't think that this code needs to handle the UTF16 or other text formats, right?

I could be wrong but I'm fairly certain that fluent-bit only handles ASCII / UTF-8.

matthewfala · 2021-11-11T19:31:39Z

Great. That would explain why functions like flb_utf8_len exist. The current version of this patch now includes the invalid sequence you brought up, and now uses a replacement character to mark invalid bytes (non-utf8) instead of getting rid of the bytes.

However, I realized that there is another case which is not taken care of, where the leading byte does not begin with bits 11xx xxxx. I'll add tests and make another patch for this case hopefully today.

matthewfala · 2021-11-12T02:11:32Z

@leonardo-albertovich
Could you please take another look? I am ready for your re-review.
Since you last checked, I added handling of the invalid sequence type you described, as well as another.

I also added some test cases which should be helpful in comparing input to output.

Please let me know if you have any other concerns, comments or questions.

leonardo-albertovich · 2021-11-12T11:53:11Z

src/flb_utils.c

@@ -744,23 +746,38 @@ int flb_utils_write_str(char *buf, int *off, size_t size,

            is_valid = FLB_TRUE;
            for (b = 0; b < hex_bytes; b++) {


nitpick : would it be possible to use more expressive names such as utf_sequence_number or utf_sequence_length?

Sounds good. I like utf_sequence_length, utf_sequence_number. Maybe i could also be changed to buf_byte_index

leonardo-albertovich · 2021-11-12T11:58:03Z

src/flb_utils.c


            if (is_valid) {
                encoded_to_buf(p, tmp, hex_bytes);
                p += hex_bytes;
            }
-            i += (hex_bytes - 1);
+            else {
+                encoded_to_buf(p, flb_utils_replacement_character, 3);


annoying improvement suggestion : wouldn't it be nice if instead of replacing the broken sequence we escaped those borked code units?

"\xe3\x81\x01abc" would become "\\xe3\\x81\\x01abc" instead of "�\\u0001abc" which in a way would be more representative.

We could add the � to mark the broken sequence too if we document it, it just seems less lossy.

I did consider this but \xe3 is an invalid JSON sequence, and this function is already used in several places (msgpack and cloudwatch api) to sanitize text for JSON. [ \\xe3 could overlap with unescaped text and should also be avoided ]

I also considered:

escaping \xe3\x81\x01abc to something like \u00e3\u0081\u0001
however I'm pretty sure that the codepoints such as \u0081 will overlap with ascii codepoints see: https://en.wikipedia.org/wiki/UTF-8#Encoding and thus should be avoided

just writing the hex e38101, but that would overlap with generic text

leaving the invalid fragments, but that would potentially mean the string is not safe/escaped

deleting the invalid fragments, which is fine, but adding the replacement character seems like an option that gives more information.

After meeting with some people on the team, we determined that the replacement character would most likely be the best way to deal with the invalid fragments. What do you think?

@leonardo-albertovich

Considering that \x## could break JSON blocks we could use something like that ? sign as a delimiter and put the list of hex digits between them.

Something like this : � e3 81 01 �abc

What do you think of that? The idea would be not losing data that might help the person affected by the issue figure it out.

� e3 81 01 �abc is actually valid Unescaped Unicode so a user could actually have a log with � e3 81 01 �abc. It would be impossible to distinguish this from the invalid fragment once the text is escaped. I think ambiguous data is worse than lost data.

If you want this solution, I can work on it. Just please confirm that you are considering that the output data is ambiguous.

I guess in the same sense a single � would be valid. Maybe then we could add a disclaimer such as � corrupted utf-8 sequence : e3 81 01 �abc

Now that's quite the footprint for an error but if that's possible then I'd rather go with that (or a shorter variation) than discarding data.

Great. I'll work on a fix and post it to this pr hopefully by today.

leonardo-albertovich · 2021-11-12T11:59:40Z

tests/internal/utils.c

+        /* Invalid unicode (one bad trailing byte) */
+        {
+            "\xe3\x81\x01""abc", 6,  /* note that 0x01 is an invalid byte */
+            "�""\\u0001abc"  /* replace invalid unicode */


nitpick: I'm not sure if there's anything in the coding style guide about unicode in the source code but I personally would prefer is there was no unicode in the source code, not because of the latest trend but because I think it pollutes the files.

Sure, this can be changed into escaped version of the replacement character.

matthewfala · 2021-11-16T02:29:43Z

I modified the test cases and code to match the format you prescribed.

I'll make sure to squash my commits before the PR is merged. I just want to keep record of what was changed.

I changed the variable names here: a006b7b

After changing to the new variable names, some of the lines got a lot longer and might be harder to read now. Please let me know if you want me to change them back, because that wouldn't be a problem.

matthewfala · 2021-11-16T03:27:19Z

It seems like there may be a factor here that we have not yet considered:

When this function is called:

fluent-bit/plugins/out_cloudwatch_logs/cloudwatch_api.c

Line 426 in 50a930a

* If all chars need to be hex encoded (impossible), 6x space would be

And elsewhere, it appears that caller is responsible for making a buffer that is large enough that it cannot be overrun. That is because this function has no input identifying output buffer size.

The cloudwatch api chooses a buffer size of 6x the unescaped string because previously each byte could potentially be hex and get escaped from 1 byte to 6 via xx -> \u00xx (1 -> 6 characters)

If we are going to accept code that transforms xx to � xx � or � corrupted utf-8 sequence : xx �
we are changing this inflation from 6 to 10 or ~40 respectively. This may not be acceptable. What do you think @leonardo-albertovich

leonardo-albertovich · 2021-11-16T10:53:13Z

src/flb_utils.c

+#define FLB_UTILS_REPLACE_FRAGMENT_END_LEN 4
+
+static char *flb_utils_replace_fragment_start
+    = "\xEF\xBF\xBD"" corrupted utf-8 sequence : ";


Could you make this a single constant character string?

Do you mean:

adding the const keyword to this cstring declaration/definition?
That sounds good.

changing this to a Macro definition
If changed to a macro definition, we'll need to instantiate an actual string later before copying bytes over 1 by one: https://github.com/fluent/fluent-bit/pull/4297/files#diff-d7e2a365ad467f9c9eb8283ff56105fa3a08e7fb55690a61b7137c7704a95263R803-R804
This isn't a problem since I can use a static char inside the function (but it's just adding more code for style?).

leonardo-albertovich · 2021-11-16T11:03:24Z

src/flb_utils.c

            }
            else {
-                encoded_to_buf(p, flb_utils_replacement_character, 3);
-                p += 3;
+                encoded_to_buf(p, flb_utils_replace_fragment_start,


Shouldn't you calculate the length of the whole string you're injecting and verify if there's enough space available before doing anything in this path? Unless I'm wrong there are 34 characters not including the hex encoded bytes and spaces between them so it's substantially higher than what's ensured before.

No, this function does not take out buffer size as a parameter, and so there is no way to verify the expanded size will fit in the out buffer. Please see the comment here: #4297 (comment)

matthewfala · 2021-11-17T00:42:26Z

How about converting the invalid fragment: 0xff 0x81 to 

I hate making up our own convention, so I spent some time today looking for existing conventions that exist for dealing with invalid unicode char points.

It seems like by convention, the replacement character was created for this very purpose of replacing invalid sequences. It may be that to prevent loss, we need to come up with our own convention, but it could get confusing to use the replacement character since it exists for the the purpose of my original PR (see the following link)

http://www.unicode.org/L2/L2017/17344r2-clarify-utf8.pdf
http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf (see page 129 on converting Ill-Formed Sequences)

While there isn’t a set convention for what we want (printing out escaped hex) There actually is a range of unicode characters which are reserved for private interpretation. I suggest we use these to represent our escaped hex, more specifically U+E000 to U+E0FF which renders as  to 

https://en.wikipedia.org/wiki/Private_Use_Areas

This is compact, preserves data, and would probably not be found in normal log since these characters are found in the private use area (though like the replacement character there is no guarantee).

It also matches the hex escape going from 1 invalid byte to 6 expanded bytes (If we decide to represent as an escaped code point \uE0XX or 3 bytes if we decide to represent as an actual unicode character)

leonardo-albertovich · 2021-11-17T10:56:24Z

Hi Matt, that's much better than what we previously discussed, it'd take the overhead of the long sequence down immensely so please go ahead with it, thanks for going the extra mile, it really makes a difference!

matthewfala · 2021-11-17T22:55:50Z

@leonardo-albertovich
Hi Leonardo. I implemented using the hex mapping to unicode private use area characters U+E000 to U+E0FF. Would you please take a look when you get the chance?

I think the test cases are complete now. I added out buffer overrun tests + code to handle. If you can think of any other test cases, please let me know.

leonardo-albertovich · 2021-11-18T11:17:45Z

The code looks good to me, could you fix the DCO issue? then we'll be pretty much ready to close this up.

Previously with unicode byte sequences such as 0xef 0xbf 0x00 ... Fluent Bit would blindly trust the first unicode byte 0xef to describe how many valid trailing unicode bytes to copy. If a trailing unicode byte is invalid, such as 0x00, the null character, the utility blindly copied this to the escaped string. This commit adds checks for leading and trailing byte utf-8 compliance. If invalid, the ill-formed character's bytes are individually mapped to private use area [U+E000 to U+E0FF] preserving ill-formed character data in a compact and safe utf-8 friendly format. Signed-off-by: Matthew Fala <[email protected]>

matthewfala · 2021-11-18T19:46:37Z

I squashed the commits and signed off.

matthewfala · 2021-11-18T20:02:18Z

This is ~~merged~~ merging into 1.8. I'll make another PR to master since you think this PR is complete.

matthewfala · 2021-11-18T20:30:47Z

Please see PR #4346 for inclusion of this commit into master.

leonardo-albertovich

This looks good to me, as long as it passes the tests it's got my approval.

src/flb_utils.c

krispraws

I couldn’t leave a comment on the line because it was not changed in this commit, but the cast on line 659/670 looks a bit iffy
c = (uint32_t) str[i];
I think it should be c = (uint32_t) (unsigned char) str[i];
I’ll test and confirm later.
Did you test with invalid 2 byte sequences? I think you may see more issues.

matthewfala

c = (uint32_t) (unsigned char) str[i]; Please refer to response to question 2.

src/flb_utils.c

krispraws · 2021-11-22T20:18:28Z

Unfortunately, I do not have the time or bandwidth to run tests. Since you have already decided on these changes, go ahead with your merge.

PettitWesley

I don't really have anything useful to add here except that its probably worthwhile to add something in the docs about how Fluent Bit handles invalid utf-8 for users.

And possibly (if time allows), may be some deeper details in the dev guide if you think that might help future devs (both folks modifying this code, and folks writing plugins that handle utf 8): https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md

edsiper · 2021-11-29T22:41:02Z

thanks!

note: please prefix commits only with utils: .... (instead of the function being modified, just the interface name without the flb_ prefix)

github-actions bot added the docs-required label Nov 9, 2021

matthewfala commented Nov 9, 2021

View reviewed changes

src/flb_utils.c Outdated Show resolved Hide resolved

zhonghui12 approved these changes Nov 9, 2021

View reviewed changes

PettitWesley approved these changes Nov 10, 2021

View reviewed changes

matthewfala force-pushed the cloudwatch-serialization-exception branch from 7b27898 to d588bee Compare November 10, 2021 02:03

matthewfala force-pushed the cloudwatch-serialization-exception branch 4 times, most recently from 0634391 to 184a49b Compare November 12, 2021 02:08

leonardo-albertovich reviewed Nov 12, 2021

View reviewed changes

leonardo-albertovich reviewed Nov 16, 2021

View reviewed changes

matthewfala force-pushed the cloudwatch-serialization-exception branch 2 times, most recently from f097f40 to 6e0fe88 Compare November 17, 2021 22:49

matthewfala force-pushed the cloudwatch-serialization-exception branch from da01c17 to 8eb5c8c Compare November 18, 2021 19:37

matthewfala changed the title ~~flb_utils_write_str: omit unicode with malform trailing bytes~~ flb_utils_write_str: detect and replace ill-formed utf-8 bytes Nov 18, 2021

matthewfala mentioned this pull request Nov 18, 2021

flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> master #4346

Merged

matthewfala changed the title ~~flb_utils_write_str: detect and replace ill-formed utf-8 bytes~~ flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 Nov 18, 2021

leonardo-albertovich approved these changes Nov 19, 2021

View reviewed changes

krispraws reviewed Nov 20, 2021

View reviewed changes

src/flb_utils.c Show resolved Hide resolved

krispraws reviewed Nov 20, 2021

View reviewed changes

src/flb_utils.c Show resolved Hide resolved

krispraws reviewed Nov 20, 2021

View reviewed changes

matthewfala commented Nov 22, 2021

View reviewed changes

src/flb_utils.c Show resolved Hide resolved

src/flb_utils.c Show resolved Hide resolved

PettitWesley reviewed Nov 22, 2021

View reviewed changes

leonardo-albertovich approved these changes Nov 24, 2021

View reviewed changes

edsiper merged commit b5538fc into fluent:1.8 Nov 29, 2021

		@@ -744,23 +746,38 @@ int flb_utils_write_str(char buf, int off, size_t size,

		is_valid = FLB_TRUE;
		for (b = 0; b < hex_bytes; b++) {

flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 #4297

flb_utils_write_str: detect and replace ill-formed utf-8 bytes -> 1.8 #4297

Conversation

matthewfala commented Nov 9, 2021 • edited Loading

Version 1 (See comments for documentation on revisions)

matthewfala commented Nov 9, 2021

PettitWesley left a comment

Choose a reason for hiding this comment

leonardo-albertovich commented Nov 10, 2021

matthewfala commented Nov 11, 2021

leonardo-albertovich commented Nov 11, 2021

matthewfala commented Nov 11, 2021

matthewfala commented Nov 12, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leonardo-albertovich Nov 15, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matthewfala commented Nov 16, 2021 • edited Loading

matthewfala commented Nov 16, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matthewfala commented Nov 17, 2021

leonardo-albertovich commented Nov 17, 2021

matthewfala commented Nov 17, 2021

leonardo-albertovich commented Nov 18, 2021

matthewfala commented Nov 18, 2021

matthewfala commented Nov 18, 2021 • edited Loading

matthewfala commented Nov 18, 2021 • edited Loading

leonardo-albertovich left a comment

Choose a reason for hiding this comment

krispraws left a comment

Choose a reason for hiding this comment

matthewfala left a comment

Choose a reason for hiding this comment

krispraws commented Nov 22, 2021

PettitWesley left a comment

Choose a reason for hiding this comment

edsiper commented Nov 29, 2021

matthewfala commented Nov 9, 2021 •

edited

Loading

leonardo-albertovich Nov 15, 2021 •

edited

Loading

matthewfala commented Nov 16, 2021 •

edited

Loading

matthewfala commented Nov 18, 2021 •

edited

Loading

matthewfala commented Nov 18, 2021 •

edited

Loading