Preventing cutting unicode characters in half #77
@DamjanBecirovic @radwo @bmarkons this is a pretty hard bug to fix. So far, this is my best idea. I would appreciate your review.
//
// An unicode sequence can't be longer than 4 bytes
//
unicodeContinuationMask := uint(1 << 7)
image credit: https://en.wikipedia.org/wiki/UTF-8
bug: continuation bytes start with 10xxxxxx; the pattern 1xxxxxxx also matches leading bytes.
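To make the distinction concrete, here is a minimal sketch (not the Agent's code; `isContinuation` and `hasHighBit` are hypothetical names) comparing the two checks on the three bytes of the box-drawing character ━ (U+2501, encoded as E2 94 81):

```go
package main

import "fmt"

// isContinuation reports whether b is a UTF-8 continuation byte,
// i.e. its top two bits are exactly 10 (10xxxxxx).
func isContinuation(b byte) bool {
	return b&0xC0 == 0x80
}

// hasHighBit is the looser check (1xxxxxxx). It is true for continuation
// bytes and also for the leading byte of any multi-byte sequence.
func hasHighBit(b byte) bool {
	return b&0x80 != 0
}

func main() {
	for _, b := range []byte("━") {
		fmt.Printf("%08b highBit=%v continuation=%v\n", b, hasHighBit(b), isContinuation(b))
	}
	// Prints:
	// 11100010 highBit=true continuation=false
	// 10010100 highBit=true continuation=true
	// 10000001 highBit=true continuation=true
}
```

The leading byte 11100010 passes the high-bit test but is not a continuation byte, which is why a mask of `1 << 7` alone cannot tell where a character starts.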
This happens very frequently when doing box drawing
┏┳━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━���━━
┃┃ TEST Report
┗┻━━━━━━━━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━
Box drawing characters are 3 bytes.
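A small sketch of why this shows up so often with box drawing: every character in such a line is 3 bytes, so a cut at an arbitrary byte offset lands inside a character two times out of three. The helper `cutIsClean` is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cutIsClean reports whether splitting s at byte offset n leaves
// valid UTF-8 on both sides of the cut.
func cutIsClean(s string, n int) bool {
	return utf8.ValidString(s[:n]) && utf8.ValidString(s[n:])
}

func main() {
	line := "┏┳━━"
	// 4 characters, 3 bytes each.
	fmt.Println(len(line), utf8.RuneCountInString(line))

	// Only offsets that are multiples of 3 fall on a character boundary here.
	for n := 0; n <= len(line); n++ {
		fmt.Printf("cut at %2d clean=%v\n", n, cutIsClean(line, n))
	}
}
```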
That's it! Thanks for finding the bug and sharing the solution. 🙇
Work on the fix started in #79.
@dnozay I started implementing the patch, but unfortunately I had a hard time reproducing the issue in the Agent's tests.
My current assumption is that the root of the problem is not only in the Agent but also upstream, in the log processing service. We are investigating further.
I'll update as soon as we find something.
Now that we are no longer waiting for the newline character to flush the output buffer, a new bug has been introduced: we are cutting Unicode characters in half.
Reminder: a UTF-8 character can span up to 4 bytes. The high bits of each byte mark its role: a single-byte character starts with 0, the leading byte of a multi-byte character starts with 11, and every continuation byte starts with 10.
Example:
A UTF-8 character spanning 3 bytes:
11100010 -> 10010100 -> 10000001
A UTF-8 character spanning 1 byte:
00101010
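One possible way to avoid the split when flushing without waiting for a newline is to hold back any incomplete trailing sequence until more bytes arrive. The following is only a sketch under that assumption, not the patch in this PR; `safeCut` is a hypothetical helper built on the standard `unicode/utf8` package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// safeCut returns the largest n <= len(buf) such that buf[:n] does not end
// in the middle of a UTF-8 sequence. Bytes in buf[n:] should be kept in the
// buffer until the rest of the character arrives.
func safeCut(buf []byte) int {
	n := len(buf)
	// A sequence is at most utf8.UTFMax (4) bytes, so the start of the last
	// character, if any, is within the last 4 bytes.
	for i := n - 1; i >= 0 && i >= n-utf8.UTFMax; i-- {
		if utf8.RuneStart(buf[i]) {
			if utf8.FullRune(buf[i:]) {
				return n // last character is complete
			}
			return i // last character is incomplete: cut before it
		}
	}
	// No start byte found in the tail (empty or malformed input):
	// flush everything rather than stall.
	return n
}

func main() {
	data := []byte("┏┳")  // two 3-byte characters
	chunk := data[:4]     // second character is cut after its first byte
	n := safeCut(chunk)
	fmt.Printf("flush %d bytes, hold back %d\n", n, len(chunk)-n)
}
```

Malformed input is flushed as-is in this sketch; a real implementation would need to decide how to treat it.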