Preventing cutting unicode characters in half #77

Merged
merged 9 commits into from
Jan 20, 2020
Conversation

shiroyasha
Contributor

@shiroyasha shiroyasha commented Jan 10, 2020

Now that we no longer wait for a newline character before flushing the output buffer, a new bug has been introduced: we are cutting Unicode characters in half.

Reminder: a UTF-8 character can span up to 4 bytes. The highest bit (the 8th bit) of a byte is set when that byte is part of a multi-byte character. A leading byte starts with 110, 1110, or 11110 depending on the length of the sequence, and every continuation byte starts with 10.

Example:

A UTF-8 character spanning 3 bytes:

11100010 -> 10010100 -> 10000001

A UTF-8 character spanning 1 byte:

00101010
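To inspect the real bit layout of any character, a small Go sketch can print each of its UTF-8 bytes in binary. The helper name `bitPattern` is hypothetical and is not part of the Agent; it only illustrates the byte patterns discussed here.

```go
package main

import (
	"fmt"
	"strings"
)

// bitPattern returns the UTF-8 bytes of s written as 8-bit binary groups.
// Hypothetical helper, for illustration only.
func bitPattern(s string) string {
	parts := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ { // index by byte, not by rune
		parts = append(parts, fmt.Sprintf("%08b", s[i]))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(bitPattern("*")) // 00101010
	fmt.Println(bitPattern("━")) // 11100010 10010100 10000001
}
```

The single-byte character has its highest bit clear, while all three bytes of the box-drawing character U+2501 have it set.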

@shiroyasha
Contributor Author

@DamjanBecirovic @radwo @bmarkons this is a pretty hard bug to fix. So far, this is my best idea. I would appreciate your review.

@shiroyasha shiroyasha force-pushed the unicode branch 4 times, most recently from 949f0b7 to 5c8e9a4 Compare January 20, 2020 14:42
@shiroyasha shiroyasha force-pushed the unicode branch 3 times, most recently from 44d47a8 to 10460eb Compare January 20, 2020 15:16
@shiroyasha shiroyasha merged commit 50fc355 into master Jan 20, 2020
@shiroyasha shiroyasha deleted the unicode branch January 20, 2020 15:32
//
// An unicode sequence can't be longer than 4 bytes
//
unicodeContinuationMask := uint(1 << 7)

@dnozay dnozay Jan 24, 2020


image credit: https://en.wikipedia.org/wiki/UTF-8

image

Bug: continuation bytes start with 10xxxxxx; the pattern 1xxxxxxx also matches leading bytes.
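One way to act on this, sketched in Go under stated assumptions: test the top two bits instead of only the top bit, and use that check to find how much of a buffer can be flushed without ending mid-character. The names `isContinuation` and `safePrefixLen` are hypothetical and are not taken from the Agent's code; the patch that eventually landed may look different.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isContinuation reports whether b is a UTF-8 continuation byte (10xxxxxx).
// Testing only the top bit (b&0x80 != 0) would also match leading bytes.
func isContinuation(b byte) bool {
	return b&0xC0 == 0x80
}

// safePrefixLen returns how many leading bytes of buf can be flushed without
// ending in the middle of a character. Hypothetical helper, for illustration.
func safePrefixLen(buf []byte) int {
	// A UTF-8 sequence is at most 4 bytes, so inspecting the last 4 bytes
	// is enough to find the start of the final character.
	i := len(buf)
	for back := 0; back < utf8.UTFMax && i > 0; back++ {
		i--
		if !isContinuation(buf[i]) {
			// buf[i] starts the last character; keep it only if complete.
			if utf8.FullRune(buf[i:]) {
				return len(buf)
			}
			return i
		}
	}
	// No character start found in the tail: invalid input, flush as-is.
	return len(buf)
}

func main() {
	buf := []byte("ab━")                // 2 ASCII bytes + one 3-byte character
	fmt.Println(safePrefixLen(buf))     // 5
	fmt.Println(safePrefixLen(buf[:4])) // 2
}
```

The bytes after the returned length would stay in the buffer until the rest of the character arrives. Invalid input is flushed unchanged so that malformed bytes cannot stall the output indefinitely; that is a design assumption of this sketch.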


This happens very frequently when doing box drawing:

┏┳━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━���━━
┃┃ TEST Report
┗┻━━━━━━━━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━���━━━━━━━━━━━━━━━━━━━━━━━━━━━

Box drawing characters are 3 bytes.

Contributor Author


That's it! Thanks for finding the bug and sharing the solution. 🙇

Contributor Author


Work on the fix has started in #79.

Contributor Author

@shiroyasha shiroyasha Jan 30, 2020


@dnozay I started with the implementation of the patch, but unfortunately, I had a hard time replicating the issue in the Agent's tests.

My current assumption is that the root of the problem is not only in the Agent, but also upstream in the log processing service. We are investigating further.

I'll update as soon as we find something.


2 participants