Preventing cutting unicode characters in half #77

shiroyasha · 2020-01-10T14:08:04Z

Now that we no longer wait for a newline character before flushing the output buffer, a new bug has been introduced: we are cutting Unicode characters in half.

Reminder: a UTF-8 character can span up to 4 bytes. The highest bit (the 8th bit) marks whether the character is finished or whether more information follows in the next byte.

Example:

A UTF-8 character spanning 3 bytes:

10101010 -> 10101010 -> 00101010

A UTF-8 character spanning 1 byte:

00101010

shiroyasha · 2020-01-10T14:19:09Z

@DamjanBecirovic @radwo @bmarkons this is a pretty hard bug to fix. So far, this is my best idea. I would appreciate your review.

dnozay · 2020-01-24T17:35:15Z

pkg/shell/process.go

+			//
+			// An unicode sequence can't be longer than 4 bytes
+			//
+			unicodeContinuationMask := uint(1 << 7)


image credit: https://en.wikipedia.org/wiki/UTF-8

Bug: continuation bytes start with 10xxxxxx; the 1xxxxxxx pattern tested here also matches leading bytes (110xxxxx, 1110xxxx, 11110xxx).
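One way to act on this, as a minimal sketch rather than the patch that was eventually merged: test the top two bits to identify continuation bytes, then hold back any trailing bytes that form an incomplete sequence until the next read. The helper names `isContinuation` and `safePrefixLen` are hypothetical and do not come from `pkg/shell/process.go`; `utf8.FullRune` and `utf8.UTFMax` are from Go's standard `unicode/utf8` package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isContinuation reports whether b is a UTF-8 continuation byte (10xxxxxx).
// Masking with 0xC0 checks the top TWO bits; masking with 1<<7 alone would
// also match leading bytes such as 1110xxxx.
func isContinuation(b byte) bool {
	return b&0xC0 == 0x80
}

// safePrefixLen (hypothetical helper) returns how many leading bytes of buf
// can be flushed without splitting a multi-byte sequence at the end.
func safePrefixLen(buf []byte) int {
	// Look back over at most utf8.UTFMax bytes for the start of the last sequence.
	for i := len(buf) - 1; i >= 0 && i >= len(buf)-utf8.UTFMax; i-- {
		if isContinuation(buf[i]) {
			continue
		}
		// buf[i] starts the last sequence; flush it only if it is complete.
		if utf8.FullRune(buf[i:]) {
			return len(buf)
		}
		return i
	}
	// Empty input, or only continuation bytes in range (invalid data): pass through.
	return len(buf)
}

func main() {
	chunk := []byte("┏┳━")
	chunk = chunk[:len(chunk)-1] // simulate a read that ends mid-character
	n := safePrefixLen(chunk)
	fmt.Printf("flush %d bytes now, keep %d for the next read\n", n, len(chunk)-n)
}
```

The bytes that are held back would be prepended to the next chunk before it is processed, so no character is ever emitted in two pieces.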

This happens very frequently with box-drawing output:

┏┳━━━━━━━━━━━━━��━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━��━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━��━━
┃┃ TEST Report
┗┻━━━━━━━━━━━━━━━━━━━━��━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━��━━━━━━━━━━━━━━━━━━━━━━━━━━━

Box-drawing characters are 3 bytes each in UTF-8.

shiroyasha · 2020-01-24

That's it! Thanks for finding the bug and sharing the solution. 🙇

shiroyasha · 2020-01-27

Work on the fix started here #79.

shiroyasha · 2020-01-30

@dnozay I started implementing the patch, but unfortunately I had a hard time replicating the issue in the Agent's tests.

My current assumption is that the root of the problem is not solely in the Agent, but also upstream in the log processing service. We are investigating further.

I'll update as soon as we find something.

shiroyasha added 2 commits January 10, 2020 13:48

Repreoducable unicode bug

cb92ccb

Preventing cutting unicode characters in half

7e3ee6a

shiroyasha requested review from bmarkons, DamjanBecirovic and radwo January 10, 2020 14:17

shiroyasha added 5 commits January 17, 2020 14:04

Bitshift by 7 places, not by 8 places

a869032

Green tests for Docker x Unicode support

48d8fd4

Run tests on Semaphore

9fd9fe6

Green tests for Shell x Unicode

58691a9

Fix multi-byte output code parsing

92408dd

shiroyasha force-pushed the unicode branch 4 times, most recently from 949f0b7 to 5c8e9a4 January 20, 2020 14:42

Handle cases when the unicode chars are broken

fd21f92

shiroyasha force-pushed the unicode branch 3 times, most recently from 44d47a8 to 10460eb January 20, 2020 15:16

Build matrix

2e81781

shiroyasha force-pushed the unicode branch from 10460eb to 2e81781 January 20, 2020 15:21

shiroyasha merged commit 50fc355 into master Jan 20, 2020

shiroyasha deleted the unicode branch January 20, 2020 15:32

dnozay reviewed Jan 24, 2020


shiroyasha mentioned this pull request Jan 27, 2020

Fix for unicode box drawing characters #79

Merged
