Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip,src: add utf8 consumer/validator #1319

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
179 changes: 179 additions & 0 deletions src/util.cc
Original file line number Diff line number Diff line change
@@ -1,8 +1,187 @@
#include "util.h"
#include "string_bytes.h"

#define UNI_SUR_HIGH_START 0xD800UL
#define UNI_SUR_LOW_END 0xDFFFUL
#define UNI_REPLACEMENT_CHAR 0x0000FFFDUL
#define UNI_MAX_LEGAL_UTF32 0x0010FFFFUL

inline uint32_t log2(uint8_t v) {
const uint32_t r = (v > 15) << 2;
v >>= r;
const uint32_t s = (v > 3) << 1;
v >>= s;
v >>= 1;
return r | s | v;
}

inline uint32_t clz(uint8_t v) {
return 7 - log2(v);
}

namespace node {

typedef size_t (*ErrorStrategy)(size_t, const uint8_t*, size_t);
typedef void (*GlyphStrategy)(const uint8_t*, size_t, uint32_t, size_t);


inline size_t Skip(
const size_t remaining,
const uint8_t* input,
const size_t glyph_size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style issue: the first argument goes on the same line as the function name and the other arguments should line up below it. The only time that's deviated from is when the 80 column limit is exceeded.

if (remaining > glyph_size) {
return 1;
}
return 0;
}


inline size_t Halt(
const size_t remaining,
const uint8_t* input,
const size_t glyph_size) {
return 0;
}


inline void DiscardGlyph(
const uint8_t* glyph,
const size_t glyph_size,
const uint32_t glyph_value,
const size_t pos) {
}


inline bool IsLegalUtf8Glyph(const uint8_t* input, const size_t length) {
uint8_t acc;
const uint8_t* srcptr = input + length;
switch (length) {
default: return false;
case 4:
acc = (*--srcptr);
if (acc < 0x80 || acc > 0xBF)
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add // Fall through. comments? That makes it clear to the reader that the lack of break is intentional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

// fall-through
case 3:
acc = (*--srcptr);
if (acc < 0x80 || acc > 0xBF)
return false;
// fall-through
case 2:
acc = (*--srcptr);
if (acc > 0xBF)
return false;
switch (*input) {
case 0xE0:
if (acc < 0xA0)
return false;
break;
case 0xED:
if (acc > 0x9F)
return false;
break;
case 0xF0:
if (acc < 0x90)
return false;
break;
case 0xF4:
if (acc > 0x8F)
return false;
break;
default:
if (acc < 0x80)
return false;
}
// fall-through
case 1:
if (*input >= 0x80 && *input < 0xC2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0xC2? Shouldn't that be 0xC0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I'll look into that – the original version has it as 0xC2.

return false;
}
}
return *input <= 0xF4;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from possible bugs in the implementation, I suspect that this function pessimizes the common case of input < 0x80.



static const uint32_t offsets_from_utf8[6] = {
0x00000000, 0x00003080, 0x000E2080,
0x03C82080, 0xFA082080, 0x82082080
};


template <ErrorStrategy OnError, typename GlyphStrategy>
inline size_t Utf8Consume(
const uint8_t* const input,
const size_t length,
const GlyphStrategy OnGlyph) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why don't you encode the callback in the template definition? Right now, it's always passed as the third argument. The type variable GlyphStrategy just shadows the typedef of the same name here.

size_t idx = 0;
while (idx < length) {
size_t advance = 0;
uint32_t glyph = 0;
uint8_t extrabytes = input[idx] ? clz(~input[idx]) : 0;
size_t i = idx;

if (extrabytes + idx > length) {
advance = OnError(length - idx, input, extrabytes);
} else if (!IsLegalUtf8Glyph(input + idx, extrabytes + 1)) {
advance = OnError(length - idx, input, extrabytes);
} else {
ASSERT(extrabytes < 4);
switch (extrabytes) {
case 3:
glyph += input[i++];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I brought it up before but shouldn't the high bits be masked off here and below?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I get it now. The glyph value is decremented with a value from the offsets_from_utf8 lookup table. So to take an example, \u7FF is encoded as DF BF. (0xDF << 6) + 0xBF == 0x37FF - 0x3080 == 0x7FF.

It's clever but in a very roundabout way. What's the motivation for doing it like this?

glyph <<= 6;
// fall-through
case 2:
glyph += input[i++];
glyph <<= 6;
// fall-through
case 1:
glyph += input[i++];
glyph <<= 6;
// fall-through
case 0:
glyph += input[i];
}

glyph -= offsets_from_utf8[extrabytes];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of bounds read when extrabytes > 5 (which looks to be possible here.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be caught by IsLegalUtf8Glyph – extrabytes > 5 will hit the default case of that function's switch statement and return false. Should I add a note about that at the start of this branch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. That's not obvious from looking at the code so yes, a comment would help. Perhaps even better is a CHECK or an ASSERT.


if (glyph > UNI_MAX_LEGAL_UTF32 ||
(glyph >= UNI_SUR_HIGH_START && glyph <= UNI_SUR_LOW_END)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

Compilers are pretty good in replacing idiomatic range comparisons with bit hacks. Have you checked whether it's really necessary to have these ... ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Har, har.

(But it's true that both clang and gcc manage to pull it off.)

advance = OnError(length - idx, input, extrabytes);
} else {
advance = extrabytes + 1;
OnGlyph(input + idx, extrabytes + 1, glyph, idx);
}
}

if (advance == 0) {
break;
}
idx += advance;
}
return idx;
}


size_t Utf8Value::StripInvalidUtf8Glyphs(uint8_t* const input, const size_t size) {
size_t idx = 0;
auto on_glyph = [input, &idx](
const uint8_t* data, size_t size, uint32_t glyph, size_t pos) {
size_t old_idx = idx;
idx += size;
if (old_idx == pos) return;
memmove(input + old_idx, data, size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like repeatedly stripping bad character sequences can result in approx. O(n^2) bytes getting copied?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was a proof-of-concept API addition. Ultimately I'll remove this and replace it with something that emits U+FFFD chars on error. @petkaantonov requested that it add replacement characters instead of stripping.

The eventual target APIs are buffer.isLegalUtf8() and (I think!) buffer.replaceInvalidUtf8(), leading up to a utf8 -> utf16 translator.

};

return Utf8Consume<Skip>(input, size, on_glyph);
}


bool Utf8Value::IsValidUtf8(const uint8_t * const input, const size_t size) {
return Utf8Consume<Halt>(input, size, DiscardGlyph) == size;
}


Utf8Value::Utf8Value(v8::Isolate* isolate, v8::Handle<v8::Value> value)
: length_(0), str_(str_st_) {
if (value.IsEmpty())
Expand Down
2 changes: 2 additions & 0 deletions src/util.h
Original file line number Diff line number Diff line change
Expand Up @@ -189,6 +189,8 @@ class Utf8Value {
return length_;
};

static bool IsValidUtf8(const uint8_t* const, const size_t);
static size_t StripInvalidUtf8Glyphs(uint8_t* const, const size_t);
private:
size_t length_;
char* str_;
Expand Down