wip,src: add utf8 consumer/validator #1319

chrisdickinson · 2015-04-01T20:46:22Z

WIP. Opening this now to double check that this is headed in the right direction.

This is based on utf-8-validate. The primary differences are the use of clz (vs. a lookup table) to compute the number of extra bytes and the introduction of the error/glyph callbacks.

The Utf8Consume function iterates over valid glyphs, calling a provided OnGlyph callback for each one and an OnError callback as necessary. The provided error strategies are "Halt" and "Skip", which either halt the consumer at the first error or skip past errors as appropriate.

The strategies were added to accommodate the desire to build a utf8-to-utf16 translator as part of this work.
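To make the description above concrete, here is a minimal sketch of a callback-driven consumer with the two error strategies. The thread does not show the PR's actual signatures, so every name here (ErrorStrategy, SequenceLength, DecodeOne, Utf8ConsumeSketch) is hypothetical, and the Skip policy of resynchronizing one byte at a time is an assumption rather than what the PR does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// All names below are hypothetical; the PR's real API is not shown in the thread.
enum class ErrorStrategy { kHalt, kSkip };

// Sequence length implied by a lead byte, or 0 if the byte cannot start one.
inline size_t SequenceLength(uint8_t lead) {
  if (lead < 0x80) return 1;
  if (lead >= 0xC2 && lead <= 0xDF) return 2;
  if (lead >= 0xE0 && lead <= 0xEF) return 3;
  if (lead >= 0xF0 && lead <= 0xF4) return 4;
  return 0;
}

// Decodes one sequence starting at input[0]. Returns bytes consumed, 0 on error.
inline size_t DecodeOne(const uint8_t* input, size_t remaining,
                        uint32_t* glyph) {
  assert(remaining > 0);
  const size_t len = SequenceLength(input[0]);
  if (len == 0 || len > remaining) return 0;
  static const uint8_t kLeadMask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
  static const uint32_t kMinValue[] = {0, 0, 0x80, 0x800, 0x10000};
  uint32_t cp = input[0] & kLeadMask[len];
  for (size_t i = 1; i < len; ++i) {
    if ((input[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
    cp = (cp << 6) | (input[i] & 0x3F);
  }
  // Reject overlong forms, out-of-range values, and surrogates.
  if (cp < kMinValue[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  *glyph = cp;
  return len;
}

// Returns true if the whole input was consumed without halting.
inline bool Utf8ConsumeSketch(const uint8_t* input, size_t length,
                              ErrorStrategy strategy,
                              const std::function<void(uint32_t)>& on_glyph,
                              const std::function<void(size_t)>& on_error) {
  size_t idx = 0;
  while (idx < length) {
    uint32_t glyph = 0;
    const size_t used = DecodeOne(input + idx, length - idx, &glyph);
    if (used != 0) {
      on_glyph(glyph);
      idx += used;
      continue;
    }
    on_error(idx);  // report the byte offset of the failure
    if (strategy == ErrorStrategy::kHalt) return false;
    ++idx;  // Skip (assumed policy): move one byte forward and retry
  }
  return true;
}
```

A utf8-to-utf16 translator along the lines mentioned above could, under these assumptions, pass an on_glyph callback that appends UTF-16 code units and an on_error callback that appends U+FFFD.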

bnoordhuis · 2015-04-01T20:56:40Z

src/util.cc

+#define UNI_SUR_HIGH_START    (uint32_t) 0xD800
+#define UNI_SUR_LOW_END       (uint32_t) 0xDFFF
+#define UNI_REPLACEMENT_CHAR  (uint32_t) 0x0000FFFD
+#define UNI_MAX_LEGAL_UTF32   (uint32_t) 0x0010FFFF


Style: no C-style casts (here and elsewhere.)

Fixed. Sorry about that, missed while porting.

bnoordhuis · 2015-04-01T21:12:58Z

src/util.cc

+            return false;
+      }
+    case 1:
+      if (*input >= 0x80 && *input < 0xC2) {


0xC2? Shouldn't that be 0xC0?

Hm. I'll look into that – the original version has it as 0xC2.
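For context on the 0xC2 bound: a two-byte sequence carries 11 payload bits, and lead bytes 0xC0 and 0xC1 can only produce values up to U+007F, which already have a one-byte encoding. Such overlong forms are not valid UTF-8, so 0xC2 is the smallest legal two-byte lead. The sketch below checks this arithmetic; RawTwoByte is a hypothetical helper written for illustration, not code from the PR.

```cpp
#include <cstdint>

// Decodes a two-byte sequence with no validation at all (illustration only).
constexpr uint32_t RawTwoByte(uint8_t lead, uint8_t cont) {
  return (static_cast<uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}

// The largest value reachable from 0xC0 or 0xC1 is still in the ASCII range.
static_assert(RawTwoByte(0xC0, 0x80) == 0x00, "0xC0 0x80 is an overlong NUL");
static_assert(RawTwoByte(0xC1, 0xBF) == 0x7F, "0xC1 tops out at U+007F");

// 0xC2 is the first lead byte whose smallest value needs two bytes.
static_assert(RawTwoByte(0xC2, 0x80) == 0x80, "0xC2 starts at U+0080");
```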

piscisaureus · 2015-04-01T21:30:43Z

The msvc "equivalent" is _BitScanReverse. https://github.com/libuv/libuv/blob/3346082132dc2ff809dfd5d25d451c0b33905f53/src/win/tty.c#L1442

Edit: you already knew :) sorry

bnoordhuis · 2015-04-01T21:32:24Z

src/util.cc

+  while (idx < length) {
+    size_t advance = 0;
+    uint32_t glyph = 0;
+    uint8_t extrabytes = (uint8_t)clz(~(static_cast<int>(input[idx])<<24));


I think what you want here can be implemented portably and reasonably efficiently using the following:

inline uint32_t log2(uint8_t v) {
  const uint32_t r = (v > 15) << 2;
  v >>= r;
  const uint32_t s = (v > 3) << 1;
  v >>= s;
  v >>= 1;
  return r | s | v;
}

inline uint32_t clz(uint8_t v) {
  // clz(0) == 7. Add a zero check if that's an issue.
  return 7 - log2(v);
}

Forgot to mention, the behavior of static_cast<int>(...) << 24 is implementation-defined on platforms where ints are 32 bits (i.e. all of them.) You're not allowed to shift values into the sign bit.
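As a sketch of how the suggestion above might be applied, the helper below counts the leading one bits of a byte by running the 8-bit clz on its complement. Working on a byte-wide value avoids the signed shift entirely. The names Log2, Clz8, and LeadingOnes are hypothetical (renamed partly to avoid clashing with log2 from <cmath>), and the zero check is an addition needed here because 0xFF complements to 0.

```cpp
#include <cstdint>

// 8-bit integer log2, following the shape of the review suggestion.
inline uint32_t Log2(uint8_t v) {
  const uint32_t r = (v > 15) << 2;  // 4 if the high nibble is in use
  v >>= r;
  const uint32_t s = (v > 3) << 1;   // 2 if bit 2 or 3 is set
  v >>= s;
  v >>= 1;
  return r | s | v;
}

// Count of leading zero bits in a byte. The zero check is an addition:
// without it, Clz8(0) would report 7 instead of 8.
inline uint32_t Clz8(uint8_t v) {
  return v == 0 ? 8 : 7 - Log2(v);
}

// Leading one bits of a byte, computed without any shift into a sign bit.
inline uint32_t LeadingOnes(uint8_t b) {
  return Clz8(static_cast<uint8_t>(~b));
}
```

For example, LeadingOnes(0xE2) is 3 and LeadingOnes(0x41) is 0. How that count maps onto the PR's extrabytes variable is not something the thread settles, so the sketch stops at the bit count.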

bnoordhuis · 2015-04-01T21:48:44Z

src/util.cc

+    } else {
+      switch (extrabytes) {
+        case 5:
+          glyph += (uint8_t) input[i++];


Shouldn't this mask off the high bits? Also, when is extrabytes == 5 for valid UTF-8?
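On the second question: UTF-8 as defined by RFC 3629 uses at most four bytes per code point, so a valid sequence never has more than three extra bytes. On the first, there are two common ways to handle the tag bits, sketched below. One accumulates raw bytes and subtracts a per-length offset at the end, which appears to be what the offsets_from_utf8 table elsewhere in this diff is for. The other masks each byte as it goes. The offset constants are assumed to match the well-known ConvertUTF.c table; the PR's own values are not shown in the thread. Neither function validates its input.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed to match ConvertUTF.c's offsetsFromUTF8 for 0..3 extra bytes.
static const uint32_t kOffsets[4] = {
    0x00000000u, 0x00003080u, 0x000E2080u, 0x03C82080u};

// Style 1: add raw bytes, shift by 6, and cancel the tag bits at the end.
inline uint32_t DecodeWithOffsets(const uint8_t* p, size_t extrabytes) {
  uint32_t glyph = 0;
  for (size_t i = 0; i < extrabytes; ++i) {
    glyph += p[i];
    glyph <<= 6;
  }
  glyph += p[extrabytes];
  return glyph - kOffsets[extrabytes];
}

// Style 2: mask the tag bits off each byte before combining.
inline uint32_t DecodeWithMasks(const uint8_t* p, size_t extrabytes) {
  static const uint8_t kLeadMask[4] = {0x7F, 0x1F, 0x0F, 0x07};
  uint32_t glyph = p[0] & kLeadMask[extrabytes];
  for (size_t i = 1; i <= extrabytes; ++i)
    glyph = (glyph << 6) | (p[i] & 0x3F);
  return glyph;
}
```

For well-formed input both styles give the same code point, for instance U+20AC for the bytes E2 82 AC.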

bnoordhuis · 2015-04-01T21:57:10Z

src/util.cc

+      glyph -= offsets_from_utf8[extrabytes];
+
+      if (glyph > UNI_MAX_LEGAL_UTF32 ||
+          (glyph >= UNI_SUR_HIGH_START && glyph <= UNI_SUR_LOW_END)) {


Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

> Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

Compilers are pretty good at replacing idiomatic range comparisons with bit hacks. Have you checked whether it's really necessary to have these ... ?

Har, har.

(But it's true that both clang and gcc manage to pull it off.)
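For reference, a sketch of the range comparison next to a mask form that is exactly equivalent over 32-bit values. The function names are hypothetical. Note that the mask quoted in the thread, read literally, ignores bit 15 and above, so it would also match values such as U+5800; the last helper is included only to show that difference.

```cpp
#include <cstdint>

// Idiomatic range check for UTF-16 surrogate code points.
constexpr bool IsSurrogateRange(uint32_t g) {
  return g >= 0xD800 && g <= 0xDFFF;
}

// Equivalent mask form: the top 21 bits must equal those of 0xD800.
constexpr bool IsSurrogateMask(uint32_t g) {
  return (g & 0xFFFFF800u) == 0xD800u;
}

// The narrower mask as quoted in the thread, for comparison only.
constexpr bool QuotedMaskMatches(uint32_t g) {
  return (g & 0x7800u) == 0x5800u;
}

static_assert(IsSurrogateMask(0xD800) && IsSurrogateMask(0xDFFF), "bounds");
static_assert(!IsSurrogateMask(0xD7FF) && !IsSurrogateMask(0xE000), "edges");
static_assert(QuotedMaskMatches(0x5800) && !IsSurrogateRange(0x5800),
              "the quoted mask is wider than the surrogate range");
```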

jbergstroem · 2015-04-02T00:26:52Z

Now that #1199 landed -- could we perhaps add unit tests for this?

chrisdickinson · 2015-04-02T03:04:54Z

Nixed clz support, since a quick benchmark showed that @bnoordhuis' implementation was faster.

bnoordhuis · 2015-04-02T22:41:30Z

src/util.cc

+inline size_t Skip(
+    const size_t remaining,
+    const uint8_t* input,
+    const size_t glyph_size) {


Style issue: the first argument goes on the same line as the function name and the other arguments should line up below it. The only time that's deviated from is when the 80 column limit is exceeded.
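Applied to the signature in this hunk, the layout described above would look roughly like the following. Only the signature comes from the diff. The function body is a hypothetical placeholder so that the snippet compiles, and it says nothing about what the PR's Skip actually does.

```cpp
#include <cstddef>
#include <cstdint>

// First argument on the same line as the name, the rest aligned beneath it.
inline size_t Skip(const size_t remaining,
                   const uint8_t* input,
                   const size_t glyph_size) {
  // Placeholder body (assumption): never advance past the end of the input.
  (void)input;
  return glyph_size < remaining ? glyph_size : remaining;
}
```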

bnoordhuis · 2015-04-02T23:13:20Z

I second @jbergstroem's sentiment on unit tests. :-)

petkaantonov · 2015-04-03T18:36:30Z

Also always interesting to run on a UTF-8 decoder is the UTF-8 decoder stress test.

Fishrock123 · 2015-05-25T21:48:42Z

So I think I heard at some point v8 has a version of this for TypedArrays? Am I mistaken or will this soon be obsolete?

Fishrock123 · 2015-09-14T13:52:07Z

@trevnorris do TypedArrays give us any of this functionality?

trevnorris · 2015-09-14T23:03:35Z

Since this isn't used anywhere, I'm not even sure what it's for, but typed arrays don't have anything like this, afaik.

jasnell · 2015-10-22T01:59:05Z

@chrisdickinson @trevnorris ... what's the status on this one?

chrisdickinson · 2016-01-30T08:33:46Z

Closing this due to inactivity.

wip,src: add utf8 consumer/validator

58c62a2

bnoordhuis reviewed Apr 1, 2015

nix C-style casts

e7918fe

bnoordhuis reviewed Apr 1, 2015

chrisdickinson added 2 commits April 1, 2015 14:42

style fixes, switch from char* to uint8_t*

09cd52f

switch remaining functions from static to inline

5cb6d3d

bnoordhuis reviewed Apr 1, 2015

mscdex added the c++ label Apr 1, 2015

bnoordhuis reviewed Apr 1, 2015

chrisdickinson added 2 commits April 1, 2015 19:11

review fixes

7cd5465

nix clz – turns out @bnoordhuis' implementation is faster 🏁

e4bc82f

bnoordhuis reviewed Apr 2, 2015

chrisdickinson added the land-on-master label Apr 17, 2015

brendanashworth added the wip label May 11, 2015

jasnell added the stalled label Nov 16, 2015

SEAPUNK referenced this pull request in websockets/ws Dec 22, 2015

[major] No longer use the binary addons as optional dependency, nuked…

49b1109

… completely

chrisdickinson closed this Jan 30, 2016

pr-preview bot mentioned this pull request Nov 1, 2022

Add the Priority Hints changes to the fetch spec whatwg/fetch#1523

Merged

4 tasks
