feat: saxes handles chunks that "break" unicode
Because JavaScript strings are sequences of UTF-16 code units, a string can be
split in the middle of a code point that is encoded as a surrogate pair, e.g.
``"\u{1F4A9}"[0]; "\u{1F4A9}"[1]``. If a large piece of data was cut into
smaller chunks that were fed in sequence to saxes, and a character happened to
be split this way, saxes would raise an error and fail. Saxes now detects when
a chunk ends with a high surrogate and carries it over to the next chunk.
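
For illustration, the split described above can be reproduced in plain JavaScript. This is a minimal sketch that only uses standard string methods; the variable names are chosen for the example and do not come from saxes.

```javascript
// U+1F4A9 lies outside the Basic Multilingual Plane, so a JavaScript string
// stores it as two UTF-16 code units: a high surrogate, then a low surrogate.
const poop = "\u{1F4A9}";

const high = poop.charCodeAt(0); // falls in the high range 0xD800-0xDBFF
const low = poop.charCodeAt(1);  // falls in the low range 0xDC00-0xDFFF

console.log(poop.length);                          // 2
console.log(high.toString(16), low.toString(16));  // d83d dca9

// Indexing yields lone surrogates; only together do they form the character.
const a = poop[0];
const b = poop[1];
console.log(a + b === poop);                       // true
```

A chunk boundary that falls between ``a`` and ``b`` is the case this commit handles.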
lddubeau committed Oct 2, 2019
1 parent 7b3db75 commit 1272448
Showing 3 changed files with 58 additions and 10 deletions.
12 changes: 12 additions & 0 deletions README.md
@@ -247,6 +247,18 @@ want. `additionalNamespaces` applies before `resolvePrefix`.
converted to ``\u000A`` prior to parsing. The optimal code path for saxes is a
file in which all end of line characters are already ``\u000A``.

+* Don't split Unicode strings you feed to saxes across surrogates. When you
+naively split a string in JavaScript, you run the risk of splitting a Unicode
+character into two surrogates. For example, ``a`` and ``b`` below each contain
+half of a single Unicode character: ``const a = "\u{1F4A9}"[0];
+const b = "\u{1F4A9}"[1]``. If you feed such split surrogates to versions of
+saxes prior to 4, you get errors. Saxes version 4 and later can detect when a
+chunk of data ends with a high surrogate and carry that surrogate over to the
+next chunk. However, this operation entails slicing and concatenating strings,
+so if you can feed your data in a way that does not split surrogates, you
+should do so. (Feeding all the data at once with a single write is fastest.)
+
## FAQ

Q. Why has saxes dropped support for limiting the size of data chunks passed to
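
One way to follow the README advice above is to choose chunk boundaries that never fall inside a surrogate pair. The sketch below is only an illustration: ``safeChunks`` is a hypothetical helper that is not part of saxes, and it assumes the input is already available as a single string.

```javascript
// Hypothetical helper (not part of saxes): split `text` into chunks of at most
// `size` UTF-16 code units without separating a surrogate pair.
function* safeChunks(text, size) {
  if (size < 2) {
    throw new RangeError("size must be at least 2 to hold a surrogate pair");
  }
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    const last = text.charCodeAt(end - 1);
    // If the chunk would end on a high surrogate and more text follows, back
    // off by one code unit so the whole pair lands in the next chunk.
    if (end < text.length && last >= 0xD800 && last <= 0xDBFF) {
      end--;
    }
    yield text.slice(start, end);
    start = end;
  }
}

const chunks = [...safeChunks("<a>\u{1F4A9}</a>", 4)];
console.log(chunks.length); // 3
```

Each chunk could then be passed to ``parser.write`` in order, followed by ``parser.close()``, as the test in this commit does with its two slices.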
24 changes: 14 additions & 10 deletions lib/saxes.js
@@ -344,7 +344,7 @@ class SaxesParser {
     // effects.
     //
     this.prevI = 0;
-    this.trailingCR = false;
+    this.carriedFromPrevious = undefined;
     this.originalNL = true;
     this.forbiddenState = FORBIDDEN_START;
     /**
@@ -573,20 +573,24 @@ class SaxesParser {
     // isn't. (There may be Node-specific code that would perform faster than
     // ``Array.from`` but don't want to be dependent on Node.)

-    if (this.trailingCR) {
-      // The previous chunk had a trailing cr. We need to handle it now.
-      chunk = `\r${chunk}`;
-      this.trailingCR = false;
+    if (this.carriedFromPrevious !== undefined) {
+      // The previous chunk had char we must carry over.
+      chunk = `${this.carriedFromPrevious}${chunk}`;
+      this.carriedFromPrevious = undefined;
     }

     let limit = chunk.length;
-    if (!end && chunk[limit - 1] === "\r") {
-      // The chunk ends with a trailing CR. We cannot know how to handle it
-      // until we get the next chunk or the end of the stream. So save it for
-      // later.
+    const lastCode = chunk.charCodeAt(limit - 1);
+    if (!end &&
+        // A trailing CR or surrogate must be carried over to the next
+        // chunk.
+        (lastCode === CR || (lastCode >= 0xD800 && lastCode <= 0xDBFF))) {
+      // The chunk ends with a character that must be carried over. We cannot
+      // know how to handle it until we get the next chunk or the end of the
+      // stream. So save it for later.
+      this.carriedFromPrevious = chunk[limit - 1];
       limit--;
       chunk = chunk.slice(0, limit);
-      this.trailingCR = true;
     }

     this.chunk = chunk;
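
The carry-over logic in this hunk can be tried in isolation. The following is a standalone sketch, not the saxes implementation: ``prepareChunk`` is a hypothetical function that takes the carried character as an argument instead of reading parser state, and ``CR`` is defined locally as the code of ``\r``.

```javascript
const CR = 0x0D; // character code of "\r"

// Hypothetical standalone version of the logic in the hunk above. Returns the
// text that is safe to parse now and the character (if any) to carry over.
function prepareChunk(carried, chunk, end) {
  if (carried !== undefined) {
    // Re-attach the character held back from the previous chunk.
    chunk = `${carried}${chunk}`;
  }
  let next;
  const lastCode = chunk.charCodeAt(chunk.length - 1);
  // A trailing CR or high surrogate cannot be interpreted until the next
  // chunk arrives, unless this is the end of the stream.
  if (!end && (lastCode === CR || (lastCode >= 0xD800 && lastCode <= 0xDBFF))) {
    next = chunk[chunk.length - 1];
    chunk = chunk.slice(0, -1);
  }
  return { chunk, carried: next };
}

// Same slicing as the "sliced" test in this commit.
const xml = "<a>\u{1F4A9}</a>";
const first = prepareChunk(undefined, xml.slice(0, 4), false);
const second = prepareChunk(first.carried, xml.slice(4), false);
console.log(first.chunk);                 // <a>
console.log(second.chunk === xml.slice(3)); // true
```

The first call holds back the lone high surrogate, and the second call sees the complete pair at the start of its chunk.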
32 changes: 32 additions & 0 deletions test/unicode.js
@@ -0,0 +1,32 @@
+"use strict";
+
+const { test } = require(".");
+
+describe("unicode test", () => {
+  describe("poop", () => {
+    const xml = "<a>💩</a>";
+    const expect = [
+      ["opentagstart", { name: "a", attributes: {} }],
+      ["opentag", { name: "a", attributes: {}, isSelfClosing: false }],
+      ["text", "💩"],
+      ["closetag", { name: "a", attributes: {}, isSelfClosing: false }],
+    ];
+
+    test({
+      name: "intact",
+      xml,
+      expect,
+    });
+
+    test({
+      name: "sliced",
+      fn(parser) {
+        // This test purposely slices the string into the poop character.
+        parser.write(xml.slice(0, 4));
+        parser.write(xml.slice(4));
+        parser.close();
+      },
+      expect,
+    });
+  });
+});
