Handle CDATA with UTF-8 characters when partial parsing #133

tomtaylor · 2024-08-05T08:17:32Z

Follows on from #122.

In partial mode, a UTF-8 encoded character can be split across two chunks. When this happens for a character such as £, which is encoded as <<0xC2, 0xA3>>, a chunk ends with a lone 0xC2. That byte is neither an ASCII character (<= 127) nor a match for the <<codepoint::utf8>> clause, so Saxy fails with a parser error.
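The failure can be reproduced without Saxy. The snippet below is a minimal sketch: `SplitDemo.classify/1` is a hypothetical function that only mirrors the two clauses described above (ASCII byte, then `<<codepoint::utf8>>`). It is not Saxy's code.

```elixir
# "£" is U+00A3, which UTF-8 encodes as the two bytes 0xC2 0xA3.
<<0xC2, 0xA3>> = "£"

# Simulate a chunk boundary falling in the middle of the character.
<<first::binary-size(1), second::binary>> = "£"

defmodule SplitDemo do
  # Hypothetical classifier mirroring the ASCII and utf8 clauses.
  def classify(<<byte, _rest::binary>>) when byte <= 127, do: :ascii
  def classify(<<_codepoint::utf8, _rest::binary>>), do: :utf8
  def classify(_other), do: :no_match
end

IO.inspect(SplitDemo.classify("£"))             # :utf8
IO.inspect(SplitDemo.classify(first))           # :no_match (lone 0xC2)
IO.inspect(SplitDemo.classify(first <> second)) # :utf8 once the bytes are rejoined
```

The lone lead byte falls through both clauses, which is the situation that surfaces as a parser error when the chunk boundary lands inside the character.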

This fixes that by consuming every byte inside a CDATA element, regardless of its code point. It drops the UTF-8 character optimisation, but I suspect that optimisation is only a minor performance improvement for most documents.
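As a rough illustration of the byte-wise idea (not the actual change in this PR), here is a sketch of a scanner that looks for the CDATA terminator `]]>` one byte at a time and never decodes code points. `CDataSketch` and its return shapes are assumptions made up for this example.

```elixir
defmodule CDataSketch do
  # Hypothetical helper, not Saxy's implementation.
  # Returns {:done, cdata, rest} when "]]>" is found, or
  # {:more, buffer} when the caller must append the next chunk and retry.
  def scan(buffer), do: scan(buffer, buffer, 0)

  # Terminator found: everything before it is the CDATA content.
  defp scan(<<"]]>", rest::binary>>, original, len),
    do: {:done, binary_part(original, 0, len), rest}

  # Any other byte is consumed as-is, whether ASCII or part of a
  # multi-byte UTF-8 sequence.
  defp scan(<<_byte, rest::binary>>, original, len),
    do: scan(rest, original, len + 1)

  # Ran out of input before seeing the terminator.
  defp scan(<<>>, original, _len), do: {:more, original}
end

# The chunk boundary falls between the two bytes of "£".
<<chunk1::binary-size(7), chunk2::binary>> = "Price £5]]></a>"

{:more, pending} = CDataSketch.scan(chunk1)
IO.inspect(CDataSketch.scan(pending <> chunk2))
# {:done, "Price £5", "</a>"}
```

Because the scanner only counts bytes, a split character is harmless: the two halves are rejoined when the buffers are concatenated. This sketch rescans the whole buffer on each retry, which keeps it short but is not how an optimised parser would resume.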

@qcam is this a more prevalent issue than my use case? I can see why matching on UTF-8 codepoint and swallowing the whole character is a nice optimisation, but I wonder if it might cause issues in other places when partial parsing.

Don't assume that we're always seeing a full UTF-8 character. In partial mode, UTF-8 encoded characters might be split across multiple chunks.

tomtaylor · 2024-08-13T06:42:48Z

@qcam any thoughts on this?

qcam · 2024-10-22T13:38:34Z

lib/saxy/parser/builder.ex

            element_cdata(rest, more?, original, pos, state, len + 1)

-          <<codepoint::utf8>> <> rest ->
-            element_cdata(rest, more?, original, pos, state, len + Utils.compute_char_len(codepoint))


I think we can handle this the same way dangling UTF-8 fragments are handled elsewhere in the parser.

For example https://github.com/qcam/saxy/blob/master/lib/saxy/parser/builder.ex#L540-L541
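The linked lines are not reproduced here, so the following is only a generic sketch of the idea, under the assumption that "dangling fragment" means a chunk ending in a proper prefix of a multi-byte UTF-8 sequence. `DanglingSketch`, `next/2`, and the `:halt` / `:error` return values are hypothetical names for this example, not Saxy's API.

```elixir
defmodule DanglingSketch do
  # True when `bin` is a lead byte followed by fewer continuation bytes
  # than the lead byte announces. Simplified: overlong and surrogate
  # ranges are not checked.
  def incomplete?(<<lead, tail::binary>>) do
    needed = expected_len(lead)
    needed > 1 and byte_size(tail) < needed - 1 and all_continuation?(tail)
  end

  def incomplete?(_other), do: false

  # Consume one character, or decide what to do when that is impossible.
  def next(<<>>, true), do: :halt
  def next(<<byte, rest::binary>>, _more?) when byte <= 127, do: {:ok, 1, rest}
  def next(<<cp::utf8, rest::binary>>, _more?), do: {:ok, byte_size(<<cp::utf8>>), rest}
  # More input may arrive: wait if the tail looks like a split character.
  def next(bin, true), do: if(incomplete?(bin), do: :halt, else: :error)
  def next(_bin, false), do: :error

  defp expected_len(b) when b >= 0xC2 and b <= 0xDF, do: 2
  defp expected_len(b) when b >= 0xE0 and b <= 0xEF, do: 3
  defp expected_len(b) when b >= 0xF0 and b <= 0xF4, do: 4
  defp expected_len(_b), do: 1

  defp all_continuation?(<<>>), do: true

  defp all_continuation?(<<b, rest::binary>>) when b >= 0x80 and b <= 0xBF,
    do: all_continuation?(rest)

  defp all_continuation?(_other), do: false
end

IO.inspect(DanglingSketch.next(<<0xC2>>, true))  # :halt, wait for 0xA3
IO.inspect(DanglingSketch.next(<<0xC2>>, false)) # :error, no more input
IO.inspect(DanglingSketch.next("£5", true))      # {:ok, 2, "5"}
```

Compared with consuming raw bytes, this keeps the code point clause and its length bookkeeping, at the cost of an extra clause that pauses parsing until the rest of the character arrives.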

Handle CDATA containing partial UTF-8 characters

9475427


tomtaylor changed the title ~~Handle CDATA containing partial UTF-8 characters~~ Handle CDATA with UTF-8 characters when partial parsing Aug 5, 2024

tomtaylor mentioned this pull request Aug 5, 2024

CDATA element fails to parse when element contains £ symbol #122

Open

qcam reviewed Oct 22, 2024
