Invalid xmlChar value 55357, line 49, column 101 #1

kmanwar89 · 2018-01-13T15:07:35Z

Hi there,

Ran your script and received the following error text as output:

XXXXXXX@ubuntu:~/Desktop/Untitled Folder/smsxml2html-master$ python smsxml2html.py -o ~/Desktop -n 11111111111 sms-20171119175633.xml
Traceback (most recent call last):
File "smsxml2html.py", line 241, in <module>
main()
File "smsxml2html.py", line 223, in main
tree = etree.parse(input, parser = lxml_parser)
File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:85131)
File "src/lxml/parser.pxi", line 1782, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:124005)
File "src/lxml/parser.pxi", line 1808, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:124374)
File "src/lxml/parser.pxi", line 1712, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:123169)
File "src/lxml/parser.pxi", line 1115, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:117533)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111124)
lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 55357, line 49, column 101

Any help would be appreciated. I ran this on an Ubuntu 16.04.1 VM and received the same output whether using Python 2.X or 3.X.

KermMartian · 2018-01-13T22:34:19Z

Thanks! It sounds like your input XML contains the 😄 character (https://apps.timwhitlock.info/unicode/inspect?s=%F0%9F%98%84), and I'm not handling that properly. Can you please view the XML in a Unicode-aware text editor, check line 49, column 101, and confirm that that's what it contains?

T2Fr · 2018-08-14T21:15:36Z

Seems to be working with lxml_parser = XMLParser(huge_tree = True, recover = True)

akobel · 2018-08-22T09:13:02Z

Had a pile of poo here, literally: 💩 (which, apparently like anything encoded as a surrogate pair, causes trouble). It was encoded as &#55357;&#56489;. Other than that, some more smileys.
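For reference, here is a small sketch of the arithmetic behind such a pair. The function name combine_surrogates is made up for illustration; the ranges and the formula are the standard UTF-16 ones. 55357 is 0xD83D (a high surrogate) and 56489 is 0xDCA9 (a low surrogate), and together they give U+1F4A9.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a surrogate pair: %d, %d' % (high, low))
    # Each surrogate carries 10 bits of the code point, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(55357, 56489)
print('U+%04X' % code_point)  # U+1F4A9
```

A lone &#55357; is therefore only half of a character, which is presumably why the parser rejects it as an invalid xmlChar value.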

The following piece of code made it work for me:

import re
from io import BytesIO
from lxml import etree

lxml_parser = etree.XMLParser (huge_tree = True)

# `input` is the path of the XML file, as in main()
payload = open (input, 'rb').read().decode ('utf-8')
# replace numeric entities by the characters they stand for
chrifnotspecial = lambda dec: '&#%d;' % dec if dec in [ 10, 13, 35, 38, 59, 60, 62 ] else chr (dec) # don't convert `\n\r#&;<>`
payload = re.sub (r'&#(\d+);', lambda x: chrifnotspecial (int (x.group(1))), payload)
# No idea why 'INFORMATION SEPARATOR's ended up in some messages,
# but I decide that I don't need them, and they make the parser barf out...
for dec in [ 28, 29, 30, 31 ]:
    payload = payload.replace (chr (dec), '')

# combine surrogate pairs
payload = payload.encode ('utf-16', 'surrogatepass').decode ('utf-16')
payload = payload.encode ('utf-8')

# parse from bytes, so that the encoding declaration in line 1 of the XML is honoured
tree = etree.parse (BytesIO (payload), parser = lxml_parser)
root = tree.getroot()

Obviously, this should replace the existing code in main() for reading the input file.

This assumes that you are using Python 3 (required for chr() with inputs > 255; unichr() of Python 2 should do the job as well, but I didn't test). smsxml2html is almost Python3-compatible except for two minor parts: You have to replace the two occurrences of iteritems with items, and msg.text.encode('utf8') by msg.text (or msg.text.strip(), possibly, if you preserve whitespace, but want to drop superfluous spaces at beginning and end of a message). If encoding='utf-8' is given as an additional argument for open(output_path, 'w'), I guess that this should even be fully backwards compatible.
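One caveat on the backwards-compatibility guess: the built-in open() of Python 2 does not accept an encoding argument, whereas io.open() does on both Python 2.6+ and Python 3. Below is a minimal sketch of how the output side could look. The names write_report, read_report and conversations are hypothetical and not taken from smsxml2html.

```python
import io
import os
import tempfile

def write_report(output_path, html):
    """Write text as UTF-8; io.open takes encoding= on Python 2.6+ and 3."""
    with io.open(output_path, 'w', encoding='utf-8') as handle:
        handle.write(html)  # must be a text (unicode) string, not bytes

def read_report(path):
    with io.open(path, 'r', encoding='utf-8') as handle:
        return handle.read()

# Hypothetical stand-in for the script's per-conversation data.
conversations = {u'11111111111': [u'Hi \U0001F604']}

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.html')
# .items() exists on both Python versions; iteritems() is Python 2 only.
lines = [u'%s: %s' % (number, u' / '.join(messages))
         for number, messages in conversations.items()]
write_report(demo_path, u'\n'.join(lines))
```

With io.open() in text mode, the strings passed to write() have to be text rather than encoded bytes, which fits with dropping the .encode('utf8') call as described above.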

I'm not an expert on Unicode representation by any means. IIUC, the encoding used in those XML files is more UTF-16-ish than UTF-8-ish, and "normal" (more clever) means to convert (like passing the XML to BeautifulSoup) fail because an entity-wise conversion yields invalid results in some intermediate stage (since &#55357; does not translate to a character, but a "surrogate code point", which is more like a modifier for the next character). I might well be totally wrong here; almost my entire wisdom is based on a StackOverflow post concerning &#55357; and some pile of poo reference.
Anyway, my understanding of what I do is: convert each entity independently and blindly, then encode to UTF-16 while leaving those surrogate pairs alone, then decode again so that each pair is read and interpreted as one character, and finally encode to the more well-received (at least to me) UTF-8.
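To make those stages concrete, here is a sketch that traces them on a one-line sample using only the standard library. The helper names decode_entities and join_surrogates and the sample string are invented for illustration; the logic follows the snippet above.

```python
import re

KEEP_AS_ENTITY = {10, 13, 35, 38, 59, 60, 62}  # \n \r # & ; < >

def decode_entities(text):
    """Replace numeric entities by characters, except the protected ones."""
    def repl(match):
        dec = int(match.group(1))
        return match.group(0) if dec in KEEP_AS_ENTITY else chr(dec)
    return re.sub(r'&#(\d+);', repl, text)

def join_surrogates(text):
    """Round-trip through UTF-16 so that surrogate pairs become one character."""
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

sample = 'body="Oops &#55357;&#56489;&#10;sorry"'

step1 = decode_entities(sample)   # contains two lone surrogates
step2 = join_surrogates(step1)    # contains the single character U+1F4A9
as_bytes = step2.encode('utf-8')  # now a valid UTF-8 payload

try:
    step1.encode('utf-8')
    step1_is_encodable = True
except UnicodeEncodeError:
    # Lone surrogates cannot be written as UTF-8 without 'surrogatepass'.
    step1_is_encodable = False
```

The intermediate string step1 is the "invalid in some intermediate stage" state: it cannot be encoded as UTF-8 directly, which is why the detour via UTF-16 with 'surrogatepass' is needed.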
By the way, no clue why apparently I need to go via the BytesIO, but this works. Using etree.fromstring() instead of etree.parse() did not (although, AFAICS, this should do the same after removing the encoding tag in line 1 of the XML?)...

Caveat: this is a pretty brute-force-ish hands-on approach. Worked for me; YMMV.
I found some evidence that, apparently, funny characters in XML attributes are not really covered by the XML standard, although it seems that XML 1.1 relaxed it somewhat. In any case, the files produced by SMS Backup & Restore seem to not strictly obey the standard in all cases.
This is pretty much "best-effort recovery", with lowest effort for me.

Note that I preserve, among others, the &#10; encoding for linebreaks, which would be silently converted to normal spaces by the LXML parser. I like to have white-space: pre-wrap; in the CSS for .month_convos td; I appreciate if my conversation partners spend the effort to type line breaks, so who am I to drop them in the archives?
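The difference can be seen in a short sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made to keep the example dependency-free). Attribute-value normalization is part of the XML specification, so lxml should behave the same way. The element and attribute names are made up.

```python
import xml.etree.ElementTree as ET

# Line break written as a character reference: survives parsing.
with_entity = ET.fromstring('<sms body="line one&#10;line two"/>')

# Line break written as a literal newline inside the attribute value:
# normalized to a single space by the parser.
with_literal = ET.fromstring('<sms body="line one\nline two"/>')

print(repr(with_entity.get('body')))   # 'line one\nline two'
print(repr(with_literal.get('body')))  # 'line one line two'
```

So converting the entity for a newline into a literal newline before parsing would lose the line break, while leaving the entity in place keeps it.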

akobel · 2018-08-22T09:17:43Z

By the way, @T2Fr: recover = True "Seems to be working" - but, unfortunately, at the expense of ignoring the character. These days, that can mean "discarding the message", which would be 💩☹... 😄
