Japanese version of arabic numerals appear to be removed by textract #145
Comments
I can take a look at this. It is about time to cycle through the latest asks/bugs and get a new version out. =)
This happens on Linux, too, with antiword. It is happening with all of doc, odt, and pdf. (In all my tests, the test files were made by LibreOffice, loading from a UTF-8 plain text file.) Double-width parentheses are also being lost (FF08, FF09). In each case, a string of one or more of these characters is being replaced with a single space. However, ten (FF64) and maru (FF61) are coming through fine, and they are in that same Unicode block.
I'm noticing that antiword on OSX seems to handle Japanese zenkaku characters (numbers and parentheses) correctly. It would be cool to be able to override which word processor is used on which platform.
The bug is in the regexes at the top of lib/extract.js. If I change the I get no new test failures (I'll describe the test failures I do get in a separate GitHub issue, in a moment). That is a band-aid fix, though.

I think the deeper problem is that the code uses a whitelist to "not remove anything that is not whitespace", when it should use a blacklist to "remove whitespace". Are there actually any more newline characters than the ones listed here: https://en.wikipedia.org/wiki/Newline#Unicode? (Checking some of my own code that does something similar: I treat \u0085, \u2028, and \u2029 the same as \n, and I also convert \u2009, the narrow space, to a normal ASCII space.)

Is the intention of that code also to remove control characters? See https://en.wikipedia.org/wiki/Unicode_control_characters and https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C1_set
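To make the whitelist-vs-blacklist distinction concrete, here is a minimal sketch. The `whitelistClean` regex below is a hypothetical stand-in for the kind of pattern described above (it is not the actual regex from lib/extract.js); the character ranges chosen are illustrative assumptions only.

```javascript
// Hypothetical whitelist-style cleanup: anything OUTSIDE the allowed set is
// collapsed to a single space. Fullwidth digits (U+FF10-FF19) and fullwidth
// parentheses (U+FF08/FF09) fall outside the allowed ranges, so they vanish.
const whitelistClean = (text) =>
  text.replace(/[^\x20-\x7E\u3000-\u30FF\u4E00-\u9FFF]+/g, ' ');

// Blacklist-style cleanup: normalize ONLY the known Unicode line separators
// (\n, \r, NEL U+0085, LINE SEPARATOR U+2028, PARAGRAPH SEPARATOR U+2029)
// and leave every other character alone.
const blacklistClean = (text) =>
  text.replace(/[\r\n\u0085\u2028\u2029]+/g, ' ');

// Sample text mixing CJK, fullwidth digits, and fullwidth parentheses.
const sample = '平成\uFF12\uFF10\uFF11\uFF18年\uFF08test\uFF09';

console.log(whitelistClean(sample)); // fullwidth digits/parens collapsed to spaces
console.log(blacklistClean(sample)); // fullwidth characters survive intact
```

With the blacklist approach, only characters positively identified as line separators are touched, so the zenkaku characters from the bug report pass through unchanged.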
Textract installed great on OSX 10.12.6 for me, and is working fine for extracting English text from doc files.
However, I note a problem with the Japanese version of Arabic numerals. In both Japanese doc and text files run through textract, the main Japanese text comes through fine, but where there were Arabic numerals (e.g. 2018) in Japanese text format, they are removed from the output.
becomes
Has anyone experienced anything similar?
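For anyone hitting the same thing: the "Japanese version of Arabic numerals" are the fullwidth digits U+FF10 to U+FF19 (with fullwidth parentheses at U+FF08/U+FF09) from Unicode's Halfwidth and Fullwidth Forms block. This is not a fix for textract itself, but as an illustration, if your pipeline needs plain ASCII digits anyway, NFKC normalization maps the fullwidth forms to their ASCII equivalents:

```javascript
// Fullwidth "2018" as reported in the issue: U+FF12 U+FF10 U+FF11 U+FF18.
const zenkaku2018 = '\uFF12\uFF10\uFF11\uFF18';

// NFKC (compatibility) normalization folds fullwidth forms to ASCII.
console.log(zenkaku2018.normalize('NFKC')); // "2018"
```

That only helps as a post-processing workaround once the characters actually make it through extraction, which is exactly what this bug prevents.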