Garbled Chinese Latin1 --> UTF-8 #206
Comments
The "original" appears to be UTF-8.
The original is UTF-8 with BOM, though it claims to be ISO-8859-1. It's Japanese, not Chinese.
The Ebookmaker text parser appears to believe the claimed encoding. I thought I could change it in the db, but that seems not to have an effect.
The HTML parser hits an error and falls back to the correct encoding.
Will look at it more closely tomorrow.
Aside - the html file for this book includes ruby markup, which is not supported by EPUB2 and thus emits a lot of validation errors.
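The mismatch described above (UTF-8 bytes with a BOM read under a declared ISO-8859-1 label) can be reproduced in a few lines. This is only an illustrative sketch using the Python standard library, not Ebookmaker's actual parsing code; the sample string is a hypothetical stand-in for the book's text.

```python
import codecs

# Hypothetical sample text; any non-ASCII text shows the same effect.
sample = "吾輩は猫である"

# Bytes of a UTF-8 file that starts with a BOM.
raw = codecs.BOM_UTF8 + sample.encode("utf-8")

# Trusting a declared ISO-8859-1 label never raises an error, because every
# byte value maps to some Latin-1 character. The result is simply garbled.
garbled = raw.decode("iso-8859-1")

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
correct = raw.decode("utf-8-sig")

print(repr(garbled[:6]))
print(correct)
```

Because the Latin-1 decode cannot fail, a parser that believes the declared encoding has no error to fall back from, which would be consistent with the text output being garbled while the HTML path recovers.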
On Nov 5, 2023, at 2:12 PM, Greg Newby wrote:
A reader reported this file is garbled. Indeed, I could not get it to display correctly using any viewer:
https://gutenberg.org/cache/epub/37626/pg37626.txt
That's UTF-8 with a BOM.
The original is ISO-8859-1 and doesn't seem to have any problems:
https://www.gutenberg.org/files/37626/37626-8.txt
I would not have guessed that we have Chinese eBooks as Latin1, but this one is. I can look for others if it might help.
I've now examined this problem. While we have tools that can try to guess the encoding of a file, I don't know whether the guess will be more accurate than the encoding reported in the Gutenberg header.

I scanned the 34,867 plain text "ISO-8859-1" files in the collection and checked them with "UnicodeDammit", the encoding guesser in BeautifulSoup. According to UnicodeDammit, 5,701 of them are not ISO-8859-1. The first one I checked was #23892, which was guessed to really be ISO-8859-10. This misencoding seems to have swapped accents, so that 'Esmé' has turned into 'Esmè', a mistake that has widely propagated on the internet.

The most noticeable problems are where, as in #37626, a UTF-8 file is declared as ISO-8859-1. Luckily, this occurs in only 54 of our books. I suggest that they be queued for remediation:

Latin1 /public/vhost/g/gutenberg/html/files/11583/11583-8.txt utf-8
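For the specific case of a UTF-8 file declared as ISO-8859-1, a strict decode is often enough to flag candidates without a general-purpose guesser. The sketch below is a standard-library heuristic, not the UnicodeDammit scan described above, and it cannot tell ISO-8859-1 apart from other single-byte encodings such as ISO-8859-10. The function names `looks_like_utf8` and `find_mislabeled` are hypothetical.

```python
import codecs
from pathlib import Path


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes are plausibly UTF-8 rather than Latin-1."""
    # A UTF-8 BOM is a strong signal on its own.
    if data.startswith(codecs.BOM_UTF8):
        return True
    # Pure ASCII is identical under both encodings, so it is not mislabeled.
    if data.isascii():
        return False
    # Non-ASCII Latin-1 text is very rarely valid UTF-8, so a clean strict
    # decode suggests the file is really UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_mislabeled(paths):
    """Return the paths (assumed to be declared ISO-8859-1) that look like UTF-8."""
    return [p for p in paths if looks_like_utf8(Path(p).read_bytes())]
```

Passing the list of "-8.txt" paths to `find_mislabeled` would give a candidate list to review by hand before queuing anything for remediation.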