Garbled Chinese Latin1 --> UTF-8 #206

gbnewby · 2023-11-05T19:12:46Z

A reader reported this file is garbled. Indeed, I could not get it to display correctly using any viewer:

https://gutenberg.org/cache/epub/37626/pg37626.txt

That's UTF-8 with a BOM.

The original is ISO-8859-1 and doesn't seem to have any problems:

https://www.gutenberg.org/files/37626/37626-8.txt

I would not have guessed that we have Chinese eBooks as Latin1, but this one is. I can look for others if it might help.

nfenwick · 2023-11-05T19:46:46Z

The "original" appears to be UTF-8.
37626-8.txt: Unicode text, UTF-8 (with BOM) text, with very long lines (742), with CRLF line terminators
Though the header claims it is ISO-8859-1, it isn't. Is it possible that an attempt was made to read it as Latin-1 during a conversion to UTF-8? That would have created a severely garbled file.
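
To illustrate the suspected failure, here is a minimal sketch of what happens when UTF-8 bytes are decoded as Latin-1 and then written back out as UTF-8. The sample string is an assumption chosen for illustration; it is not taken from the book.

```python
# Sample Japanese text (illustrative only; not from the actual file).
original = "日本語"

# Correct UTF-8 bytes, including the BOM that the source file carries.
raw = original.encode("utf-8-sig")

# Wrong step: trust the header and decode the bytes as ISO-8859-1.
# Latin-1 maps every byte to a character, so this never raises an error.
garbled = raw.decode("iso-8859-1")

# Writing the result back out as UTF-8 bakes the damage into the new file.
rewritten = garbled.encode("utf-8")

# The damage is reversible as long as no bytes were lost along the way.
recovered = rewritten.decode("utf-8").encode("iso-8859-1").decode("utf-8-sig")

print(repr(garbled))
print(recovered == original)
```

Because the Latin-1 decode never fails, a converter that believes the header has no signal that anything went wrong.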

eshellman · 2023-11-05T20:41:28Z

The original is UTF-8 with BOM, though it claims to be ISO-8859-1. It's Japanese, not Chinese. The Ebookmaker text parser appears to believe the claimed encoding. I thought I could change it in the db, but that seems to have no effect. The HTML parser hits an error and falls back to the correct encoding. Will look at it more closely tomorrow. Aside: the HTML file for this book includes ruby markup, which is not supported by EPUB2 and so produces a lot of validation errors.

eshellman · 2023-11-27T20:50:08Z

I've now examined this problem.

While we have tools that can try to guess the encoding of a file, I don't know whether their guesses will be more accurate than the encoding reported in the Gutenberg header.

I have now scanned the 34,867 plain text "ISO-8859-1" files in the collection and used the encoding guesser in BeautifulSoup, "UnicodeDammit", to check them. 5,701 of them are not ISO-8859-1 according to UnicodeDammit.
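
For readers without BeautifulSoup at hand, the narrower question of whether a file declared as ISO-8859-1 is really UTF-8 can be approximated with the standard library alone. This is a rough stand-in under stated assumptions, not the UnicodeDammit scan described above: the function name `looks_like_utf8` is hypothetical, and unlike UnicodeDammit it cannot tell one legacy 8-bit encoding from another.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is plausibly UTF-8 rather than a Latin-1-style encoding.

    Hypothetical helper for illustration; not part of Ebookmaker or BeautifulSoup.
    """
    try:
        # Strict decoding rejects byte sequences that are not well-formed UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid under both encodings, so it is no evidence either way.
    # A UTF-8 BOM is itself non-ASCII, so files that start with one pass this test.
    return not data.isascii()


print(looks_like_utf8("Esmé".encode("utf-8")))
print(looks_like_utf8("Esmé".encode("iso-8859-1")))
```

Accented Latin-1 text almost never forms valid multi-byte UTF-8 sequences by accident, which is why a strict decode works as a cheap first filter.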

The first one I checked was #23892, which was guessed to really be ISO-8859-10. This misencoding seems to have swapped accents, so that 'Esmé' has turned into 'Esmè', a mistake that has widely propagated on the internet.

The most noticeable problems are where, as in #37626, a UTF-8 file is declared as ISO-8859-1.

Luckily, this occurs in only 54 of our books. I suggest that they be queued for remediation:

Latin1 /public/vhost/g/gutenberg/html/files/11583/11583-8.txt utf-8
Latin1 /public/vhost/g/gutenberg/html/files/11584/11584-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/12421/12421-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/13212/13212-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/13733/13733-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/13890/13890-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/14163/14163-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/14565/14565-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/15686/15686-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/15756/15756-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/18224/18224-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/20247/20247-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/21342/21342-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/22091/22091-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/22261/22261-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/22457/22457-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/24090/24090-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/26006/26006-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/26038/26038-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/30054/30054-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/33798/33798-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/35951/35951-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/36833/36833-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/37209/37209-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/37566/37566-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/37852/37852-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39281/39281-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39338/39338-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39740/39740-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39771/39771-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39961/39961-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/40227/40227-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/40409/40409-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/40988/40988-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/41707/41707-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/42090/42090-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/42269/42269-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/44476/44476-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/48470/48470-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/50082/50082-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/54860/54860-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56325/56325-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56462/56462-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56498/56498-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56617/56617-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56635/56635-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/57813/57813-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/58781/58781-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/59145/59145-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/59540/59540-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/59706/59706-8.txt utf-8
n /public/vhost/g/gutenberg/html/files/60254/60254-8.txt utf-8
A /public/vhost/g/gutenberg/html/files/61394/61394-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/62252/62252-8.txt utf-8
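
To turn the list above into a remediation queue, the ebook number can be pulled out of each line. A minimal sketch, assuming the three-column layout shown (declared encoding, path, guessed encoding); the pattern and the name `parse_report_line` are hypothetical and not part of any existing tool.

```python
import re

# Assumed layout: "<declared> <path containing /files/<ebook>/...> <guessed>".
REPORT_LINE = re.compile(
    r"^(?P<declared>\S+) \S*/files/(?P<ebook>\d+)/\S+ (?P<guessed>\S+)$"
)


def parse_report_line(line):
    """Return (ebook_number, declared, guessed), or None if the line does not match."""
    match = REPORT_LINE.match(line.strip())
    if match is None:
        return None
    return int(match["ebook"]), match["declared"], match["guessed"]


sample = "ISO-8859-1 /public/vhost/g/gutenberg/html/files/12421/12421-8.txt utf-8"
print(parse_report_line(sample))
```

The declared column is kept as reported, so entries with an unusual declared value (such as the `n` and `A` lines) still yield an ebook number for the queue.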
