Garbled Chinese Latin1 --> UTF-8 #206

Open
gbnewby opened this issue Nov 5, 2023 · 3 comments

Comments

Collaborator

gbnewby commented Nov 5, 2023

A reader reported this file is garbled. Indeed, I could not get it to display correctly using any viewer:

https://gutenberg.org/cache/epub/37626/pg37626.txt

That's UTF-8 with a BOM.

The original is ISO-8859-1 and doesn't seem to have any problems:

https://www.gutenberg.org/files/37626/37626-8.txt

I would not have guessed that we have Chinese eBooks as Latin1, but this one is. I can look for others if it might help.

nfenwick commented Nov 5, 2023

The "original" appears to be UTF-8.
37626-8.txt: Unicode text, UTF-8 (with BOM) text, with very long lines (742), with CRLF line terminators
Though the header claims it is ISO-8859-1, it isn't. Is it possible that the file was read as Latin-1 during a conversion to UTF-8? That would have created a severely garbled file.
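To illustrate the suspected failure mode, here is a minimal sketch of what happens when UTF-8 bytes are decoded as Latin-1 and then saved again as UTF-8. The sample string is made up for illustration and is not taken from the book, and this is only a guess at what the conversion did, not a confirmed diagnosis.

```python
# Sketch of double encoding: UTF-8 bytes misread as Latin-1, then re-saved as UTF-8.
original = "中文"                        # sample text (assumption, not from #37626)
utf8_bytes = original.encode("utf-8")    # the bytes actually stored in the source file

misread = utf8_bytes.decode("latin-1")   # wrong codec: every byte becomes one character
garbled = misread.encode("utf-8")        # what a converter would then write out

# As long as no bytes were dropped or altered, the damage can be reversed
# by undoing the two steps in the opposite order.
repaired = garbled.decode("utf-8").encode("latin-1").decode("utf-8")
```

Each Chinese character occupies three bytes in UTF-8, so the misread text has three unrelated Latin-1 characters per original character, which matches the "severely garbled" description.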

Collaborator

eshellman commented Nov 5, 2023 via email

@eshellman
Collaborator

I've now examined this problem.

While we have tools that can try to guess the encoding of a file, I don't know whether their guesses will be more accurate than the encoding reported in the Gutenberg header.

I have now scanned the 34,867 plain text files in the collection that are declared as "ISO-8859-1" and checked them with "UnicodeDammit", the encoding guesser in BeautifulSoup. According to UnicodeDammit, 5,701 of them are not ISO-8859-1.

The first one I checked was #23892, which was guessed to really be ISO-8859-10. This misencoding seems to have swapped accents, so that 'Esmé' has turned into 'Esmè', a mistake that has widely propagated on the internet.

The most noticeable problems are where, as in #37626, a UTF-8 file is declared as ISO-8859-1.

Luckily, this occurs in only 54 of our books. I suggest that they be queued for remediation:

Latin1 /public/vhost/g/gutenberg/html/files/11583/11583-8.txt utf-8
Latin1 /public/vhost/g/gutenberg/html/files/11584/11584-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/12421/12421-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/13212/13212-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/13733/13733-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/13890/13890-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/14163/14163-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/14565/14565-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/15686/15686-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/15756/15756-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/18224/18224-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/20247/20247-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/21342/21342-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/22091/22091-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/22261/22261-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/22457/22457-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/24090/24090-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/26006/26006-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/26038/26038-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/30054/30054-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/33798/33798-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/35951/35951-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/36833/36833-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/37209/37209-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/37566/37566-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/37852/37852-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39281/39281-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39338/39338-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39740/39740-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39771/39771-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/39961/39961-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/40227/40227-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/40409/40409-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/40988/40988-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/41707/41707-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/42090/42090-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/42269/42269-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/44476/44476-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/48470/48470-0.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/50082/50082-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/54860/54860-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56325/56325-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56462/56462-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56498/56498-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56617/56617-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/56635/56635-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/57813/57813-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/58781/58781-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/59145/59145-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/59540/59540-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/59706/59706-8.txt utf-8
n /public/vhost/g/gutenberg/html/files/60254/60254-8.txt utf-8
A /public/vhost/g/gutenberg/html/files/61394/61394-8.txt utf-8
ISO-8859-1 /public/vhost/g/gutenberg/html/files/62252/62252-8.txt utf-8
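
Before remediation, each entry above could be double-checked without a guesser. The following is a minimal sketch using only the Python standard library, not the UnicodeDammit scan described earlier; the function name `is_really_utf8` is hypothetical. It assumes that a file declared as Latin-1 is mislabelled if its bytes decode cleanly as strict UTF-8 and contain either a BOM or at least one non-ASCII character.

```python
import codecs


def is_really_utf8(data: bytes) -> bool:
    """Return True if the bytes look like UTF-8 rather than Latin-1.

    Hypothetical helper: a strict decode rejects almost all genuine
    Latin-1 text, because accented Latin-1 bytes are rarely followed
    by valid UTF-8 continuation bytes.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        text = data.decode("utf-8")  # strict mode raises on invalid sequences
    except UnicodeDecodeError:
        return False
    # Pure ASCII is identical in both encodings, so it is not a mislabel.
    return has_bom or not text.isascii()
```

To apply it to a file from the list, read the file in binary mode, for example `is_really_utf8(open(path, "rb").read())`. A short Latin-1 file could in principle pass by coincidence, so a positive result is strong evidence rather than proof.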
