[Bug] Characters like äüö are output incorrectly #19

jamal2362 · 2021-06-26T23:25:44Z

Characters like äüö are output incorrectly on some websites.
These characters are common in German.
The problem does not occur with English text.

Here is a screenshot of how this looks on Google.

Here is a screenshot where it is displayed without problems äüö.

dankito · 2021-06-27T21:47:42Z

I don't think it's a Readability4J issue but that you have to wrap the output in a structure like this to set encoding to UTF-8 (see #2):

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

This is exactly what article.getContentWithUtf8Encoding() does. Does it work for you?
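For illustration, wrapping extracted content in the structure above might look roughly like this in Java. This is only a sketch: the class and the helper name `wrapWithUtf8Meta` are hypothetical and this is not the library's actual code. Note that the meta tag only tells the browser how to interpret the page; it cannot repair text that was already decoded incorrectly.

```java
final class Utf8Wrapper {
    // Hypothetical helper: embeds an HTML fragment in a minimal page that declares UTF-8.
    static String wrapWithUtf8Meta(String fragment) {
        return "<html>\n"
             + " <head>\n"
             + "  <meta charset=\"utf-8\" />\n"
             + " </head>\n"
             + " <body>\n"
             + fragment + "\n"
             + " </body>\n"
             + "</html>";
    }

    public static void main(String[] args) {
        System.out.println(wrapWithUtf8Meta("<p>Gr\u00fc\u00dfe</p>"));
    }
}
```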

jamal2362 · 2021-06-27T21:51:58Z

Hi,
Yes, I'm using article.getContentWithUtf8Encoding() in my code.

So far I have only noticed this strange issue with Google. Other pages work fine with äöü and similar characters.

michaldvorak79 · 2021-10-24T00:08:19Z

@jamal2362 Is it possible the website uses a charset other than UTF-8 and you don't take that into account when creating your stringBuffer?

dankito · 2021-10-26T22:51:58Z

You're right, article.getContentWithUtf8Encoding() didn't take the document's charset into account.

I have now added the method article.getContentWithDocumentsCharsetOrUtf8(), which does exactly that.

But I don't think that will resolve @jamal2362's issue, as the document above, google.de, already has its charset set to UTF-8.

Try version 1.0.8 to see if it solves your issue, but I think the problem lies somewhere else.

michaldvorak79 · 2021-10-27T07:52:04Z

@dankito My apologies, my question was aimed at @jamal2362, sorry if that wasn't clear. I don't think your library does anything wrong. I think the String that's being passed to your library is already wrong, because the code creating the String doesn't check the website encoding.

The same thing actually happened to me and I thought for a while that Readability4J was malfunctioning before realizing it was my own fault :-)

jamal2362 · 2021-11-06T10:30:57Z

@dankito
Thank you for your work!
Unfortunately, this did not help.
Am I doing something wrong in my code?
Do you also have the problem with "google.de"?

@michaldvorak79
What does that mean exactly?
What should I change?

michaldvorak79 · 2021-11-08T09:36:00Z

@jamal2362 What I mean is this: when you download a web page, you have a byte array, right? But Readability4J requires a String, so you have to convert the byte array to a String. For that you need to know the web page's character encoding (or "charset"): whether it's UTF-8 or Windows-1252 or ISO-8859-1 or something else. And you have to tell Java which character encoding the byte array uses, otherwise the String will not be created correctly. For example, if a web page uses the ISO encoding and you convert it into a String using UTF-8, regular English characters will survive (those are the same in both encodings), but special characters will be mangled.
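The effect described above can be reproduced with a few lines of Java. This is a minimal sketch, not code from the thread; the class name `CharsetMismatchDemo` and the sample text are made up for illustration. It encodes a German string as ISO-8859-1 and then decodes the same bytes once with the right charset and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchDemo {
    // Decode bytes with an explicitly named charset instead of relying on a default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Grüße aus Köln", written with escapes so the source file encoding does not matter.
        String original = "Gr\u00fc\u00dfe aus K\u00f6ln";

        // Pretend this is what a server sent for a page encoded as ISO-8859-1.
        byte[] pageBytes = original.getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(pageBytes, StandardCharsets.ISO_8859_1);
        String wrong = decode(pageBytes, StandardCharsets.UTF_8);

        System.out.println(right.equals(original)); // correct charset round-trips the text
        System.out.println(wrong.contains("aus"));  // plain ASCII survives the wrong charset
        System.out.println(wrong.contains("\uFFFD")); // umlauts become replacement characters
    }
}
```

Once the String has been built with the wrong charset, the original characters are gone, so the fix has to happen at the point where the bytes are decoded.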

Charset can normally be obtained from the response HTTP headers or it's included in a <meta> tag in the HTML code.
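As a rough sketch of the header route, the following Java helper pulls the charset out of a `Content-Type` header value and falls back to a default when none is usable. The class `CharsetFromHeader` and the method `charsetOf` are hypothetical names for this example, and the regular expression is a simplification of real header parsing. Looking inside the HTML for a `<meta>` tag is not covered here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetFromHeader {
    // Matches e.g. charset=ISO-8859-1 or charset="utf-8" inside a Content-Type value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or the fallback if it is missing or unknown.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

The resulting `Charset` would then be passed to `new String(bytes, charset)` when converting the downloaded bytes.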

I don't know what your code looks like exactly or how you obtain the data in your stringBuffer, but my theory was that maybe you always create the data in the stringBuffer as UTF-8 and the websites that give you trouble actually use a different character encoding.

You can check your htmlData variable after you create it and see whether it contains the proper special characters, or whether they are already mangled. If the special characters are good in your htmlData and bad in Readability4J's output, then the library is doing something wrong. If the characters are already mangled in htmlData, then you are using the wrong character encoding when turning the byte array into a String.
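One way to run that check without eyeballing the whole page is a small heuristic like the one below. It is only a sketch under assumptions: `MojibakeCheck` and `looksMangled` are hypothetical names, and the test covers just two common symptoms (the U+FFFD replacement character, and "Ã" followed by a character in the range U+0080 to U+00BF, which is what UTF-8 umlauts turn into when read as ISO-8859-1). It can miss other kinds of damage and may occasionally flag legitimate text.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeCheck {
    // Heuristic: true if the text shows typical signs of a wrong decoding step.
    static boolean looksMangled(String text) {
        // Symptom 1: non-UTF-8 bytes were decoded as UTF-8.
        if (text.indexOf('\uFFFD') >= 0) {
            return true;
        }
        // Symptom 2: UTF-8 bytes were decoded as ISO-8859-1 (e.g. "Ã¶" instead of "ö").
        for (int i = 0; i + 1 < text.length(); i++) {
            char next = text.charAt(i + 1);
            if (text.charAt(i) == '\u00C3' && next >= '\u0080' && next <= '\u00BF') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String clean = "K\u00f6ln"; // "Köln"
        String mangled = new String(clean.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(looksMangled(clean));
        System.out.println(looksMangled(mangled));
    }
}
```

Calling something like this on htmlData before handing it to Readability4J would show on which side of the library the characters get damaged.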

codinux-gmbh · 2021-11-09T00:22:55Z

Can you post the code you use to download the web page's HTML, Jamal?

Maybe this code helps you:

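    val uri = "https://google.de" // set your URL here
    // Jsoup downloads the page and decodes the bytes itself (10 second timeout)
    val document = Jsoup.parse(URL(uri), 10000)
    val readability = Readability4JExtended(uri, document.outerHtml())

    // Extract the article from the already correctly decoded HTML
    val article = readability.parse()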
