[Bug] Characters like äüö are output incorrectly #19

Open
jamal2362 opened this issue Jun 26, 2021 · 8 comments

Comments

@jamal2362

jamal2362 commented Jun 26, 2021

Characters like äüö are output incorrectly on some websites.
These characters are used frequently in German.
With English text the problem does not occur.

Here is a screenshot of how this looks on Google.
Screenshot_20210627-012434

Here is a screenshot of a page where äüö are displayed without problems.
Screenshot_20210627-013445

@dankito
Owner

dankito commented Jun 27, 2021

I don't think it's a Readability4J issue. I think you have to wrap the output in a structure like this to set the encoding to UTF-8 (see #2):

<html>
 <head>
  <meta charset="utf-8" /> 
 </head>
 <body>
 <!-- output here -->
 </body>
</html>

This is exactly what article.getContentWithUtf8Encoding() does. Does it work for you?
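To make the wrapping concrete, here is a minimal sketch of the idea. `wrapInUtf8Html` is a hypothetical helper written for illustration; it is not Readability4J's actual implementation, which may differ in detail.

```kotlin
// Hypothetical helper (not part of Readability4J): wraps an extracted
// HTML fragment in a minimal document that declares UTF-8.
fun wrapInUtf8Html(fragment: String): String = buildString {
    append("<html>\n <head>\n  <meta charset=\"utf-8\" />\n </head>\n <body>\n")
    append(fragment)
    append("\n </body>\n</html>")
}

fun main() {
    // The fragment stands in for the article content returned by the library.
    println(wrapInUtf8Html("<p>äöü</p>"))
}
```

Note that the declaration only tells the renderer how to interpret the text. It cannot repair characters that were already decoded incorrectly before the wrapping.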

@jamal2362
Author

Hi,
Yes, I'm using article.getContentWithUtf8Encoding() in my code.

I have only noticed this strange issue with Google so far. Other pages work fine with äöü and similar characters.

Screenshot_20210627-234916__01

@michaldvorak79

michaldvorak79 commented Oct 24, 2021

@jamal2362 Is it possible the website uses a charset other than UTF-8 and you don't take that into account when creating your stringBuffer?

@dankito
Owner

dankito commented Oct 26, 2021

You're right, article.getContentWithUtf8Encoding() didn't take the document's charset into account.

I have now created the method article.getContentWithDocumentsCharsetOrUtf8(), which does exactly that.

But I don't think that will resolve @jamal2362's issue, as the document above, google.de, already has its charset set to UTF-8.

Try version 1.0.8 to see if it solves your issue, but I think the issue lies somewhere else.

@michaldvorak79

@dankito My apologies, my question was aimed at @jamal2362, sorry if that wasn't clear. I don't think your library does anything wrong. I think the String that's being passed to your library is already wrong, because the code creating the String doesn't check the website encoding.

The same thing actually happened to me and I thought for a while that Readability4J was malfunctioning before realizing it was my own fault :-)

@jamal2362
Author

jamal2362 commented Nov 6, 2021

@dankito
Thank you for your work!
Unfortunately, this did not help.
Am I doing something wrong in my code?
Do you also see the problem with "google.de"?

@michaldvorak79
What does that mean exactly?
What should I change?

@michaldvorak79

@jamal2362 What I mean is this: when you download a web page, you have a byte array, right? But Readability4J requires a String, so you have to convert the byte array to a String. For that you need to know the web page's character encoding (or "charset"), whether it's UTF-8, Windows-1252, ISO-8859-1 or something else. You have to tell Java which character encoding the byte array uses, otherwise the String will not be created correctly. For example, if a web page uses ISO-8859-1 and you convert it to a String using UTF-8, regular English characters survive (they are encoded identically in both), but special characters get mangled.
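The effect described here can be reproduced without any network access. The sketch below encodes "äöü" as ISO-8859-1 and decodes the bytes once with the matching charset and once as UTF-8. `decodeAs` is just an illustrative wrapper around the standard `String(bytes, charset)` constructor.

```kotlin
import java.nio.charset.Charset
import java.nio.charset.StandardCharsets

// Illustrative wrapper: turn raw bytes into a String using an explicit charset.
fun decodeAs(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

fun main() {
    // Bytes as an ISO-8859-1 page would deliver them: one byte per character.
    val bytes = "Gr\u00FC\u00DFe".toByteArray(StandardCharsets.ISO_8859_1) // "Grüße"

    val right = decodeAs(bytes, StandardCharsets.ISO_8859_1)
    val wrong = decodeAs(bytes, StandardCharsets.UTF_8)

    println(right) // umlauts intact
    println(wrong) // ASCII letters intact, special characters replaced by U+FFFD
}
```

The ASCII letters come out the same either way, which is why the problem only shows up on pages with non-English text.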

The charset can normally be obtained from the HTTP response headers (Content-Type), or from a <meta> tag in the HTML code.
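As a rough sketch of the header case: the function below pulls the `charset` parameter out of a Content-Type header value and falls back to UTF-8 when it is missing or unknown. `charsetFromContentType` is a hypothetical helper name, and the UTF-8 fallback is an assumption. It does not inspect `<meta>` tags, which would require looking at the HTML itself.

```kotlin
import java.nio.charset.Charset
import java.nio.charset.StandardCharsets

// Hypothetical helper: read the charset parameter from a Content-Type value
// such as "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption).
fun charsetFromContentType(contentType: String?): Charset {
    val name = contentType
        ?.split(';')
        ?.map { it.trim() }
        ?.firstOrNull { it.startsWith("charset=", ignoreCase = true) }
        ?.substringAfter('=')
        ?.trim('"', ' ')
    return try {
        if (name.isNullOrEmpty()) StandardCharsets.UTF_8 else Charset.forName(name)
    } catch (e: IllegalArgumentException) {
        // Covers both illegal and unsupported charset names.
        StandardCharsets.UTF_8
    }
}

fun main() {
    println(charsetFromContentType("text/html; charset=ISO-8859-1"))
    println(charsetFromContentType("text/html"))
}
```

The returned charset would then be passed to `String(bytes, charset)` when converting the downloaded bytes.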

I don't know exactly what your code looks like or how you obtain the data in your stringBuffer, but my theory was that you always create the data in the stringBuffer as UTF-8, while the websites that give you trouble actually use a different character encoding.

You can check your htmlData variable after you create it and see whether it contains the proper special characters or whether they are already mangled. If the special characters are good in your htmlData and bad in Readability4J's output, then the library is doing something wrong. If the characters are already mangled in htmlData, then you are using the wrong character encoding when turning the byte array into a String.
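One way to automate that check is a small heuristic like the one below. `looksMangled` is a hypothetical helper, and it is only a heuristic: it looks for the Unicode replacement character (left behind when bytes were invalid for the charset used) and for the "Ã" plus one more character pattern that appears when UTF-8 bytes are decoded as ISO-8859-1. It can miss other kinds of damage.

```kotlin
// Hypothetical helper: heuristic test for text decoded with the wrong charset.
fun looksMangled(html: String): Boolean {
    // U+FFFD is inserted when bytes are invalid for the charset used to decode.
    if (html.contains('\uFFFD')) return true
    // UTF-8 encodes ä, ö, ü, ß as 0xC3 plus a byte in 0x80..0xBF. Decoded as
    // ISO-8859-1 that shows up as "Ã" (U+00C3) followed by a char in that range.
    return Regex("\u00C3[\u0080-\u00BF]").containsMatchIn(html)
}

fun main() {
    val htmlData = "<p>Gr\u00FC\u00DFe</p>" // stands in for the downloaded page
    println(looksMangled(htmlData))
}
```

If this returns true for htmlData before it is handed to Readability4J, the problem is in the download step rather than in the library.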

@codinux-gmbh

Can you post the code you use to download the web page's HTML, Jamal?

Maybe this code helps you:

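    import java.net.URL
    import net.dankito.readability4j.extended.Readability4JExtended
    import org.jsoup.Jsoup

    val uri = "https://google.de" // set your url here
    // Jsoup downloads the page and decodes the bytes itself (timeout: 10 s)
    val document = Jsoup.parse(URL(uri), 10000)
    val readability = Readability4JExtended(uri, document.outerHtml())

    val article = readability.parse()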
