Encoding issues with Eclipse WTP HTML format special chars #545

Closed
source-knights opened this issue Mar 23, 2020 · 12 comments · Fixed by #550

Comments

@source-knights
Contributor

source-knights commented Mar 23, 2020

Hi, I am using the Spotless Maven plugin version 1.28.0 and Eclipse WTP 4.13.0 (but tried previous versions as well). I'm on Windows 10. I tried 3 different developer machines, all showing the same issue.

Whenever I use Eclipse WTP / Spotless to format HTML5 files, the German special chars üöäÜÖÄß and the Euro sign € are changed to "Ã¼Ã¶Ã¤ÃœÃ–Ã„ÃŸâ‚¬". I understand that this is how the bytes of these chars look if you wrongly read the file with a non-UTF-8 encoding. But as I use UTF-8 in all editors, in the HTML itself, and in the Spotless config, I don't understand why the formatter changes the files like that.
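For illustration, the symptom described above can be reproduced in plain Java by encoding text as UTF-8 and decoding the bytes with a legacy charset. This is a minimal sketch, not code from Spotless or Eclipse WTP; the class name `MojibakeDemo`, the method `misdecode`, and the choice of windows-1252 as the legacy charset are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text as UTF-8, then decodes those bytes with a different charset. */
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 stands in for a legacy platform default charset.
        Charset legacy = Charset.forName("windows-1252");
        // The title text from the sample file, written with Unicode escapes.
        String original = "\u00fc\u00f6\u00e4\u00dc\u00d6\u00c4\u00df\u20ac";
        String garbled = misdecode(original, legacy);
        // Each 2-byte or 3-byte UTF-8 sequence turns into 2 or 3 separate chars.
        System.out.println(original.length() + " chars -> " + garbled.length() + " chars");
    }
}
```

Every multi-byte UTF-8 sequence becomes several single-byte characters, which is why the garbled text is longer than the original.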

I managed to reproduce this in a simple Maven project with only the pom.xml and the HTML file pasted below.

Sample HTML5 file (which I save as UTF-8 in an IDE, Eclipse, IntelliJ or even Notepad++, all leading to the same problem).

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>üöäÜÖÄ߀</title>
</head>
<body>
Test
</body>
</html>

My pom

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sourceknights.test</groupId>
  <artifactId>spotlesstest</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>spotlesstest</name>

  <build>
    <plugins>
      <plugin>
        <groupId>com.diffplug.spotless</groupId>
        <artifactId>spotless-maven-plugin</artifactId>
        <version>1.28.0</version>
        <configuration>
          <encoding>UTF-8</encoding>
          <formats>
            <format>
              <encoding>UTF-8</encoding>
              <includes>
                <include>src/**/*.html</include>
              </includes>
              <eclipseWtp>
                <!-- Specify the WTP formatter type (XML, JS, ...) -->
                <type>HTML</type>
                <!-- Optional, available versions: https://github.com/diffplug/spotless/tree/master/lib-extra/src/main/resources/com/diffplug/spotless/extra/eclipse_wtp_formatters -->
                <version>4.13.0</version>
              </eclipseWtp>
            </format>
          </formats>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Does anyone have an idea what I am doing wrong? All these special chars are proper UTF-8 chars and allowed in HTML5, so they should not be changed.

Thxalot and stay healthy

@source-knights
Contributor Author

source-knights commented Mar 23, 2020

Just to clarify, after formatting with mvn spotless:apply the HTML5 file is changed to

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Ã¼Ã¶Ã¤ÃœÃ–Ã„ÃŸâ‚¬</title>
</head>
<body>

</body>
</html>

@nedtwigg
Member

My untested suspicion is that this bug is specific to the Eclipse WTP formatter, i.e. you would not see it if you used the replace step. My guess is that somewhere in this shim code we have to tell Eclipse to use UTF-8.

raw = super.format(raw);
// Not sure how Eclipse binds the JS formatter to HTML. The formatting is accomplished manually instead.
IStructuredDocument document = (IStructuredDocument) new HTMLDocumentLoader().createNewStructuredDocument();
document.setPreferredLineDelimiter(LINE_DELIMITER);
document.set(raw);
StructuredDocumentProcessor<CodeFormatter> jsProcessor = new StructuredDocumentProcessor<CodeFormatter>(
document, IHTMLPartitions.SCRIPT, JsRegionProcessor.createFactory(htmlFormatterIndent));
jsProcessor.apply(jsFormatter);
return document.get();

Since we're only passing Strings back and forth, and Java Strings are always Unicode, it shouldn't matter, but it wouldn't shock me if there is Eclipse code that round-trips through binary while assuming an old charset unless you explicitly set it. It's easy for us to make a test case that confirms whether or not this is specific to Eclipse WTP, and if it is, there aren't that many places to look for a fix. @fvgh does this seem plausible to you?

@source-knights
Contributor Author

Actually, I currently use the replace step after the Eclipse WTP step to put the special chars back in as a workaround. So the replace step does not have an encoding problem, only Eclipse WTP does. Sadly, the workaround leads to problems with line length: a line can temporarily exceed the max line length and gets wrapped when it should not, because each special char temporarily takes up 2 chars.

I also tested Eclipse WTP with XML; that is fine and leaves the üöä as they are.

@fvgh
Member

fvgh commented Mar 24, 2020

Java uses UTF-16 internally (originally it used UCS-2, but to my understanding they switched).
Spotless uses the configured encoding (in your case UTF-8) for reading and writing.
So according to your configuration, Spotless should do a UTF-8 to UTF-16 conversion for reading, and the reverse conversion afterwards.
Could you provide a HEX dump of the input file?
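The read/format/write flow described above can be sketched as follows. This is only an illustration of the decode and re-encode steps, not the actual Spotless implementation; `EncodingAwareFormat` and `format` are hypothetical names, and the formatter is modelled as a plain `UnaryOperator<String>`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.UnaryOperator;

final class EncodingAwareFormat {
    /** Decodes file bytes with the configured encoding, formats, and re-encodes. */
    static byte[] format(byte[] fileContent, Charset encoding, UnaryOperator<String> formatter) {
        String decoded = new String(fileContent, encoding); // bytes -> UTF-16 String
        String formatted = formatter.apply(decoded);        // formatter only sees Strings
        return formatted.getBytes(encoding);                // UTF-16 String -> bytes
    }

    public static void main(String[] args) {
        byte[] input = "<title>\u00fc\u20ac</title>\n".getBytes(StandardCharsets.UTF_8);
        // With a formatter that changes nothing, the bytes should come back unchanged.
        byte[] output = format(input, StandardCharsets.UTF_8, UnaryOperator.identity());
        System.out.println(Arrays.equals(input, output)); // true
    }
}
```

As long as the same encoding is used on both sides and the formatter works purely on the String, valid UTF-8 input survives the round trip byte for byte.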

When opening the modified file, be aware that neither WTP nor Spotless adds a byte order mark (BOM). If the input contains no BOM, the output contains no BOM.

*NIX users tend not to care about the BOM. If an application sees an extension code, it looks up UTF extensions anyway.
For Windows developers the BOM is crucial, since some editors still default to CP 1252 unless they find a BOM at the beginning of the file.
Be aware that a BOM is optional according to the standard, and I by no means intend to encourage BOM usage.
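To check whether a file starts with a UTF-8 BOM, and to see what happens to it when the bytes are decoded in Java, a small sketch like the following may help. `BomCheck`, `hasUtf8Bom` and `stripBom` are hypothetical helper names, not part of Spotless or WTP.

```java
import java.nio.charset.StandardCharsets;

final class BomCheck {
    /** Returns true if the content starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] content) {
        return content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF;
    }

    /** Removes a leading U+FEFF from already-decoded text, if present. */
    static String stripBom(String decoded) {
        return decoded.startsWith("\uFEFF") ? decoded.substring(1) : decoded;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(withBom));        // true
        System.out.println(decoded.length());           // 2, the BOM is kept as U+FEFF
        System.out.println(stripBom(decoded).length()); // 1
    }
}
```

Java's UTF-8 decoder keeps the BOM as a leading U+FEFF character, so any code that wants to drop it has to do so explicitly.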

Could you provide a HEX dump of the output file? I would like to check whether a BOM got lost or (as I expect) the output is a valid translation of the input without a BOM.

I expect that you switched all your IDEs to use UTF-8 by default, right? If not, I recommend it when you want to work with UTF. I had trouble in the past when a developer (using a JetBrains editor) messed up a UTF-8 file, since there was no BOM.

@fvgh
Member

fvgh commented Mar 24, 2020

@source-knights Sorry, just found a mistake in my previous comment. I would like to see the HEX of input and output.
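If no hex editor is at hand, a hex dump can also be produced with a few lines of Java. This is a minimal sketch; `HexDump` and `toHex` are hypothetical names, and the second line of output only shows what a mis-decoded and re-encoded "ü" would look like, not what the files in this issue actually contain.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HexDump {
    /** Formats bytes as space-separated, two-digit uppercase hex values. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00fc";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8)); // C3 BC

        // Hypothetical damage: UTF-8 bytes read as windows-1252, then written as UTF-8 again.
        String misread = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(toHex(misread.getBytes(StandardCharsets.UTF_8))); // C3 83 C2 BC
    }
}
```

For a real file, the byte array could come from `java.nio.file.Files.readAllBytes`.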

@fvgh
Member

fvgh commented Mar 24, 2020

@nedtwigg I quickly added a test on the WTP side that deals with UTF-8 characters. There were no problems. But I must admit, I am not 100% sure that we handle a BOM correctly. Currently the reading/writing just passes the byte sequence on to the formatters. Not sure whether this is a good idea.

@source-knights
Contributor Author

source-knights commented Mar 24, 2020

Hi, here are the HEX contents. Please also see my comment above that the Eclipse WTP XML formatter does not have that issue.

Input (the one with correct üöäÜÖÄ߀):
[image: hex dump of the input file]

Output:
[image: hex dump of the output file]

@fvgh
Member

fvgh commented Mar 24, 2020

@source-knights I may have found the problem. In the meantime, could you set the Java system property file.encoding to UTF-8? I am afraid I have no better workaround.

@source-knights
Contributor Author

source-knights commented Mar 24, 2020

I can confirm all is fine when I use
mvn spotless:apply -Dfile.encoding=UTF-8
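To see which default charset a JVM actually ends up with, and why calls without an explicit charset depend on it, a small check like this can be run with and without the `-Dfile.encoding=UTF-8` flag. It is a sketch only; `DefaultCharsetCheck` and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DefaultCharsetCheck {
    /** Reports the file.encoding system property and the effective default charset. */
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    /** Shows that getBytes() without arguments uses the platform default charset. */
    static boolean implicitMatchesDefault(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        System.out.println(describe());
        System.out.println(implicitMatchesDefault("\u00fc"));
    }
}
```

On JDK 18 and later the default charset is UTF-8 unless overridden (JEP 400); on older JVMs it follows the operating system locale, which is why the flag made a difference here.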

Thx for looking into this so quickly. If there is anything I can do to help, just shout.

@fvgh
Member

fvgh commented Mar 24, 2020

Took the liberty of deleting a few of my previous comments regarding error analysis. I was in a hurry and lacking caffeine. The comments were not correct.

The Spotless framework ensures a conversion from the specified encoding to the internal format UTF-16. That's also problematic when it comes to the BOM, since Java does not strip it.

However, @nedtwigg was right to suspect my WTP implementation. It always needs to use UTF-16, since Spotless has already done the decoding, as I highlighted in my initial comment.
The BOM is currently stripped by the WTP. This should be discussed in a separate issue, since it does not make sense to give it to the formatters in the first place, as stated before.
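The general idea of using one fixed charset when an already-decoded String has to pass through a stream-based API can be sketched as follows. This is not the actual change made for this issue; `ExplicitCharsetHandoff`, `toStream` and `readAll` are hypothetical names, and the stream consumer is a stand-in for whatever component reads the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetHandoff {
    /** Wraps an already-decoded String as a stream, using a charset chosen by the caller. */
    static InputStream toStream(String text, Charset charset) {
        return new ByteArrayInputStream(text.getBytes(charset));
    }

    /** Hypothetical stream consumer that decodes with the charset it is told to use. */
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, read);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00fc\u00f6\u00e4\u20ac";
        // Both sides agree on UTF-16, so the platform default never comes into play.
        String result = readAll(toStream(text, StandardCharsets.UTF_16), StandardCharsets.UTF_16);
        System.out.println(text.equals(result)); // true
    }
}
```

Because the producer and the consumer name the same charset, the result no longer depends on `file.encoding`.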

@fvgh fvgh changed the title from "Encoding issues with Eclipse WTP HTML format and german special chars" to "Encoding issues with Eclipse WTP HTML format special chars" on Mar 24, 2020
@source-knights
Contributor Author

Thxalot for the quick fix. Now I just need TypeScript checks as a Maven plugin... Will look into that later, maybe I can code it :)

@nedtwigg
Member

nedtwigg commented Apr 2, 2020

Fixed in gradle 3.28.1, maven 1.29.0
