Fix parsing an UTF-8 file without BOM and ISO-8859-1 encoding (#242) #243

belingueres · 2023-03-12T03:51:52Z

Deleted most code handling encoding (leaving that job to the XmlReader
Fixed tests exercising encoding checks. Unsupported tests were skipped
Simplified test-encoding-ISO-8859-1.xml test file

gnodet

This issue has been addressed already. I don't see any real problem in the way the MXParser handles things. In this case, the parser is given a reader with a stream which does not correspond to the encoding in the BOM.

Also, I think would go in the exact opposite direction of recent commits in maven where we removed the indirection the XmlReader layer.

apache/maven@bec19ec

…us-plexus#242) * Deleted most code handling encoding (leaving that job to the XmlReader * Fixed tests exercising encoding checks. Unsupported tests were skipped * Simplified test-encoding-ISO-8859-1.xml test file Skipped even more tests that pass on Linux but fail on Windows.

belingueres · 2023-03-14T03:08:04Z

Hi @gnodet:
The issue is present when the parser's input is not an XmlReader. The MXParser just don't have the information if the file to parse has a BOM or not (the only reliable way to get this information is to open the file as binary, as XmlReader does). So the question is if MXParser should be smart detecting encoding issues, then a) it should not try to check anything and rely encoding issues to XmlReader, or b) Integrate XmlReader functionality into the parser to have the full information.
I supposed it may be the first option what's more pragmatic, but I might be mistaken.
I'm not following the maven dev list so close these days...perhaps the safe thing to do is to put this PR on hold, and apply it only in case the bug start to annoy users.

gnodet

I think it's fine to throw an exception if the xml has a declaration which is incompatible with the encoding used to parse the stream. This should not happen if the stream is given, so the cases are when a Reader is provided or an InputStream with a specific encoding.

If we want to cover those use cases, maybe a different setInput() method which would provide a default encoding, or one that would ignore the encoding provided in the BOM. (this could also be implemented using a custom XmlReader).

gnodet · 2023-03-14T10:48:07Z

src/test/java/org/codehaus/plexus/util/xml/pull/MXParserTest.java

+    /**
+     * Issue 163: https://github.com/codehaus-plexus/plexus-utils/issues/163
+     *
+     * Another case of bug #163: InputStream supplied with the right file encoding.


In this case, the encoding is not the correct one.

gnodet · 2023-03-14T10:48:55Z

src/test/java/org/codehaus/plexus/util/xml/pull/MXParserTest.java

+     * @since 3.5.2
+     */
+    @Test
+    public void testEncodingISO_8859_1_InputStream_encoded() throws IOException {


So in this test, the BOM says ISO-8859-1 but the parser is forced to use UTF-8. Isn't that wrong ?

Same but using an InputStream encoded in UTF-8, instead of a Reader.

gnodet · 2023-03-14T10:50:07Z

src/test/java/org/codehaus/plexus/util/xml/pull/MXParserTest.java

+    /**
+     * Issue 163: https://github.com/codehaus-plexus/plexus-utils/issues/163
+     *
+     * Another case of bug #163: Reader generated with ReaderFactory.newReader and the right file encoding.


The BOM is ISO-8859-1 but the file is decoded with UTF-8

The file is encoded with UTF-8 without BOM, and XML entities are encoded in ISO-8859-1 (this is a valid combination). This is the test that reproduces the bug, which throws an exception as if the file encoding were UTF-8 with BOM (the invalid combination).
So as you said before, even if Maven is not affected by this, the bug in the parser is real.

I understand this is considered a valid combination by the xml spec. But this will produce incorrect results, as the input stream is most probably decoded using UTF-8 instead of ISO-8859-1.
I'm fine with that if that's what is desired. However, I think we can achieve both and I'll work on a fix.

gnodet · 2023-03-14T10:51:43Z

src/test/java/org/codehaus/plexus/util/xml/pull/MXParserTest.java

+        throws IOException
+    {
+        try ( Reader reader =
+            ReaderFactory.newReader( new File( "src/test/resources/xml", "test-encoding-ISO-8859-1.xml" ),


The test is using a simple Reader, so the BOM is completely ignored by the reader, meaning the file will be read with UTF-8 while the BOM indicates it's encoded in ISO-8859-1.

Reproduces the same problem, showing that the Reader created by ReaderFactory.newReader() is affected too.

gnodet · 2023-03-14T10:57:02Z

.../org/codehaus/plexus/util/xml/pull/eduni_misc_Test_BjoernHoehrmannviaHST2013_09_18_Test.java

     */
-    @Test
+    // @Test


I think the test is ok. It shows an error in the code.
The correct way to use the parser would be:

try ( FileInputStream is = new FileInputStream( new File( testResourcesDir, "007.xml" ) ) ) { parser.setInput( is ); ... }

Since the PR deleted the encoding detection inside the MXParser, this test case (part or the standard xml 1.0 test suite) is not supported. That's why is skipped from execution.

gnodet · 2023-03-14T10:58:22Z

.../org/codehaus/plexus/util/xml/pull/eduni_misc_Test_BjoernHoehrmannviaHST2013_09_18_Test.java

-    @Test
-    public void testhst_lhs_008()
+    // @Test
+    public void testhst_lhs_008_newReader()


Same here. The code enforces the usage of UTF-16, so the underlying stream has to be properly encoded.

same explanation as testhst_lhs_007()

* enable testhst_lhs_007, testhst_lhs_008 and testhst_lhs_009 for InputStream * disable those tests on readers, as readers bypass any encoding * do not try to discover the encoding used when the input is given a Reader * add an SIO-8859-1 encoded coment in the test xml (testEncodingISO_8859_1_newReader and testEncodingISO_8859_1_InputStream_encoded tests do decode it wrongly as they use UTF-8)

gnodet · 2023-03-17T18:50:12Z

@belingueres I've pushed a commit on your branch, this keeps the detection for input streams for silently ignore incoherent encoding when a Reader is passed as the input.

belingueres · 2023-03-18T18:50:36Z

@gnodet Thanks for the patch! Changing to strict encoding detection is a bolder change than I was pursuing, but it LGTM.

slachiewicz · 2023-04-07T08:59:16Z

Now we have conflicts here. Needs review

gnodet · 2023-04-07T09:08:39Z

This is superseded with codehaus-plexus/plexus-xml#1

gnodet requested changes Mar 13, 2023

View reviewed changes

belingueres force-pushed the ISSUE-242 branch from cce8bf5 to 831f645 Compare March 13, 2023 16:20

gnodet requested changes Mar 14, 2023

View reviewed changes

gnodet self-requested a review March 27, 2023 08:13

gnodet approved these changes Mar 27, 2023

View reviewed changes

olamy added the bug label Mar 28, 2023

belingueres mentioned this pull request Apr 6, 2023

MXParser error when parsing an UTF-8 file without BOM and ISO-8859-1 element encoding #242

Closed

gnodet closed this Apr 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parsing an UTF-8 file without BOM and ISO-8859-1 encoding (#242) #243

Fix parsing an UTF-8 file without BOM and ISO-8859-1 encoding (#242) #243

belingueres commented Mar 12, 2023

gnodet left a comment

belingueres commented Mar 14, 2023

gnodet left a comment

gnodet Mar 14, 2023

gnodet Mar 14, 2023

belingueres Mar 17, 2023

gnodet Mar 14, 2023

belingueres Mar 17, 2023

gnodet Mar 17, 2023

gnodet Mar 14, 2023

belingueres Mar 17, 2023

gnodet Mar 14, 2023

belingueres Mar 17, 2023

gnodet Mar 14, 2023

belingueres Mar 17, 2023

gnodet commented Mar 17, 2023

belingueres commented Mar 18, 2023

slachiewicz commented Apr 7, 2023

gnodet commented Apr 7, 2023

Fix parsing an UTF-8 file without BOM and ISO-8859-1 encoding (#242) #243

Fix parsing an UTF-8 file without BOM and ISO-8859-1 encoding (#242) #243

Conversation

belingueres commented Mar 12, 2023

gnodet left a comment

Choose a reason for hiding this comment

belingueres commented Mar 14, 2023

gnodet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gnodet commented Mar 17, 2023

belingueres commented Mar 18, 2023

slachiewicz commented Apr 7, 2023

gnodet commented Apr 7, 2023