Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing an UTF-8 file without BOM and ISO-8859-1 encoding (#242) #243

Closed
wants to merge 2 commits into from

Conversation

belingueres
Copy link
Contributor

  • Deleted most code handling encoding (leaving that job to the XmlReader
  • Fixed tests exercising encoding checks. Unsupported tests were skipped
  • Simplified test-encoding-ISO-8859-1.xml test file

Copy link
Member

@gnodet gnodet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue has been addressed already. I don't see any real problem in the way the MXParser handles things. In this case, the parser is given a reader with a stream which does not correspond to the encoding in the BOM.

Also, I think would go in the exact opposite direction of recent commits in maven where we removed the indirection the XmlReader layer.

apache/maven@bec19ec

…us-plexus#242)

* Deleted most code handling encoding (leaving that job to the XmlReader
* Fixed tests exercising encoding checks. Unsupported tests were skipped
* Simplified test-encoding-ISO-8859-1.xml test file

Skipped even more tests that pass on Linux but fail on Windows.
@belingueres
Copy link
Contributor Author

Hi @gnodet:
The issue is present when the parser's input is not an XmlReader. The MXParser just don't have the information if the file to parse has a BOM or not (the only reliable way to get this information is to open the file as binary, as XmlReader does). So the question is if MXParser should be smart detecting encoding issues, then a) it should not try to check anything and rely encoding issues to XmlReader, or b) Integrate XmlReader functionality into the parser to have the full information.
I supposed it may be the first option what's more pragmatic, but I might be mistaken.
I'm not following the maven dev list so close these days...perhaps the safe thing to do is to put this PR on hold, and apply it only in case the bug start to annoy users.

Copy link
Member

@gnodet gnodet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to throw an exception if the xml has a declaration which is incompatible with the encoding used to parse the stream. This should not happen if the stream is given, so the cases are when a Reader is provided or an InputStream with a specific encoding.

If we want to cover those use cases, maybe a different setInput() method which would provide a default encoding, or one that would ignore the encoding provided in the BOM. (this could also be implemented using a custom XmlReader).

/**
* Issue 163: https://github.com/codehaus-plexus/plexus-utils/issues/163
*
* Another case of bug #163: InputStream supplied with the right file encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the encoding is not the correct one.

* @since 3.5.2
*/
@Test
public void testEncodingISO_8859_1_InputStream_encoded() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this test, the BOM says ISO-8859-1 but the parser is forced to use UTF-8. Isn't that wrong ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same but using an InputStream encoded in UTF-8, instead of a Reader.

/**
* Issue 163: https://github.com/codehaus-plexus/plexus-utils/issues/163
*
* Another case of bug #163: Reader generated with ReaderFactory.newReader and the right file encoding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BOM is ISO-8859-1 but the file is decoded with UTF-8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file is encoded with UTF-8 without BOM, and XML entities are encoded in ISO-8859-1 (this is a valid combination). This is the test that reproduces the bug, which throws an exception as if the file encoding were UTF-8 with BOM (the invalid combination).
So as you said before, even if Maven is not affected by this, the bug in the parser is real.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is considered a valid combination by the xml spec. But this will produce incorrect results, as the input stream is most probably decoded using UTF-8 instead of ISO-8859-1.
I'm fine with that if that's what is desired. However, I think we can achieve both and I'll work on a fix.

throws IOException
{
try ( Reader reader =
ReaderFactory.newReader( new File( "src/test/resources/xml", "test-encoding-ISO-8859-1.xml" ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is using a simple Reader, so the BOM is completely ignored by the reader, meaning the file will be read with UTF-8 while the BOM indicates it's encoded in ISO-8859-1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reproduces the same problem, showing that the Reader created by ReaderFactory.newReader() is affected too.

*/
@Test
// @Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test is ok. It shows an error in the code.
The correct way to use the parser would be:

try ( FileInputStream is = new FileInputStream( new File( testResourcesDir, "007.xml" ) ) )
{
    parser.setInput( is );
    ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the PR deleted the encoding detection inside the MXParser, this test case (part or the standard xml 1.0 test suite) is not supported. That's why is skipped from execution.

@Test
public void testhst_lhs_008()
// @Test
public void testhst_lhs_008_newReader()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. The code enforces the usage of UTF-16, so the underlying stream has to be properly encoded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same explanation as testhst_lhs_007()

* enable testhst_lhs_007, testhst_lhs_008 and testhst_lhs_009 for InputStream
* disable those tests on readers, as readers bypass any encoding
* do not try to discover the encoding used when the input is given a Reader
* add an SIO-8859-1 encoded coment in the test xml (testEncodingISO_8859_1_newReader and testEncodingISO_8859_1_InputStream_encoded tests do decode it wrongly as they use UTF-8)
@gnodet
Copy link
Member

gnodet commented Mar 17, 2023

@belingueres I've pushed a commit on your branch, this keeps the detection for input streams for silently ignore incoherent encoding when a Reader is passed as the input.

@belingueres
Copy link
Contributor Author

@gnodet Thanks for the patch! Changing to strict encoding detection is a bolder change than I was pursuing, but it LGTM.

@gnodet gnodet self-requested a review March 27, 2023 08:13
@slachiewicz
Copy link
Member

Now we have conflicts here. Needs review

@gnodet
Copy link
Member

gnodet commented Apr 7, 2023

This is superseded with codehaus-plexus/plexus-xml#1

@gnodet gnodet closed this Apr 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants