Fix parsing an UTF-8 file without BOM and ISO-8859-1 encoding (#242) #243
belingueres commented Mar 12, 2023
- Deleted most code handling encoding (leaving that job to the XmlReader).
- Fixed tests exercising encoding checks. Unsupported tests were skipped.
- Simplified the test-encoding-ISO-8859-1.xml test file.
This issue has been addressed already. I don't see any real problem in the way the MXParser handles things. In this case, the parser is given a reader with a stream which does not correspond to the encoding in the BOM.
Also, I think this would go in the exact opposite direction of recent commits in Maven, where we removed the indirection of the XmlReader layer.
…us-plexus#242)
* Deleted most code handling encoding (leaving that job to the XmlReader)
* Fixed tests exercising encoding checks. Unsupported tests were skipped
* Simplified test-encoding-ISO-8859-1.xml test file

Skipped even more tests that pass on Linux but fail on Windows.
Hi @gnodet:
I think it's fine to throw an exception if the XML has a declaration which is incompatible with the encoding used to parse the stream. This should not happen if the stream is given, so the remaining cases are when a Reader is provided, or an InputStream with a specific encoding.
If we want to cover those use cases, maybe we need a different setInput() method which would provide a default encoding, or one that would ignore the encoding provided in the BOM (this could also be implemented using a custom XmlReader).
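The strict behaviour described in this comment can be sketched with the JDK alone. The class and method names below (DeclaredEncodingCheck, declaredEncoding, requireCompatible) are hypothetical and are not part of plexus-utils or MXParser; the sketch only assumes that the start of the document is available as text, reads the encoding named in the XML declaration, and rejects a mismatch with the charset actually used for decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an MXParser API.
final class DeclaredEncodingCheck {
    // Captures the value of the encoding pseudo-attribute in a leading XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns the declared encoding name, or empty if there is no declaration.
    static Optional<String> declaredEncoding(String xmlHead) {
        Matcher m = ENCODING.matcher(xmlHead);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Throws if the declaration names a charset other than the one used to decode.
    static void requireCompatible(String xmlHead, Charset actual) {
        declaredEncoding(xmlHead).ifPresent(name -> {
            if (!Charset.forName(name).equals(actual)) {
                throw new IllegalArgumentException(
                        "Declared encoding " + name + " does not match " + actual.name());
            }
        });
    }

    public static void main(String[] args) {
        String head = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>";
        requireCompatible(head, StandardCharsets.ISO_8859_1); // accepted
        try {
            requireCompatible(head, StandardCharsets.UTF_8);  // rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A lenient variant, the other option mentioned in the comment, would simply skip the requireCompatible call and trust the caller's charset.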
/**
 * Issue 163: https://github.com/codehaus-plexus/plexus-utils/issues/163
 *
 * Another case of bug #163: InputStream supplied with the right file encoding.
In this case, the encoding is not the correct one.
 * @since 3.5.2
 */
@Test
public void testEncodingISO_8859_1_InputStream_encoded() throws IOException {
So in this test, the BOM says ISO-8859-1 but the parser is forced to use UTF-8. Isn't that wrong?
Same but using an InputStream encoded in UTF-8, instead of a Reader.
/**
 * Issue 163: https://github.com/codehaus-plexus/plexus-utils/issues/163
 *
 * Another case of bug #163: Reader generated with ReaderFactory.newReader and the right file encoding.
The BOM is ISO-8859-1 but the file is decoded with UTF-8.
The file is encoded with UTF-8 without BOM, and XML entities are encoded in ISO-8859-1 (this is a valid combination). This is the test that reproduces the bug, which throws an exception as if the file encoding were UTF-8 with BOM (the invalid combination).
So as you said before, even if Maven is not affected by this, the bug in the parser is real.
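Whether a file carries a UTF-8 byte order mark is a property of its first three bytes, independent of what the XML declaration says. A minimal sketch of that check follows; BomSniffer and hasUtf8Bom are illustrative names, not plexus-utils code.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: detects the UTF-8 byte order mark (EF BB BF) at the start of a buffer.
final class BomSniffer {
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>";
        // Same text, once without and once with a leading U+FEFF.
        byte[] withoutBom = xml.getBytes(StandardCharsets.UTF_8);
        byte[] withBom = ("\uFEFF" + xml).getBytes(StandardCharsets.UTF_8);
        System.out.println(hasUtf8Bom(withoutBom));
        System.out.println(hasUtf8Bom(withBom));
    }
}
```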
I understand this is considered a valid combination by the XML spec. But this will produce incorrect results, as the input stream is most probably decoded using UTF-8 instead of ISO-8859-1.
I'm fine with that if that's what is desired. However, I think we can achieve both and I'll work on a fix.
throws IOException
{
try ( Reader reader =
ReaderFactory.newReader( new File( "src/test/resources/xml", "test-encoding-ISO-8859-1.xml" ),
The test is using a simple Reader, so the BOM is completely ignored by the reader, meaning the file will be read with UTF-8 while the BOM indicates it's encoded in ISO-8859-1.
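The effect described here can be reproduced without the parser. The sketch below uses only JDK classes (ReaderDecodingDemo and decode are illustrative names): a Reader decodes with whatever charset it was constructed with, so ISO-8859-1 bytes read as UTF-8 lose their non-ASCII characters regardless of what the document declares.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: a Reader applies its own charset and never looks at the XML declaration.
final class ReaderDecodingDemo {
    static String decode(byte[] bytes, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00E9</a>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the declared charset round-trips the text.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).equals(xml));
        // Decoding as UTF-8 replaces the invalid byte with U+FFFD.
        System.out.println(decode(latin1, StandardCharsets.UTF_8).contains("\uFFFD"));
    }
}
```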
Reproduces the same problem, showing that the Reader created by ReaderFactory.newReader() is affected too.
 */
@Test
// @Test
I think the test is ok. It shows an error in the code.
The correct way to use the parser would be:
try ( FileInputStream is = new FileInputStream( new File( testResourcesDir, "007.xml" ) ) )
{
parser.setInput( is );
...
}
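The same idea, handing the parser the raw bytes so that it can pick the encoding itself, can be tried with the JDK's built-in StAX parser. Note that this swaps in javax.xml.stream as a stand-in for MXParser, so it illustrates the principle rather than plexus-utils behaviour; StreamInputDemo and firstElementText are illustrative names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: uses the JDK StAX parser, not MXParser.
final class StreamInputDemo {
    // Returns the text content of the first element, letting the parser choose the charset.
    static String firstElementText(InputStream in) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newFactory().createXMLStreamReader(in);
        try {
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                    return xml.getElementText();
                }
            }
            return null;
        } finally {
            xml.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00E9</a>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        // The bytes are ISO-8859-1, matching the declaration, and no charset is passed in.
        System.out.println("\u00E9".equals(firstElementText(new ByteArrayInputStream(latin1))));
    }
}
```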
Since the PR deleted the encoding detection inside the MXParser, this test case (part of the standard XML 1.0 test suite) is not supported. That's why it is skipped from execution.
@Test
public void testhst_lhs_008()
// @Test
public void testhst_lhs_008_newReader()
Same here. The code enforces the usage of UTF-16, so the underlying stream has to be properly encoded.
Same explanation as for testhst_lhs_007().
* enable testhst_lhs_007, testhst_lhs_008 and testhst_lhs_009 for InputStream
* disable those tests on readers, as readers bypass any encoding
* do not try to discover the encoding used when the input is given as a Reader
* add an ISO-8859-1 encoded comment in the test xml (testEncodingISO_8859_1_newReader and testEncodingISO_8859_1_InputStream_encoded tests do decode it wrongly as they use UTF-8)
@belingueres I've pushed a commit on your branch; this keeps the detection for input streams but silently ignores incoherent encoding when a
@gnodet Thanks for the patch! Changing to strict encoding detection is a bolder change than I was pursuing, but it LGTM.
Now we have conflicts here. Needs review.
This is superseded by codehaus-plexus/plexus-xml#1