Whitespace in text tokenized as IGNORABLE_WHITESPACE in XmlReader #241

Open
westnordost opened this issue Oct 2, 2024 · 2 comments

westnordost commented Oct 2, 2024

Given this XML:

<user>dude &amp; &lt;dudette&gt;</user>

I expect to get the following events when I iterate through it via the XmlReader:

  1. START_DOCUMENT
  2. START_ELEMENT localName="user"
  3. TEXT text="dude "
  4. ENTITY_REF text="&"
  5. TEXT text=" "
  6. ENTITY_REF text="<"
  7. TEXT text="dudette"
  8. ENTITY_REF text=">"
  9. END_ELEMENT localName="user"
  10. END_DOCUMENT

However, event 5 is reported as IGNORABLE_WHITESPACE rather than TEXT.

I think this is a bug: this whitespace is part of the element's text content, so it is not ignorable. Whitespace between XML elements, such as the space in <user>abc</user> <id>1234</id>, would be ignorable.

(By the way, the existence of CDSECT and ENTITY_REF was a pitfall (aka footgun) for me. I had assumed that the XmlReader would already deliver the complete text content, i.e. I expected just TEXT text="dude & <dudette>" followed by END_ELEMENT.)
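Until the reader delivers the text in one piece, a caller can concatenate the text-like events itself. The following is only a minimal sketch of that idea in plain Java: `TextCoalescer`, `Kind`, `Event`, and `readText` are hypothetical stand-ins invented for illustration and are not the XmlReader API. It assumes each ENTITY_REF event already carries its resolved text, as in the event list above.

```java
import java.util.List;

// Hypothetical stand-in types for illustration; NOT the library's own classes.
final class TextCoalescer {
    // Event kinds named after the events listed in this issue.
    enum Kind { START_ELEMENT, END_ELEMENT, TEXT, ENTITY_REF, CDSECT, IGNORABLE_WHITESPACE }

    // A kind plus its text (for ENTITY_REF: the resolved character, e.g. "&").
    static final class Event {
        final Kind kind;
        final String text;

        Event(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /**
     * Concatenates all text-like events starting at index `from`
     * until the enclosing element ends.
     */
    static String readText(List<Event> events, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < events.size(); i++) {
            Event e = events.get(i);
            if (e.kind == Kind.END_ELEMENT) {
                return sb.toString();
            }
            if (e.kind == Kind.START_ELEMENT) {
                throw new IllegalStateException("Nested element in text content");
            }
            // TEXT, ENTITY_REF, CDSECT and IGNORABLE_WHITESPACE all contribute text.
            sb.append(e.text);
        }
        throw new IllegalStateException("Missing END_ELEMENT");
    }
}
```

Because the sketch appends IGNORABLE_WHITESPACE as well, it produces the same string whether event 5 arrives as TEXT or as IGNORABLE_WHITESPACE.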


westnordost commented Oct 2, 2024

Note on the "by the way":

Just tested the behavior of org.xmlpull.v1.XmlPullParser:

When XmlPullParser.getEventType() is XmlPullParser.TEXT, XmlPullParser.getText() does return the entire string, i.e. "dude & <dudette>" in the above example, even though event types for CDSECT and ENTITY_REF also exist... 🤔
(The relevant documentation reveals that XmlPullParser effectively has two APIs for iterating through the events, one a bit more low-level than the other.)

pdvrieze added a commit that referenced this issue Nov 23, 2024
never be recorded as ignorable whitespace, even when parsed as separate
parts. This should fix #241.

pdvrieze (Owner) commented

Sorry about the delay. I've just fixed it in dev. The reader will still parse this as separate events, but it will now note that the whitespace is delimited by entities and thus not (attempt to) detect whitespace there, so it will no longer generate ignorable whitespace events for it.
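The rule described in this comment can be sketched roughly as follows. This is an illustration in plain Java under stated assumptions, not the code from the referenced commit: `WhitespaceRule`, `Neighbor`, `isBlank`, and `isIgnorable` are hypothetical names, and the sketch assumes the tokenizer knows what kind of token sits directly before and after a text segment.

```java
// Hypothetical sketch; names and structure are assumptions, not the actual fix.
final class WhitespaceRule {
    // What directly borders a text segment on one side.
    enum Neighbor { ELEMENT_TAG, ENTITY_REF }

    // True if the segment consists only of XML whitespace characters
    // (space, tab, carriage return, line feed).
    static boolean isBlank(String segment) {
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    // A blank segment is a candidate for IGNORABLE_WHITESPACE only when it is
    // not delimited by an entity reference on either side.
    static boolean isIgnorable(String segment, Neighbor before, Neighbor after) {
        if (!isBlank(segment)) {
            return false;
        }
        return before != Neighbor.ENTITY_REF && after != Neighbor.ENTITY_REF;
    }
}
```

Under this sketch, the single space between &amp; and &lt; in the original example is bordered by entity references on both sides and is therefore reported as text, while the space in <user>abc</user> <id>1234</id> is bordered by tags and remains ignorable.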
