-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework handling general entity references (&entity;
)
#766
base: master
Are you sure you want to change the base?
Conversation
da2a802
to
6d9a6ab
Compare
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #766 +/- ##
==========================================
- Coverage 60.21% 60.18% -0.03%
==========================================
Files 41 41
Lines 16021 16409 +388
==========================================
+ Hits 9647 9876 +229
- Misses 6374 6533 +159
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
I finished work on the base part of the entities support. In this PR new |
I won't have a chance to fully review this for a couple of days. Quick question though, am I correct in thinking that this PR will mean that any time a text block contains one or more entity references, instead of the developer receiving one |
Yes, you are correct. But that does not mean that he/she will needed to construct the complete text themselves. In the next PR I plan to rename Because renames affects very many places, I want to do that in a separate PR to reduce noise in PR with new Borrow-only reader-only events will be useful also in that sense that I plan to add Because new |
src/reader/mod.rs
Outdated
Ok((_, false)) => { | ||
// We want to report error at `&`, but offset was increased, | ||
// so return it back (-1 for `&`) | ||
$self.state.last_error_offset = start - 1; | ||
Err(Error::Syntax(SyntaxError::UnclosedReference)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a code in ruffle in ruffle-rs/ruffle#10471 from @Aaron1011 that will be break (custom_unescape
function), because currently dangling &
will always return SyntaxError::UnclosedReference
. I think I should add a new configuration option for this here
b307997
to
51676d2
Compare
@dralley, what do you think about this? |
@@ -24,6 +24,9 @@ XML specification. See the updated `custom_entities` example! | |||
|
|||
- [#766]: Allow to parse resolved entities as XML fragments and stream events from them. | |||
- [#766]: Added new event `Event::GeneralRef` with content of [general entity]. | |||
- [#766]: Added new configuration option `allow_dangling_amp` which allows to have | |||
a `&` not followed by `;` in the textual data which is required for some applications | |||
for compatibility reasons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which applications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant case from #719 here
Sorry for not reviewing this promptly. Once the current set of PRs are merged, can you rebase? |
@dralley, you can review this. Initially I thought not to merge it until I implement other changes (in the follow-up PRs) that reworks the parser to reduce amount of releases with breaking changes, but because we already have enough amount of breaking changes, I think, it can be merged and included in next release. |
00506c0
to
c8fefad
Compare
<!ENTITY msg "hello world" > | ||
]> | ||
<test label="&msg;">&msg;</test> | ||
struct MyReader<'i> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned a future PR that would implement this functionality on Reader
while leaving RawReader
for a lower-level event stream. I presume that at that point this example would basically become trivial, as opposed to... this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point just being that the current example as-is is a bit much to expect people to implement or copy and paste, and I just want to check my understanding that it's not a long-term solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not really decided yet whether to leave this example as it is to demonstrate how the reader stack can be implemented if for some reason the standard solution does not work, or rewrite it to a new API. If it remains as it is, then there will be a mention of the standard way
@@ -54,7 +54,7 @@ fn parse_document_from_str(doc: &str) -> XmlResult<()> { | |||
} | |||
} | |||
Event::Text(e) => { | |||
criterion::black_box(e.unescape()?); | |||
criterion::black_box(e.decode()?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About Reader
returning merged events in the future - does that mean that the .decode()
would no longer serve a functional purpose if there was no "decoding" (in the text encoding, utf-8 etc. sense) to do? (because the entities are expanded already).
And would there be any way to, say, return the original raw unexpanded XML between two tags from that wrapper, or would that be impossible without dropping to the RawReader
level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think, decode
method will gone
And would there be any way to, say, return the original raw unexpanded XML between two tags from that wrapper, or would that be impossible without dropping to the
RawReader
level?
I think, it could be possible to implement that, but only if something requested that. I think, that this is niche feature. The API could be a special method that need to call instead of read_event()
to get an unparsed content.
…onstruction in a text failures (16): serde-de (9): borrow::escaped::element borrow::escaped::top_level resolve::resolve_custom_entity trivial::text::byte_buf trivial::text::bytes trivial::text::string::field trivial::text::string::naked trivial::text::string::text xml_schema_lists::element::text::string serde-migrated (1): test_parse_string serde-se (5): with_root::char_amp with_root::char_gt with_root::char_lt with_root::str_escaped with_root::tuple --doc (1): src\de\resolver.rs - de::resolver::EntityResolver (line 13)
…xpanded entities
Text events produces by the Reader can not contain escaped data anymore, all such data is represented by the Event::GeneralRef
Fixed (18): serde-de (9): borrow::escaped::element borrow::escaped::top_level resolve::resolve_custom_entity trivial::text::byte_buf trivial::text::bytes trivial::text::string::field trivial::text::string::naked trivial::text::string::text xml_schema_lists::element::text::string serde-migrated (1): test_parse_string serde-se (5): with_root::char_amp with_root::char_gt with_root::char_lt with_root::str_escaped with_root::tuple --doc (3): src\de\resolver.rs - de::resolver::EntityResolver (line 13)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable.
Initially I thought not to merge it until I implement other changes (in the follow-up PRs) that reworks the parser to reduce amount of releases with breaking changes, but because we already have enough amount of breaking changes, I think, it can be merged and included in next release.
I personally would prefer to see the Reader
/ RawReader
split happen prior to the next release, or even in this PR in subsequent commits, to reduce the amount of massively breaking API changes going on. I'd rather not split the changes across many releases.
From a purely psychological level I'd be deeply annoyed by having to change code that uses Reader
to handle the existence of references throughout the returned event stream, and then subsequently needing to change it back later to something closer to the original approach. That particular type of churn (needing to make a change as a user and then un-make it later) is best avoided.
Totally agree. Ok, then I'll merge it when follow-up PR will be ready |
This is a big change in handling general entity references and character references. Open PR early to get feedback.
With this changes we can correctly parse document
as equivalent normalized document
<root/>
The updated
custom_entities
example shows how it would be possible to implement requirement from the specification about parsed general entities. Serde deserializer did not updated yet, because this is not trivial part and probably that will be done in another PR.Of course, such change probably makes the performance worse, I didn't measure impact yet.
Closes #667