Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add normalize_line_end for unescape and test #807

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

yorkz1994
Copy link

Regarding #806
I added a normalize_line_end function in escape module and related tests.
If unescape function is called, then line end will be normalized.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add more tests for the actual usage of this function, i.e. test what would be returned when call public API methods of Attribute and events. Now it is hard to ensure that this will not break due to internal refactorings.

Test failures should be evaluated and fixed. Also, issue #670 seems related, it would be great if you can evaluate how this change is applicable to it.

In the end, could you add a changelog entry. Do not forgot that Changelog.md is a usual markdown file, so does not forgot to update section with links -- they are not linked automatically to the GH PRs/issues.

src/escape.rs Outdated Show resolved Hide resolved
@yorkz1994
Copy link
Author

I think the failed test is because we did the line end normalization after the unescape, but unescape can contain \r, so these new \rs should not be normalized. We have to do the normalization only for data in the not unescape parts.

@codecov-commenter
Copy link

codecov-commenter commented Sep 26, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.36%. Comparing base (3ebe221) to head (d3ee1ad).
Report is 3 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #807      +/-   ##
==========================================
+ Coverage   60.17%   60.36%   +0.18%     
==========================================
  Files          41       41              
  Lines       15958    15980      +22     
==========================================
+ Hits         9603     9646      +43     
+ Misses       6355     6334      -21     
Flag Coverage Δ
unittests 60.36% <100.00%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. But I would also like to see the tests when use API of Attribute and BytesText, which are listed here.

Changelog.md Outdated Show resolved Hide resolved
Co-authored-by: Mingun <[email protected]>
@yorkz1994
Copy link
Author

Good. But I would also like to see the tests when use API of Attribute and BytesText, which are listed here.

Sorry, don't know what do you want me to do. I am not familiar with XML specification on attributes. What I did is just do normalization of the line end before your further unescape work. So it removes \r before unescape means it is just like no such character in your input at all, so your previous logic should continue to work. Now all your previous cases passed and the added case for testing normalization also works. I don't see anything I can do.

@Mingun
Copy link
Collaborator

Mingun commented Sep 26, 2024

I like just see tests that will call unescape family of methods on Attribute and BytesText which is get from the reader and check that the result does not contain \r. Something like:

// XML with \n \r\n and \r style newlines in various places
const XML: &str = "...";

let mut reader = Reader::from_str(XML);

match reader.read_event().unwrap() {
  Event::Start(event) => {
    let iter = event.attributes();
    let a = iter.next().unwrap();
    #[cfg(not(feature = "encoding"))]
    assert_eq!(a.unescape_value(), "...");
    assert_eq!(a.decode_and_unescape_value(), "...");
  }
  event => panic!("Expected Start, found {:?}", event),
}

match reader.read_event().unwrap() {
  Event::Text(event) => assert_eq!(event.unescape(), "..."),
  event => panic!("Expected Text, found {:?}", event),
}

@yorkz1994
Copy link
Author

Like I said, I am not familiar with XML specification, so I don't know where is the proper places to put \r in it and what should be expected after unescape if one appears. For example is it valid to put a \r in a attribute key or value, in tag name? If you know better than me how about you try it yourself.

@Mingun
Copy link
Collaborator

Mingun commented Sep 27, 2024

Just add it everywhere where spaces can occur, we are not not talking about correctness for now (this is another question, we are definitely do not process everything according to the specification). I only want to have a starting point and be sure that this feature worked as assumed when you would use actual API.

@yorkz1994
Copy link
Author

Could you provide such input data. I don't want to create such invalid XML. Normally I can think is that the \r will only appear in BytesText due to line end diffs in OS.

@Mingun
Copy link
Collaborator

Mingun commented Sep 27, 2024

<root attribute="\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>

@yorkz1994
Copy link
Author

yorkz1994 commented Sep 27, 2024

f970370
This commit is due to forgot to normalize if input has nothing to unescape.

I add a case from your input, don't know the result is your expected:

#[test]
fn line_ends() {
    const XML: &str = "<root attribute=\"\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n\">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>";
    let mut reader = Reader::from_str(XML);
    match reader.read_event().unwrap() {
        Event::Start(event) => {
            let mut iter = event.attributes();
            let a = iter.next().unwrap().unwrap();
            #[cfg(not(feature = "encoding"))]
            assert_eq!(
                a.unescape_value().unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
            assert_eq!(
                a.decode_and_unescape_value(reader.decoder()).unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
        }
        event => panic!("Expected Start, found {:?}", event),
    }
    match reader.read_event().unwrap() {
        Event::Text(event) => {
            assert_eq!(event.unescape().unwrap(), "\n\n\nvalue3\n\n\nvalue4\n\n\n")
        }
        event => panic!("Expected Text, found {:?}", event),
    }
}

There is a case in serde-se.rs always fail. Don't know if we should modify serde implementation?

serialize_as!(tuple:
        // Use to_string() to get owned type that is required for deserialization
        ("<\"&'>".to_string(), "with\t\r\n spaces", 3usize)
        => "<root>&lt;\"&amp;'&gt;</root>\
            <root>with\t\r\n spaces</root>\
            <root>3</root>");
---- with_root::tuple stdout ----
thread 'with_root::tuple' panicked at tests/serde-se.rs:1956:5:
deserialization roundtrip: Custom("invalid type: string \"with\\t\\n spaces\", expected a borrowed string")

@yorkz1994
Copy link
Author

The failed test is because there is \r in the roundtrip test. It is impossible to have equal serialize and deserialize due to line end normalization. Therefore I have to remove the \r in roundtrip test to get the case pass.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add a case from your input, don't know the result is your expected:

Yes, I think, that it is expected results. According to the XML specification XML processor will never see raw \r, it could appeared in data only if would be escaped as a character reference (note: XML specification names each "XML source" (for example "XML file") as "entity". XML is composed from several "entities" -- main document itself and any other documents loaded via DTD definitions).

Adding a link to https://www.w3.org/TR/xml11/#sec-line-ends somewhere near to the code that handles normalization also would be useful. Well, actually, it describes the 5 combinations of LOF characters, that should be normalized to \n:

  • \r\n
  • \r\u0085 (UTF-8: \r\xC2\x85)
  • \r
  • \u0085 (UTF-8: \xC2\x85)
  • \u2028 (UTF-8: \xE2\x80\xA8)

Strictly speaking, that means that we should look for the 5 bytes if we would like to unescape and normalize line ends in one loop: &, ;, \r, \xC2 (or \x85), \xE2 (or \xA8). Unfortunately, memchr supports searching only 3 bytes at once. But at least we can (and should, otherwise we get two line ends instead of one if user tried to normalize other line ends that we do not process) process \r\xC2\x85 in the same way as \r\n.

If it turns out that unescape and line end normalization cannot be combined in one loop, then we should normalize all possible line ends because that require searching only 3 different bytes, which memchr is supporting.

Actually, maybe using some bit tricks, like describing in this article it would be possible to work around memchr constraint, but I'm not experienced enough right now to apply them. Maybe you?

Therefore I have to remove the \r in roundtrip test to get the case pass.

That's fine, but probably should add a comment, that \r excluded from whitespace characters because we performs line end normalization.

src/escape.rs Outdated
@@ -266,7 +266,7 @@ where
unescaped = Some(String::with_capacity(raw.len()));
}
let unescaped = unescaped.as_mut().expect("initialized");
unescaped.push_str(&raw[last_end..start]);
unescaped.push_str(&normalize_line_end(&raw[last_end..start]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Escaping may be on the hot path, so it is not fine that we can allocate new owned strings just to get a reference and put it to already allocated string. Would it possible instead use already managed Cow and pass it to the normalize_line_end? Then upgrading Cow from borrowed to owned could be performed either in unescape, or in normalize_line_end and that upgrade always will be performed only once.

Also, do we really need two loops? memchr can search for three bytes at once and we search for three bytes: &, ; and \r. So it could be possible to perform unescaping and line end normalization in one loop.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to do it. What if there is nothing to unescape? You still need to call normalize_line_end for such input in the end, then what is the correct function parameter for normalize_line_end so that it can handle both situations?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, don't understand what the trouble? unescape will always be called when we request data via API methods. If cycling search of &, ;, or \r will not find anything, than it is really nothing to change in the data (I assume that cycles for unescaping and normalization merged in one).

Even if did not merge loops, normalize_line_end can accept &mut Cow<str> and change it instead of creating an own Cow by taking reference to &str and returning its own Cow<str>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am talking about the not merge loops, as you point out that there are more line end situations to consider, it is beyond my knowledge to do SIMD stuff myself to merge the loops.

Current function signature is: fn normalize_line_end(input: &str) -> Cow<str>
Change to what? How to call it?

Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.

Copy link
Collaborator

@Mingun Mingun Sep 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, read my &mut Cow<str> as &mut Option<String>. So the whole code will look like:

fn unescape_with(...) {
  let mut unescaped = None;
  for ... {
    ...
    if need to unescape {
      if unescaped.is_none() {
        unescaped = Some(String::with_capacity(raw.len()));
      }
      let unescaped = unescaped.as_mut().expect("initialized");
    }
    ...
    normalize_line_end(&mut unescaped, raw.len());
    ...
  }
}

fn normalize_line_end(input: &mut Option<String>, raw_len: usize) {
  if need to normalize {
    if input.is_none() {
       input = Some(String::with_capacity(raw_len))
    }
    let normalized = input.as_mut().expect("initialized");
    ...
  }
}

it is beyond my knowledge to do SIMD stuff myself to merge the loops.

Actually, SIMD and merging loops are two separate conception, it is not necessary to explicitly apply SIMD techniques if you not aware of them, it's totally fine. I only tried to be helpful if you aware of it.

Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.

The requirement in only is not to make things worse. First of all we should not make it impossible to correctly process input (that is why we should at least process \r\u0085 because if we turn it into \n\u0085, the user will see two newlines in data instead of one, because specification allows \n, \r\u0085 and \u0085 be served as (one) newline characters). The second, we should to try that without much impact to performance, so if we can avoid unnecessary allocations we should do that.

If you're having a hard time figuring out how to get there, that's fine, it could be so. Thanks for opening PR anyway. I will consider it later when I return to active phase on the project (currently working on other project) and see how to improve it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for showing the code.
Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?
Then you still have to call normlize_line_end to normalize the original input. How to you put original input raw str as parameter to your current unescape function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?

Unescape function decides where it is need to unescape something or not. It itself called always (well, because user requests unescaped and normalized data by demand, it will be called by demand, but that "demand" means "unesсape [and normalize] if there is anything")

How to you put original input raw str as parameter to your current unescape function?

Cannot understand what confuses you here. Calls to unescape will not be changed and it is already called whenever necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn unescape_with<'input, 'entity, F>(
    raw: &'input str,
    mut resolve_entity: F,
) -> Result<Cow<'input, str>, EscapeError>
where
    // the lifetime of the output comes from a capture or is `'static`
    F: FnMut(&str) -> Option<&'entity str>,
{

Nothing to escape means there is no & and ; in raw input. It will not go inside while below. So normalize will not happen to input here.

while let Some(start) = iter.by_ref().find(|p| bytes[*p] == b'&') {
        match iter.next() {
            Some(end) if bytes[end] == b';' => {
                // append valid data
                if unescaped.is_none() {
                    unescaped = Some(String::with_capacity(raw.len()));
                }
                let unescaped = unescaped.as_mut().expect("initialized");
                unescaped.push_str(&normalize_line_end(&raw[last_end..start]));

                // search for character correctness
                let pat = &raw[start + 1..end];
                if let Some(entity) = pat.strip_prefix('#') {
                    let codepoint = parse_number(entity).map_err(EscapeError::InvalidCharRef)?;
                    unescaped.push_str(codepoint.encode_utf8(&mut [0u8; 4]));
                } else if let Some(value) = resolve_entity(pat) {
                    unescaped.push_str(value);
                } else {
                    return Err(EscapeError::UnrecognizedEntity(
                        start + 1..end,
                        pat.to_string(),
                    ));
                }

                last_end = end + 1;
            }
            _ => return Err(EscapeError::UnterminatedEntity(start..raw.len())),
        }
    }

Will not go inside here either because nothing has been unescaped

if let Some(mut unescaped) = unescaped {    
    if let Some(raw) = raw.get(last_end..) {
        unescaped.push_str(&normalize_line_end(raw));
    }
    Ok(Cow::Owned(unescaped))
} else {

Must run normalize here, so how to call your changed function fn normalize_line_end(input: &mut Option<String>, raw_len: usize) here with raw as parameter?

        Ok(normalize_line_end(raw))
    }
}

tests/reader-attributes.rs Outdated Show resolved Hide resolved
tests/reader-attributes.rs Outdated Show resolved Hide resolved
Comment on lines 4 to 7
use quick_xml::events::{
BytesEnd,
Event::{self, *},
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be reverted, because of changes below.

src/escape.rs Outdated Show resolved Hide resolved
@yorkz1994
Copy link
Author

I did more line ends normalization.
Also changed the normalization implementation to use iterator to avoid extra allocation during normalization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants