Document a couple of recommended patterns of usage #269

Open
dralley opened this issue Mar 1, 2021 · 4 comments
Labels: documentation (issues about improvements or bugs in documentation), help wanted

Comments

Collaborator

dralley commented Mar 1, 2021

Hi, I've started using this library for a personal project, and I've found that it's difficult to figure out how my code should be structured. I think it would be great if there were some docs that were a little more prescriptive about certain patterns you can use to accomplish certain goals (such as when you might want to use a state machine) or to provide clean abstractions in a larger non-trivial codebase.

One example: A pattern like this is really great for parsing nested objects using nested readers. (The provided nested reader example uses no abstractions - if you try to make it more sophisticated than it already is it would get messy very quickly).

Another example could be the state machine pattern used in this blog post: https://usethe.computer/posts/14-xmhell.html. The issue68 example is somewhat similar but the namespaces make it more difficult to understand what the general case might look like.

If you're not keen on putting too much detail in the quick-xml docs, then maybe just linking to a few projects / blog posts which use quick-xml "well" would be a good idea, or explain some of the general principles.

A sidenote:

Nearly every project I've looked at has some kind of implementation of get_element_text or get_attribute (#146) or write_text_element. I actually think it might be a good idea to include them in the library outright, but otherwise, showing some basic helpers like these in the examples would be great as well.
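
To make the sidenote concrete, here is a minimal sketch of what a `write_text_element` helper tends to look like. It is an illustration only: it writes to a plain `std::io::Write` rather than going through quick-xml's `Writer`, and both the signature and the `escape_text` helper are hypothetical, not part of the library.

```rust
use std::io::{self, Write};

/// Escapes `&`, `<` and `>`, which is enough for text content
/// (attribute values would also need their quotes escaped).
fn escape_text(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

/// Hypothetical helper: writes `<name>text</name>` to any `io::Write` sink.
/// `name` is assumed to already be a valid element name.
fn write_text_element<W: Write>(w: &mut W, name: &str, text: &str) -> io::Result<()> {
    write!(w, "<{}>{}</{}>", name, escape_text(text), name)
}
```

Called as `write_text_element(&mut out, "title", "a < b")`, this would emit the start tag, the escaped text and the end tag in one step, which is the three-event sequence most projects end up wrapping by hand.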

Owner

tafia commented Mar 4, 2021

I agree, more documentation is always better. I am not sure I'll find time to write it soon, but here is a sketch:

  1. small xml / performance not critical / xml "simple" enough => serde
  2. the "items" are simple and not too nested => simple function with state machine
use quick_xml::events::{BytesStart, Event};
use quick_xml::{Error, Reader};
use std::io::BufRead;

// Written against the quick-xml API as of early 2021; later releases changed these signatures.
fn parse_items<R: BufRead>(
    reader: &mut Reader<R>,
) -> Result<Vec<(String, String, Vec<String>)>, Error> {
    #[derive(Debug)]
    enum State {
        Start,
        Level0,
        Level1(String),
        Level2(String, String, Vec<String>),
    }

    // Returns the unescaped value of attribute `name`, or an empty string if it is absent.
    // A nested fn cannot use the outer `R`, so it declares its own type parameter.
    fn att_to_string<B: BufRead>(
        reader: &Reader<B>,
        event: &BytesStart,
        name: &[u8],
    ) -> Result<String, Error> {
        for a in event.attributes() {
            let a = a?;
            if a.key == name {
                return a.unescape_and_decode_value(reader);
            }
        }
        Ok(String::new())
    }

    let mut items = Vec::new();
    let mut state = State::Start;
    let mut buf = Vec::new();
    let mut txt_buf = Vec::new();

    loop {
        // Match on the current state and the next event together.
        state = match (state, reader.read_event(&mut buf)?) {
            (State::Start, Event::Start(e)) if e.name() == b"level0" => State::Level0,
            (State::Level0, Event::Start(e)) if e.name() == b"level1" => {
                State::Level1(att_to_string(reader, &e, b"attr1")?)
            }
            (State::Level1(att1), Event::Start(e)) if e.name() == b"level2" => {
                State::Level2(att1, att_to_string(reader, &e, b"attr2")?, Vec::new())
            }
            (State::Level2(att1, att2, mut lev3), Event::Start(e)) if e.name() == b"level3" => {
                lev3.push(reader.read_text(b"level3", &mut txt_buf)?);
                txt_buf.clear();
                State::Level2(att1, att2, lev3)
            }
            (State::Level2(att1, att2, lev3), Event::End(e)) if e.name() == b"level2" => {
                items.push((att1.clone(), att2, lev3)); // flatten level1
                State::Level1(att1)
            }
            (State::Level1(_), Event::End(e)) if e.name() == b"level1" => State::Level0,
            (State::Level0, Event::End(e)) if e.name() == b"level0" => return Ok(items),
            (state, Event::Eof) => return Err(Error::UnexpectedEof(format!("{:?}", state))),
            // Any other event leaves the state unchanged.
            (state, _) => state,
        };
        buf.clear();
    }
}
  3. Else => state machine split into many functions, as in your example

In terms of occurrence I believe 1 >> 2 >> 3.
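
The last option above, a state machine split into one function per nesting level, could look roughly like the sketch below. To keep it runnable without the crate, the `Ev` enum and the iterator of events are stand-ins for what the reader would produce; the element names (`items`, `entry`, `name`, `tag`) and the `Entry` type are made up for illustration.

```rust
/// Stand-in for the reader's events so the sketch runs without quick-xml.
#[derive(Debug, Clone, PartialEq)]
enum Ev {
    Start(&'static str),
    Text(&'static str),
    End(&'static str),
}

#[derive(Debug, PartialEq)]
struct Entry {
    name: String,
    tags: Vec<String>,
}

/// Collects text up to the matching end tag.
fn parse_text(events: &mut impl Iterator<Item = Ev>, end: &str) -> Result<String, String> {
    let mut text = String::new();
    loop {
        match events.next() {
            Some(Ev::Text(t)) => text.push_str(t),
            Some(Ev::End(n)) if n == end => return Ok(text),
            other => return Err(format!("unexpected {:?} inside <{}>", other, end)),
        }
    }
}

/// Handles everything between <entry> and </entry>.
fn parse_entry(events: &mut impl Iterator<Item = Ev>) -> Result<Entry, String> {
    let mut entry = Entry { name: String::new(), tags: Vec::new() };
    loop {
        match events.next() {
            Some(Ev::Start("name")) => entry.name = parse_text(events, "name")?,
            Some(Ev::Start("tag")) => entry.tags.push(parse_text(events, "tag")?),
            Some(Ev::End("entry")) => return Ok(entry),
            other => return Err(format!("unexpected {:?} inside <entry>", other)),
        }
    }
}

/// Handles the root <items> element and delegates each child to `parse_entry`.
fn parse_items(events: &mut impl Iterator<Item = Ev>) -> Result<Vec<Entry>, String> {
    match events.next() {
        Some(Ev::Start("items")) => {}
        other => return Err(format!("expected <items>, got {:?}", other)),
    }
    let mut entries = Vec::new();
    loop {
        match events.next() {
            Some(Ev::Start("entry")) => entries.push(parse_entry(events)?),
            Some(Ev::End("items")) => return Ok(entries),
            other => return Err(format!("unexpected {:?} inside <items>", other)),
        }
    }
}
```

Each function only knows about its own level, so the "state" is effectively the call stack. The trade-off against the single-function version is that the expected state/event pairs are no longer visible in one `match`.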

Thank you also for the sidenote, these functions are indeed very common and we would benefit from having them implemented by default.

Collaborator Author

dralley commented Mar 4, 2021

Thanks, that is helpful! What about quick-xml without a state machine, just nested readers? I've seen a couple of projects doing it, and it's the way my code is written atm, but are there downsides? I haven't gotten around to strict validation or anything like that yet, if that is where it becomes helpful.

https://github.com/dralley/rpmrepo_rs/blob/master/src/metadata/repomd.rs#L235-L319

@tafia
Copy link
Owner

tafia commented Mar 11, 2021

Nested readers are good when there are really a lot of levels.

I find them more complicated than simple state machines, but this is subjective (matching the state and the event at once really shows what we're expecting). One potential drawback of nested parsers is that it is hard to reuse the same buf, so in some cases you may need to allocate large chunks over and over (tags_buf in your example is created many times).
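
One way to soften the buffer drawback is to let the outermost function own a single scratch buffer and lend it to every nested call. The sketch below shows only that ownership pattern: it reads `;`-separated fields from any `BufRead` as a stand-in for XML events, and the function names are hypothetical.

```rust
use std::io::{self, BufRead};

/// Inner parser: borrows the caller's scratch buffer instead of allocating its own.
fn read_field<R: BufRead>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<Option<String>> {
    buf.clear(); // drops the old contents but keeps the capacity
    if reader.read_until(b';', buf)? == 0 {
        return Ok(None); // end of input
    }
    if buf.last() == Some(&b';') {
        buf.pop(); // strip the delimiter
    }
    Ok(Some(String::from_utf8_lossy(buf).into_owned()))
}

/// Outer parser: owns one buffer and lends it to each nested call.
fn read_fields<R: BufRead>(reader: &mut R) -> io::Result<Vec<String>> {
    let mut buf = Vec::new(); // allocated once, not once per field
    let mut fields = Vec::new();
    while let Some(field) = read_field(reader, &mut buf)? {
        fields.push(field);
    }
    Ok(fields)
}
```

With nested readers the same idea means passing `&mut Vec<u8>` down as a parameter, which works as long as no event borrowed from that buffer is still alive when the nested function is called.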

Mingun added the documentation and help wanted labels on May 21, 2022
Mingun pinned this issue on Oct 29, 2022
Contributor

phdavis1027 commented Feb 27, 2024

I am interested in your comment that performance-intensive code would be better served by the Reader/Writer APIs than by Serde. I have been using Serde for speed of development but am coming to realize that I should probably use these lower-level tools instead (the objects themselves are fairly simple, but it's performance critical). However, I have a very large number of things that have to be parsed (the full protocol includes probably ~100). Do you have recommendations for implementing these parse_*/decode_* functions reusably, in a way that minimizes boilerplate, or will I just have to buckle down and hand-parse everything?

Additionally, does something spring to mind for good patterns when writing with nested elements?


4 participants