Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make AST node content public outside of the crate #175

Closed
wants to merge 2 commits into from

Conversation

runebaas
Copy link

@runebaas runebaas commented Mar 9, 2021

I'm currently building a static site generator, for some representational stuff i'm parsing the headers out of the document. Unfortunately the content on the AST node is internal to the crate, which prevents me from getting the data i need.

This change, making it public, would be much appreciated, but regardless if you decide to or not, thank you for maintaining the crate.

@kivikakk
Copy link
Owner

kivikakk commented Mar 9, 2021

I'm open to accepting changes like this, but I'd like to understand a little more what the motivation for this is.

Normally after a parse all Ast.contents fields will be empty -- their content is usually parsed out completely into NodeValue's payloads (e.g. the Vec<u8> payload on NodeValue::Text).

iirc, after finalize there should be nothing left in this field. Are you working with a partial parse ..?

@runebaas
Copy link
Author

This is what i'd like to do

fn get_document_title(document: &str) -> Result<String, Utf8Error> {
    let arena = Arena::new();
    let root = parse_document(&arena, document, &ComrakOptions::default());

    for node in root.children() {
        let value = node.data.clone().into_inner().value;
        let header = match value {
            Heading(c) => c,
            _ => continue,
        };

        if header.level == 1 {
            let data = node.data.clone().into_inner();
            // `data.content` is private
            let title = match std::str::from_utf8(&data.content) {
                Ok(title) => title,
                Err(e) => return Err(e)
            };
            
            return Ok(title.to_string());
        }
    }

    Ok("Untitled Document".to_string())
}

(i know, the code is hacky)

I don't entirely understand how to do this differently

@kivikakk
Copy link
Owner

Ah, I see. You'll want to pull the data from the Text node within the header, per the AST. The content field is not public because it's not intended to be a reliable source of data; it's for use by the parsing engine internally, and there won't be any guarantees made that it won't be removed or modified in some cases.

Trying to make that guarantee, which making it pub would do, would constrict the parser in future. Specifically, content maps one-to-one with the cmark_node content field in the reference CommonMark implementation, which similarly has no guarantees. Indeed, it's been removed entirely as of Jan 2020, so if the spec changed enough that I'd need to bring Comrak up-to-date with the reference implementation, it would vanish here too. As such, I'd like to continue not exposing these implementation details.

The solution is to collect the text from the embedded Text nodes, kind of like how the HTML formatter does it here:

comrak/src/html.rs

Lines 370 to 382 in 374c174

fn collect_text<'a>(&self, node: &'a AstNode<'a>, output: &mut Vec<u8>) {
match node.data.borrow().value {
NodeValue::Text(ref literal) | NodeValue::Code(ref literal) => {
output.extend_from_slice(literal)
}
NodeValue::LineBreak | NodeValue::SoftBreak => output.push(b' '),
_ => {
for n in node.children() {
self.collect_text(n, output);
}
}
}
}

I'd accept a PR that makes that public and usable outside the HTML formatter, but you can also get the effect yourself by reproducing it. Here's a full example:

extern crate comrak;

use comrak::{Arena, parse_document, ComrakOptions, nodes::{AstNode, NodeValue}};

fn main() {
    println!("{:?}", get_document_title("# Hello\n"));
    println!("{:?}", get_document_title("## Hello\n"));
    println!("{:?}", get_document_title("# `hi` **there**\n"));
}

fn get_document_title(document: &str) -> String {
    let arena = Arena::new();
    let root = parse_document(&arena, document, &ComrakOptions::default());

    for node in root.children() {
        let header = match node.data.clone().into_inner().value {
            NodeValue::Heading(c) => c,
            _ => continue,
        };

        if header.level != 1 {
            continue;
        }

        let mut text = Vec::new();
        collect_text(node, &mut text);

        // The input was already known good UTF-8 (document: &str) so comrak
        // guarantees the output will be too.
        return String::from_utf8(text).unwrap();
    }

    "Untitled Document".to_string()
}

fn collect_text<'a>(node: &'a AstNode<'a>, output: &mut Vec<u8>) {
    match node.data.borrow().value {
        NodeValue::Text(ref literal) | NodeValue::Code(ref literal) => {
            output.extend_from_slice(literal)
        }
        NodeValue::LineBreak | NodeValue::SoftBreak => output.push(b' '),
        _ => {
            for n in node.children() {
                collect_text(n, output);
            }
        }
    }
}

And the output:

"Hello"
"Untitled Document"
"hi there"

I hope this helps!

@runebaas
Copy link
Author

That helps a lot, thank you very much for the detailed explanation! 😄

From what you explained, it doesn't make sense to make the content field public, so I'll close this PR.

@runebaas runebaas closed this Mar 17, 2021
@kivikakk
Copy link
Owner

@runebaas Thank you very much for your contribution. :heart.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants