Incude line and column info from the original file #89

lazarljubenovic · 2018-03-30T21:05:39Z

Thanks for this great library!

It would be nice if parsed nodes contained information about the start and end of themselves.
Every node could have an additional property pos, something along the lines of the following.

interface Node {
  type: string,
  pos: {
    start: { line: number, col: number },
    end: { line: number, col: number },
  },
}

The text was updated successfully, but these errors were encountered:

andrejewski · 2018-03-31T01:01:10Z

@lazarljubenovic Thanks for making this issue. Having column/line numbers would definitely play to Himalaya's strength in being an accurate-representation parser.

I don't think implementing this would be too hard, it's two more things to book-keep during parsing. My question is for the Element node, should we be granular enough to provide:

interface Element {
  ...,
  startTagPosition: {
    start: { line: number, col: number },
    end: { line: number, col: number },
  },
  endTagPosition: {
    start: { line: number, col: number },
    end: { line: number, col: number },
  }
}

Along with the Node.position.

This is <i>sample markup</i>.
^      ^^ ^^           ^^  ^^
a      ab bc           cd  de

I cannot say I'll get to this immediately, but it is worth adding.

lazarljubenovic · 2018-03-31T10:51:37Z

Hmm. Interesting question. If we add that, then one might expect to do a similar thing for attributes' key-value pairs. In which case it's not obvious how to handle the tag itself: do we need a location for the tag name only? Are "s (which are optional) part of the attribute's value? What if there's no closing tag?

Curently Himalaya doesn't preserve any info about the way HTML was written (did you include a closing tag? did you wrap attribute value in quotes?), so it's a bit tricky to see how to do this as this is info about the original file, not the compiled JSON (and a first of such kind we're introducing here). There's no other convention established yet to compare it to.

Having the above in mind, this is what I suggest. (Lowercase letters are the "start" markers and their corresponding uppercase letter is the "end" marker. They are on separate lines to allow multiple markers on the same character.)

This is <i>sample markup</i>!
^      ^^  ^           ^   ^^
a      Ab  c           C   Bd
                            D

[A] - Text with content "This is"
[B] - Element with tag "i", no attributes and a single child [C]
[c] - Text with content "sample markup"
[D] - Text with content "!"

This <div>is <span>sample markup</div>
^   ^^    ^ ^^     ^           ^     ^
a   Ab    c Cd     e           D     B
                               E

[A] - Text with content "This "
[B] - Element with tag "div", no attributes and children [C] and [D]
[C] - Text with content "is "
[D] - Element with tag "span", no attributes and a single child [E]
[E] - Text with content "sample markup"

This <div class=foo tabindex="-1">is sample markup
^   ^^    ^       ^ ^           ^ ^              ^
a   Ab    d       D e           E c              B
                                                 C

[A] - Text with content "This "
[B] - Element with tag "div", attributes [D] and [E] and a single child [C]
[C] - Text with content "is sample markup"
[D] - Attribute with key "class" and value "foo"
[E] - Attribute with key "tabindex" and value "-1"

In my opinion, making it more sophisticated than this would be peeking into the lexer. Maybe it will be useful in the future to add all lexical nodes which participated in creating a parsed node, and see additional info there.

It's just my two cents, of course.

lazarljubenovic · 2018-11-30T20:14:32Z

Hey, thanks a lot for adding this! 🎉 Sorry for the super-late reply 😓

andrejewski mentioned this issue Apr 5, 2018

Add line, column, and index numbers to nodes #90

Merged

andrejewski closed this as completed in #90 Apr 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incude line and column info from the original file #89

Incude line and column info from the original file #89

lazarljubenovic commented Mar 30, 2018

andrejewski commented Mar 31, 2018

lazarljubenovic commented Mar 31, 2018

lazarljubenovic commented Nov 30, 2018

Incude line and column info from the original file #89

Incude line and column info from the original file #89

Comments

lazarljubenovic commented Mar 30, 2018

andrejewski commented Mar 31, 2018

lazarljubenovic commented Mar 31, 2018

lazarljubenovic commented Nov 30, 2018