🐰rabbit

An interpreted language written in Go: an XPath 3.1 implementation for HTML

XML Path Language (XPath) 3.1 has been a W3C Recommendation since 21 March 2017. The Rabbit language is built for selecting HTML nodes with XPath syntax.

Overview

The Rabbit language is built for HTML, not XML. Because XPath 3.1 targets XML, not every concept listed in https://www.w3.org/TR/xpath-31/ could be implemented. In most cases, though, the Rabbit language is sufficient for selecting HTML nodes.

For example:

  • //a
  • //div[@category='web']/preceding::node()[2]
  • let $abc := ('a', 'b', 'c') return fn:insert-before($abc, 4, 'z')

Basic Usage

// You can chain calls on the xpath object. data is nil or []string.
data := rabbit.New().SetDoc("uri/or/filepath.txt").Eval("//a").GetAll()
// If you expect the evaluated result to be a sequence of HTML nodes,
// use NodeAll() instead of DataAll() or GetAll().
nodes := rabbit.New().SetDoc("uri/or/filepath.txt").Eval("//a").NodeAll()
// With error checks.
x := rabbit.New()
x.SetDoc("uri/or/filepath.txt")
if len(x.Errors()) > 0 {
  // ... handle the errors (x.Errors() returns []error)
}
x.Eval("//a")
if len(x.Errors()) > 0 {
  // ... handle the errors
}
data = x.DataAll()
// Without SetDoc. Since no document is set in the context,
// node-related XPath expressions will not work.
x = rabbit.New()
result := x.Eval("1+1").Data()
// You can test simple XPath expressions with the CLI program.
rabbit.New().SetDoc("uri/or/filepath.txt").CLI()

Features

What is supported

  1. Primary Expressions
    • Integer(1)
    • Decimal(1.1)
    • Double(1e1)
    • String("")
    • Boolean(true, false)
    • Variable($var)
    • Context Item(.)
    • Placeholder(?)
  2. Functions
    • Named Function(built in function - bif)
    • Inline Function(custom function)
    • Map
    • Array
    • Arrow operator(=>)
    • Simple Map Operator(!)
  3. Path Expressions
    • Forward Step(child::, descendant::, ...)
    • Reverse Step(parent::, ...)
    • Node Test
    • Predicate([])
    • Abbreviated Syntax(@, ..)
  4. Sequence Expressions(())
  5. Arithmetic Expressions
    • Additive(+, -)
    • Multiplicative(*, div, idiv, mod)
    • Unary(+, -)
  6. String Concatenation Expressions(||)
  7. Comparison Expressions
    • Value Compare(eq, ne, lt, le, gt, ge)
    • Node Compare(is, <<, >>)
    • General Compare(=, !=, <, <=, >, >=)
  8. Logical Expressions(and, or)
  9. For Expressions(for)
  10. Let Expressions(let)
  11. Conditional Expressions(if)
  12. Quantified Expressions(some, every)
  13. Lookup(?)

What is not supported

  1. Namespace
    The Rabbit language ignores prefixed tag names and xmlns attributes in tags. An xmlns attribute is not treated as a namespace node, and a prefixed tag raises no error when the document declares no namespace for its prefix.

  2. Limited Types
    The XPath data model defines many data types; the full list is at https://www.w3.org/TR/xpath-datamodel-31/. Many of these types are not supported in the Rabbit language, and most of the supported ones are simplified to strings. Implementing every data type makes little sense because HTML has no equivalent of XML Schema Definition (XSD).

  3. Limited KindTest
    The XPath 3.1 specification defines 10 kinds of KindTest. The namespace-node, processing-instruction, schema-attribute, and schema-element tests are not supported in the Rabbit language because the parsing engine (/x/net/html) does not recognize them.

  4. Sequence Type Check
    In XPath 3.1, you can specify data types in an inline function, like this: function($a as xs:string) as xs:string {$a}. This syntax is not part of the Rabbit language. An inline function should look like this: function($a) {$a}.

  5. Node Test with Argument
    Node tests with arguments are not supported. For example, element(person), element(person, surgeon), element(*, surgeon), attribute(price), and attribute(*, xs:decimal) are not allowed. You can still use element() and attribute().

  6. Wildcard Expressions
    Only the * wildcard is allowed in the Rabbit language. NCName:*, *:NCName, and BracedURILiteral* are not supported because the Rabbit language ignores namespaces.

Notice

Attribute node is a custom *html.Node type

The Rabbit language supports attribute nodes. However, the /x/net/html package has no such type (it only has 6 kinds of nodes) and treats an attribute as a field of an element node. So, to represent an attribute as a node, I had to create a custom *html.Node type. It has the following fields.

  • Type: html.NodeType(7).
  • Parent: the node (*html.Node) that contains the attribute
  • FirstChild, LastChild: nil
  • PrevSibling, NextSibling: the previous or next attribute node (*html.Node) of the current one
  • Data: attribute key (string).
  • DataAtom: atomized Data (atom.Atom)
  • Namespace: "" (empty string)
  • Attr: contains only one html.Attribute item, which holds the key-value pair for the attribute.
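
As a rough illustration of the field layout above, the sketch below builds attribute nodes for an element and links them as siblings. `Node` and `Attribute` are simplified local stand-ins for html.Node and html.Attribute so that the example needs only the standard library, and `attributeNodes` is a hypothetical helper. None of this is Rabbit's actual code.

```go
package main

import "fmt"

// Attribute and Node are simplified stand-ins for html.Attribute and
// html.Node; only the fields relevant to attribute nodes are modeled.
type Attribute struct {
	Key, Val string
}

type Node struct {
	Type                     int
	Parent                   *Node
	PrevSibling, NextSibling *Node
	Data                     string
	Attr                     []Attribute
}

// attrNodeType is the custom node type value named in the list above.
const attrNodeType = 7

// attributeNodes builds one attribute node per attribute of elem and
// links neighbours through PrevSibling and NextSibling.
func attributeNodes(elem *Node) []*Node {
	nodes := make([]*Node, 0, len(elem.Attr))
	for _, a := range elem.Attr {
		n := &Node{
			Type:   attrNodeType,
			Parent: elem,
			Data:   a.Key,          // Data holds the attribute key
			Attr:   []Attribute{a}, // exactly one key-value pair
		}
		if len(nodes) > 0 {
			prev := nodes[len(nodes)-1]
			prev.NextSibling = n
			n.PrevSibling = prev
		}
		nodes = append(nodes, n)
	}
	return nodes
}

func main() {
	a := &Node{Data: "a", Attr: []Attribute{{"href", "/home"}, {"class", "nav"}}}
	for _, n := range attributeNodes(a) {
		fmt.Println(n.Type, n.Data, n.Attr[0].Val)
	}
}
```

Each attribute node keeps a pointer to its element in Parent, so a reverse step such as parent:: can go from an attribute back to its element.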

Documents that are not well-formed will be transformed

The Rabbit language uses the /x/net/html package for parsing HTML, so the type of a selected node is *html.Node. One thing you should know is that the /x/net/html package wraps a document in html, head, and body tags if it is not well-formed.

For example, if your document looks like this:

<div>
  ...
</div>

the /x/net/html package transforms the document into this internally:

<html>
  <head></head>
  <body>
    <div>
      ...
    </div>
  </body>
</html>

So, in this example, the XPath expression /div has no result because the root node is html, not div. Keep this in mind to avoid confusion.
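
To make the reasoning concrete, here is a small Go sketch that walks absolute child steps over a hand-built copy of the wrapped tree above. The `node` type and `selectChildren` function are hypothetical stand-ins written for this illustration. They model only child steps and are not how Rabbit evaluates paths.

```go
package main

import "fmt"

// node is a simplified stand-in for *html.Node: a tag name plus children.
type node struct {
	name     string
	children []*node
}

// selectChildren follows an absolute path of child steps from the document
// root, e.g. ("html", "body", "div") for /html/body/div.
func selectChildren(doc *node, steps ...string) []*node {
	current := []*node{doc}
	for _, step := range steps {
		var next []*node
		for _, n := range current {
			for _, c := range n.children {
				if c.name == step {
					next = append(next, c)
				}
			}
		}
		current = next
	}
	return current
}

func main() {
	// The tree the parser produces for the <div> fragment above.
	div := &node{name: "div"}
	doc := &node{children: []*node{
		{name: "html", children: []*node{
			{name: "head"},
			{name: "body", children: []*node{div}},
		}},
	}}

	fmt.Println(len(selectChildren(doc, "div")))                 // prints 0
	fmt.Println(len(selectChildren(doc, "html", "body", "div"))) // prints 1
}
```

With the real parser, /html/body/div or //div selects the element instead.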