v0.6.0

@James-LG James-LG released this 04 Jan 00:00
· 28 commits to master since this release
d33595c

Changelog

What's Changed

  • BREAKING: Complete xpath module rewrite by @James-LG in #24
    • Fixed #17: Allow the selection of text with xpath expressions, e.g. //div/text()
    • Fixed #15: Allow the selection of attributes with xpath expressions, e.g. //a/@href
    • Fixed the behaviour of indexes in xpath expressions, e.g. //div/span[1]
    • New implementation follows the official XPath specification as closely as possible.

Full Changelog: v0.5.1...v0.6.0


v0.5.x -> 0.6.0 Migration Guide

A quick guide to upgrading through some of the major breaking changes introduced in v0.6.0.

Item Type

The biggest change is the return type. Previously, applying an expression returned a list of items that were each either an HtmlTag or an HtmlText. Now each item is an XpathItem, a more complex type that follows the XPath data model specification.

Below is an overview of the returned item type XpathItem:

/// https://www.w3.org/TR/xpath-datamodel-31/#dt-item
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum XpathItem<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    ///
    ///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
    Node(Node<'tree>),

    /// A function item.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-function-item
    Function(Function),

    /// An atomic value.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-atomic-value
    AnyAtomicType(AnyAtomicType),
}

/// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
///
///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum Node<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    TreeNode(XpathItemTreeNode<'tree>),

    /// A node that is not part of an [`XpathItemTree`](crate::xpath::XpathItemTree).
    NonTreeNode(NonTreeXpathNode),
}

/// Nodes that are not part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum NonTreeXpathNode {
    /// An attribute node.
    AttributeNode(AttributeNode),

    /// A namespace node.
    NamespaceNode(NamespaceNode),
}

/// A node in the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash)]
pub struct XpathItemTreeNode<'a> {
    id: NodeId,

    /// The data associated with this node.
    pub data: &'a XpathItemTreeNodeData,
}

/// Nodes that are part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Hash, EnumExtract)]
pub enum XpathItemTreeNodeData {
    /// The root node of the document.
    DocumentNode(XpathDocumentNode),

    /// An element node.
    ///
    /// HTML tags are represented as element nodes.
    ElementNode(ElementNode),

    /// A processing instruction node.
    PINode(PINode),

    /// A comment node.
    CommentNode(CommentNode),

    /// A text node.
    TextNode(TextNode),
}
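
To see how the nesting plays out in practice, here is a minimal, self-contained sketch that classifies an item by matching through each layer. The types are simplified stand-ins that mirror only the variant names from the overview above: payloads are reduced to strings, lifetimes are dropped, and the `describe` function is hypothetical. They are not the crate's actual definitions.

```rust
// Simplified stand-ins mirroring the variant names from the overview.
// Payloads are hypothetical; the real types carry more data.
#[derive(Debug)]
pub enum XpathItem {
    Node(Node),
    Function,
    AnyAtomicType(String),
}

#[derive(Debug)]
pub enum Node {
    TreeNode(TreeNodeData),
    NonTreeNode(NonTreeXpathNode),
}

#[derive(Debug)]
pub enum NonTreeXpathNode {
    AttributeNode { name: String, value: String },
    NamespaceNode,
}

#[derive(Debug)]
pub enum TreeNodeData {
    DocumentNode,
    ElementNode(String),
    PINode,
    CommentNode(String),
    TextNode(String),
}

/// Returns a short label for the kind of item, matching through every layer.
pub fn describe(item: &XpathItem) -> String {
    match item {
        XpathItem::Node(Node::TreeNode(data)) => match data {
            TreeNodeData::DocumentNode => "document".to_string(),
            TreeNodeData::ElementNode(name) => format!("element <{}>", name),
            TreeNodeData::PINode => "processing instruction".to_string(),
            TreeNodeData::CommentNode(_) => "comment".to_string(),
            TreeNodeData::TextNode(text) => format!("text {:?}", text),
        },
        XpathItem::Node(Node::NonTreeNode(node)) => match node {
            NonTreeXpathNode::AttributeNode { name, value } => {
                format!("attribute {}={:?}", name, value)
            }
            NonTreeXpathNode::NamespaceNode => "namespace".to_string(),
        },
        XpathItem::Function => "function".to_string(),
        XpathItem::AnyAtomicType(value) => format!("atomic {:?}", value),
    }
}
```

Because the match is exhaustive, the compiler flags any variant that is not handled, which is useful when an expression such as //a/@href may return attribute nodes instead of elements.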

Xpath Item Tree

To support the new XpathItem type, xpath expressions must now be applied to an XpathItemTree rather than an HtmlDocument.

XpathItemTree implements From<&HtmlDocument>, so you can generate an XpathItemTree from a reference to an HtmlDocument. Note that the conversion is fairly expensive, so build the tree once per HtmlDocument and reuse it where possible.

let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let results = expr.apply(&html_document)?;
+ let xpath_item_tree = XpathItemTree::from(&html_document);
+ let results = expr.apply(&xpath_item_tree)?;
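
The convert-once advice can be sketched as follows. This is a self-contained illustration, not the crate's code: `HtmlDocument` and `XpathItemTree` here are hypothetical stand-ins that keep only the From<&HtmlDocument> relationship described above, and the `apply` and `run_all` helpers are invented for the example.

```rust
// Hypothetical stand-ins: only the From<&HtmlDocument> relationship comes
// from the guide; the fields and helper functions are invented.
pub struct HtmlDocument {
    pub source: String,
}

pub struct XpathItemTree {
    pub size: usize,
}

impl From<&HtmlDocument> for XpathItemTree {
    fn from(doc: &HtmlDocument) -> Self {
        // Stands in for the expensive conversion by scanning the whole
        // document; the count of '<' characters is a crude size measure.
        XpathItemTree {
            size: doc.source.matches('<').count(),
        }
    }
}

/// Stand-in for applying one expression to an already-built tree.
pub fn apply(expr: &str, tree: &XpathItemTree) -> String {
    format!("{} over tree of size {}", expr, tree.size)
}

/// Builds the tree once, then reuses it for every expression.
pub fn run_all(doc: &HtmlDocument, exprs: &[&str]) -> Vec<String> {
    let tree = XpathItemTree::from(doc); // one conversion per document
    exprs.iter().map(|expr| apply(expr, &tree)).collect()
}
```

The point of the shape is that the conversion sits outside the loop over expressions, so its cost is paid once per document however many expressions are applied.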

Getting Text

Text nodes are a type of TreeNode. You can either match on the item, or use the convenient as_[variant] functions (as_node, as_tree_node, and so on).

Other changes:

  • The functions to retrieve text were renamed: get_text is now text, and get_all_text is now all_text.
  • text now returns a String rather than an Option<String>.

- let text = item.get_text(&html_document).unwrap();
+ let text = item.as_node()?.as_tree_node()?.text(&xpath_item_tree);
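
The two styles mentioned above, matching versus chaining as_[variant] accessors, can be compared in a small self-contained sketch. The types and accessors below are hand-written stand-ins, not the crate's generated code. The sketch assumes each accessor returns a Result so that it can be chained with `?`, and uses a plain String as the error type for brevity.

```rust
// Hand-written stand-ins for the kind of `as_[variant]` accessors used in
// the guide. Assumption: each accessor returns a Result; the String error
// type is chosen only to keep the example short.
pub enum Item {
    Node(Node),
    Atomic(String),
}

pub enum Node {
    Tree(String),    // stand-in for a tree node; holds its text
    NonTree(String), // stand-in for an attribute or namespace node
}

impl Item {
    pub fn as_node(&self) -> Result<&Node, String> {
        match self {
            Item::Node(node) => Ok(node),
            Item::Atomic(_) => Err("item is not a node".to_string()),
        }
    }
}

impl Node {
    pub fn as_tree_node(&self) -> Result<&String, String> {
        match self {
            Node::Tree(text) => Ok(text),
            Node::NonTree(_) => Err("node is not a tree node".to_string()),
        }
    }
}

/// Accessor style: each `?` returns early with an error on a variant mismatch.
pub fn text_via_accessors(item: &Item) -> Result<String, String> {
    Ok(item.as_node()?.as_tree_node()?.clone())
}

/// Match style: the same lookup written as one nested pattern.
pub fn text_via_match(item: &Item) -> Option<String> {
    match item {
        Item::Node(Node::Tree(text)) => Some(text.clone()),
        _ => None,
    }
}
```

The accessor chain reports which layer failed to match, while the nested pattern is shorter when a mismatch only needs to be skipped.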

Getting Attributes

Attribute nodes are a type of NonTreeNode. You can now either select these directly in the xpath expression using the attribute axis (new feature), or you can get them from an ElementNode.

- let attribute = item.get_attributes().unwrap().get("href").unwrap();
+ let element = item.as_node()?.as_tree_node()?.data.as_element_node()?;
+ let attribute = element.get_attribute("href").unwrap();

Alternatively, use xpath to select the attribute node directly:

- let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let items = expr.apply(&html_document)?;
- let attribute = items[0].get_attributes().unwrap().get("href").unwrap();
+ let expr = xpath::parse("//td[@class='something']//span/@href").unwrap();
+ let items = expr.apply(&xpath_item_tree)?;
+ let attribute = items[0].as_node()?.as_non_tree_node()?.as_attribute_node()?.value;