BREAKING: Complete xpath module rewrite #24

Merged
merged 45 commits into from
Jan 3, 2024

James-LG
Owner

@James-LG James-LG commented Dec 29, 2023

The goal of this rewrite is to bring the implementation of the xpath module in line with the official xpath specification as defined in https://www.w3.org/TR/2017/REC-xpath-31-20170321/.

The main advantage is that supporting more features becomes easier when you can follow the spec (obviously!).

One of the main limitations of the old xpath module was that it could only return "Text" or "Tag" nodes, which meant there was no way to select other things that xpath supports, such as attributes. This rewrite makes that possible, at the cost of some added complexity in the return types.

Fixes #17
Fixes #15

It also fixes indexing, which was previously applied to the combined set of items after every step rather than per parent node, as mentioned in #21.
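
To illustrate the indexing difference, here is a minimal sketch that does not use the crate itself. It models the matches of one step as plain string slices grouped by parent, and the function names `index_global` and `index_per_parent` are hypothetical. For an expression such as `//tr/td[1]`, the spec expects the first `td` of every `tr`, not the first `td` overall.

```rust
/// Assumed old behaviour: flatten every parent's matches, then apply the
/// 1-based XPath index once to the combined list.
fn index_global<'a>(groups: &[Vec<&'a str>], index: usize) -> Vec<&'a str> {
    // XPath positions start at 1, so an index of 0 selects nothing.
    let Some(i) = index.checked_sub(1) else {
        return Vec::new();
    };
    groups.iter().flatten().copied().nth(i).into_iter().collect()
}

/// Spec behaviour: apply the 1-based index separately within each parent's
/// matches and keep whatever each parent yields.
fn index_per_parent<'a>(groups: &[Vec<&'a str>], index: usize) -> Vec<&'a str> {
    let Some(i) = index.checked_sub(1) else {
        return Vec::new();
    };
    groups
        .iter()
        .filter_map(|children| children.get(i).copied())
        .collect()
}
```

With two rows of two cells each, `index_global(&rows, 1)` yields only the first cell of the first row, while `index_per_parent(&rows, 1)` yields the first cell of both rows.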

* Also removes XPathResult and makes all expressions return XpathItemSet
@James-LG
Owner Author

James-LG commented Dec 29, 2023

TODO:

  • List of supported (or unsupported) xpath features.
  • Migration guide
  • Replace any crate git dependencies with versioned dependencies

@James-LG
Owner Author

James-LG commented Jan 3, 2024

Migration Guide Draft

Item Type

The biggest change is the return type. Previously it was a list of items, each of which was either an HtmlTag or an HtmlText. Now each item is a more complex type that follows the XPath specification.

Below is an overview of the returned item type XpathItem:

/// https://www.w3.org/TR/xpath-datamodel-31/#dt-item
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum XpathItem<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    ///
    ///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
    Node(Node<'tree>),

    /// A function item.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-function-item
    Function(Function),

    /// An atomic value.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-atomic-value
    AnyAtomicType(AnyAtomicType),
}

/// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
///
///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum Node<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    TreeNode(XpathItemTreeNode<'tree>),

    /// A node that is not part of an [`XpathItemTree`](crate::xpath::XpathItemTree).
    NonTreeNode(NonTreeXpathNode),
}

/// Nodes that are not part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum NonTreeXpathNode {
    /// An attribute node.
    AttributeNode(AttributeNode),

    /// A namespace node.
    NamespaceNode(NamespaceNode),
}

/// A node in the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash)]
pub struct XpathItemTreeNode<'a> {
    id: NodeId,

    /// The data associated with this node.
    pub data: &'a XpathItemTreeNodeData,
}

/// Nodes that are part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Hash, EnumExtract)]
pub enum XpathItemTreeNodeData {
    /// The root node of the document.
    DocumentNode(XpathDocumentNode),

    /// An element node.
    ///
    /// HTML tags are represented as element nodes.
    ElementNode(ElementNode),

    /// A processing instruction node.
    PINode(PINode),

    /// A comment node.
    CommentNode(CommentNode),

    /// A text node.
    TextNode(TextNode),
}
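
As a rough illustration of how code might drill down through these layers with `match`, here is a sketch built on simplified stand-in types. The stand-ins are assumptions made for the example: they drop lifetimes and node ids, collapse `XpathItemTreeNode` and its `data` field into a single enum, and invent the payload fields. The helper `describe` is hypothetical and not part of the crate.

```rust
// Simplified stand-ins for the enums above (payload shapes are assumed).
#[allow(dead_code)]
enum XpathItem {
    Node(Node),
    Function,
    AnyAtomicType(String),
}

#[allow(dead_code)]
enum Node {
    TreeNode(TreeNodeData),
    NonTreeNode(NonTreeXpathNode),
}

#[allow(dead_code)]
enum NonTreeXpathNode {
    AttributeNode { name: String, value: String },
    NamespaceNode,
}

#[allow(dead_code)]
enum TreeNodeData {
    DocumentNode,
    ElementNode { name: String },
    PINode,
    CommentNode(String),
    TextNode(String),
}

/// Hypothetical helper: summarise an item by matching through each layer.
fn describe(item: &XpathItem) -> String {
    match item {
        XpathItem::Node(Node::TreeNode(data)) => match data {
            TreeNodeData::DocumentNode => "document".to_string(),
            TreeNodeData::ElementNode { name } => format!("element <{name}>"),
            TreeNodeData::PINode => "processing instruction".to_string(),
            TreeNodeData::CommentNode(c) => format!("comment: {c}"),
            TreeNodeData::TextNode(t) => format!("text: {t}"),
        },
        XpathItem::Node(Node::NonTreeNode(node)) => match node {
            NonTreeXpathNode::AttributeNode { name, value } => {
                format!("attribute {name}={value}")
            }
            NonTreeXpathNode::NamespaceNode => "namespace".to_string(),
        },
        XpathItem::Function => "function".to_string(),
        XpathItem::AnyAtomicType(v) => format!("atomic value: {v}"),
    }
}
```

Because every `match` here is exhaustive, the compiler would flag any variant left unhandled, which is the main trade-off against the shorter accessor style.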

Xpath Item Tree

To facilitate the new XpathItem type, xpath expressions now must be passed an XpathItemTree rather than an HtmlDocument.

XpathItemTree implements From<&HtmlDocument>, so you can easily generate an XpathItemTree from a reference to an HtmlDocument. Note that this conversion is fairly expensive, so you probably want to perform it only once per HtmlDocument where possible and reuse the resulting tree.

let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let results = expr.apply(&html_document)?;
+ let xpath_item_tree = XpathItemTree::from(&html_document);
+ let results = expr.apply(&xpath_item_tree)?;

Getting Text

Text nodes are a type of TreeNode. You can either match on the item or use the convenient as_[variant] accessor functions.

Other changes:

  • The function to retrieve text was renamed from get_text to just text, and get_all_text to all_text.
  • The function now returns a String rather than an Option<String>.
- let text = item.get_text(&html_document).unwrap();
+ let text = item.as_node()?.as_tree_node()?.text(&page);
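
For readers unfamiliar with this accessor style, the sketch below shows roughly how an `as_[variant]` chain combined with `?` can work. It uses hand-written stand-in types rather than the crate's derived ones; the `WrongVariant` error type, the `String` payloads, and the helper `tree_node_text` are all assumptions made for illustration, and the real accessors may differ in signature.

```rust
// Simplified stand-ins; the real types carry lifetimes and more variants.
#[allow(dead_code)]
enum XpathItem {
    Node(Node),
    AnyAtomicType(String),
}

#[allow(dead_code)]
enum Node {
    TreeNode(String),
    NonTreeNode(String),
}

/// Hypothetical error naming the variant that was expected but not found.
#[derive(Debug, PartialEq)]
struct WrongVariant(&'static str);

impl XpathItem {
    /// Borrow the inner node, or fail if this item is not a node.
    fn as_node(&self) -> Result<&Node, WrongVariant> {
        match self {
            XpathItem::Node(node) => Ok(node),
            _ => Err(WrongVariant("Node")),
        }
    }
}

impl Node {
    /// Borrow the tree node payload, or fail for non-tree nodes.
    fn as_tree_node(&self) -> Result<&String, WrongVariant> {
        match self {
            Node::TreeNode(text) => Ok(text),
            _ => Err(WrongVariant("TreeNode")),
        }
    }
}

/// Each `?` returns early with the error if the item is the wrong variant.
fn tree_node_text(item: &XpathItem) -> Result<&str, WrongVariant> {
    Ok(item.as_node()?.as_tree_node()?.as_str())
}
```

The chain reads top-down like a path through the enum layers, at the cost of deferring the "wrong variant" case to a runtime error rather than a compile-time exhaustiveness check.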

Getting Attributes

Attribute nodes are a type of NonTreeNode. You can now either select these directly in the xpath expression using the attribute axis (new feature), or you can get them from an ElementNode.

- let attribute = item.get_attributes().unwrap().get("href").unwrap();
+ let element = item.as_node()?.as_tree_node()?.data.as_element_node()?;
+ let attribute = element.get_attribute("href").unwrap();

Alternatively, use xpath to select the attribute node directly:

- let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let items = expr.apply(&html_document)?;
- let attribute = items[0].get_attributes().unwrap().get("href").unwrap();
+ let expr = xpath::parse("//td[@class='something']//span/@href").unwrap();
+ let items = expr.apply(&xpath_item_tree)?;
+ let attribute = items[0].as_node()?.as_non_tree_node()?.as_attribute_node()?.value;

@James-LG James-LG merged commit d33595c into master Jan 3, 2024
1 check passed
@James-LG James-LG deleted the james/nom branch January 3, 2024 23:51
James-LG added a commit that referenced this pull request Feb 10, 2024
BREAKING: Complete xpath module rewrite
Development

Successfully merging this pull request may close these issues.

xpath: how do you select the text of a node?
xpath: Cannot select attribute?