Releases: James-LG/Skyscraper

v0.6.4

15 Jun 01:50
1c9bbc8

What's Changed

  • fix(html): Allow verbose doctype declaration by @James-LG in #33

Full Changelog: v0.6.3...v0.6.4

v0.6.3

16 Mar 16:48
3a5a3be

What's Changed

Full Changelog: v0.6.2...v0.6.3

v0.6.2

18 Feb 18:10
3b4e04b

Fixed

  • fix(xpath): Adjust text methods to match lxml by @James-LG in #30
    • Adds an itertext() method to XpathItemTreeNode that behaves like lxml's itertext method.
    • Adjusted XpathItemTreeNode::text() method such that it matches the behaviour of lxml's text() method.
    • XPath results are now returned in document order.
    • Text nodes automatically unescape commonly escaped characters, such as &gt; -> >.
      • Added unescape_characters() and escape_characters() functions to easily move between them
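
As a rough illustration of the escaping behaviour described above, the following is a minimal std-only sketch. The names `unescape_sketch` and `escape_sketch` are hypothetical stand-ins, and the set of entities handled here is an assumption; the crate's own `unescape_characters()` and `escape_characters()` may have different signatures and cover more cases.

```rust
/// Hypothetical stand-in for the crate's `unescape_characters()`; the real
/// signature and the full set of handled entities may differ.
fn unescape_sketch(text: &str) -> String {
    // Replace `&amp;` last so that "&amp;gt;" becomes "&gt;" and not ">".
    text.replace("&gt;", ">")
        .replace("&lt;", "<")
        .replace("&quot;", "\"")
        .replace("&amp;", "&")
}

/// Hypothetical inverse of `unescape_sketch`.
fn escape_sketch(text: &str) -> String {
    // Replace `&` first so the ampersands introduced below are not escaped twice.
    text.replace('&', "&amp;")
        .replace('>', "&gt;")
        .replace('<', "&lt;")
        .replace('"', "&quot;")
}
```

The order of the replacements matters in both directions: handling the ampersand at the wrong point would double-escape or double-unescape text that already contains an entity.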

Full Changelog: v0.6.1...v0.6.2

v0.6.1

16 Feb 01:48
7054e41

What's Changed

  • fix: Ensure stack does not overflow on Windows by @James-LG in #27

Full Changelog: v0.6.0...v0.6.1

v0.6.0

04 Jan 00:00
d33595c

Changelog

What's Changed

  • BREAKING: Complete xpath module rewrite by @James-LG in #24
    • Fixed #17: Allow the selection of text with xpath expressions. e.g. //div/text()
    • Fixed #15: Allow the selection of attributes with xpath expressions. e.g. //a/@href
    • Fixes the behaviour of indexes in xpath expressions. e.g. //div/span[1]
    • New implementation follows the official XPath specification as closely as possible.
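
For readers unfamiliar with XPath indexes: positions in a predicate such as `//div/span[1]` are 1-based and count matches within the current context. The sketch below shows only that convention over a plain slice; `select_position` is a hypothetical helper, not part of the crate.

```rust
/// Select the item at a 1-based XPath position, as in `//div/span[1]`.
/// Hypothetical helper for illustration; not part of the crate.
fn select_position<T>(items: &[T], position: usize) -> Option<&T> {
    // XPath positions start at 1, so position 0 never matches.
    position.checked_sub(1).and_then(|index| items.get(index))
}
```

With two sibling spans, position 1 yields the first and position 2 the second; positions 0 and 3 yield nothing.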

Full Changelog: v0.5.1...v0.6.0


v0.5.x -> v0.6.0 Migration Guide

A quick guide to upgrading through some of the major breaking changes introduced in v0.6.0.

Item Type

The biggest change is the return type. Previously, an expression returned a list of items that were each either an HtmlTag or an HtmlText. Items are now a more complex type that follows the XPath specification.

Below is an overview of the returned item type XpathItem:

/// https://www.w3.org/TR/xpath-datamodel-31/#dt-item
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum XpathItem<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    ///
    ///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
    Node(Node<'tree>),

    /// A function item.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-function-item
    Function(Function),

    /// An atomic value.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-atomic-value
    AnyAtomicType(AnyAtomicType),
}

/// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
///
///  https://www.w3.org/TR/xpath-datamodel-31/#dt-node
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum Node<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    TreeNode(XpathItemTreeNode<'tree>),

    /// A node that is not part of an [`XpathItemTree`](crate::xpath::XpathItemTree).
    NonTreeNode(NonTreeXpathNode),
}

/// Nodes that are not part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum NonTreeXpathNode {
    /// An attribute node.
    AttributeNode(AttributeNode),

    /// A namespace node.
    NamespaceNode(NamespaceNode),
}

/// A node in the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash)]
pub struct XpathItemTreeNode<'a> {
    id: NodeId,

    /// The data associated with this node.
    pub data: &'a XpathItemTreeNodeData,
}

/// Nodes that are part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Hash, EnumExtract)]
pub enum XpathItemTreeNodeData {
    /// The root node of the document.
    DocumentNode(XpathDocumentNode),

    /// An element node.
    ///
    /// HTML tags are represented as element nodes.
    ElementNode(ElementNode),

    /// A processing instruction node.
    PINode(PINode),

    /// A comment node.
    CommentNode(CommentNode),

    /// A text node.
    TextNode(TextNode),
}
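
To show how the layers nest in practice, here is a small sketch using simplified stand-in enums. These are not the crate's definitions: lifetimes, node ids, and most variants are omitted, and the `as_node` accessor is written by hand and returns an `Option`, which is an assumption about shape only.

```rust
// Simplified stand-ins for the crate's types: lifetimes, ids, and most
// variants are omitted, and the accessor method is written by hand.
#[derive(Debug, PartialEq)]
enum TreeNodeData {
    Element { name: String },
    Text(String),
}

#[derive(Debug, PartialEq)]
enum NonTreeNode {
    Attribute { name: String, value: String },
}

#[derive(Debug, PartialEq)]
enum Node {
    TreeNode(TreeNodeData),
    NonTreeNode(NonTreeNode),
}

#[derive(Debug, PartialEq)]
enum Item {
    Node(Node),
    Atomic(String),
}

impl Item {
    /// Hypothetical accessor in the spirit of the `as_[variant]` functions.
    fn as_node(&self) -> Option<&Node> {
        match self {
            Item::Node(node) => Some(node),
            _ => None,
        }
    }
}

/// Describe an item by matching through each layer of the hierarchy.
fn describe(item: &Item) -> String {
    match item {
        Item::Node(Node::TreeNode(TreeNodeData::Element { name })) => {
            format!("element <{name}>")
        }
        Item::Node(Node::TreeNode(TreeNodeData::Text(text))) => format!("text {text:?}"),
        Item::Node(Node::NonTreeNode(NonTreeNode::Attribute { name, value })) => {
            format!("attribute {name}={value:?}")
        }
        Item::Atomic(value) => format!("atomic {value:?}"),
    }
}
```

Matching on nested patterns handles every case in one place, while an accessor such as `as_node` is shorter when only one variant is expected.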

Xpath Item Tree

To support the new XpathItem type, xpath expressions must now be applied to an XpathItemTree rather than an HtmlDocument.

XpathItemTree implements From<&HtmlDocument>, so you can generate an XpathItemTree from a reference to an HtmlDocument. Note that the conversion is relatively expensive, so perform it once per HtmlDocument where possible and reuse the resulting tree.

let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let results = expr.apply(&html_document)?;
+ let xpath_item_tree = XpathItemTree::from(&html_document);
+ let results = expr.apply(&xpath_item_tree)?;
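
The build-once, query-many pattern can be sketched without the crate. `Document`, `Tree`, and `count_tag` below are hypothetical stand-ins for `HtmlDocument`, `XpathItemTree`, and applying an expression; only the `From<&T>` shape mirrors what the guide describes.

```rust
/// Stand-in for `HtmlDocument`: a flat list of tag names.
struct Document {
    tags: Vec<String>,
}

/// Stand-in for `XpathItemTree`: an index built from a document.
struct Tree {
    sorted_tags: Vec<String>,
}

impl From<&Document> for Tree {
    /// The conversion copies and sorts every tag, which models the
    /// one-time cost of building the tree.
    fn from(document: &Document) -> Self {
        let mut sorted_tags = document.tags.clone();
        sorted_tags.sort();
        Tree { sorted_tags }
    }
}

/// Hypothetical query that borrows an already-built tree.
fn count_tag(tree: &Tree, tag: &str) -> usize {
    tree.sorted_tags.iter().filter(|t| t.as_str() == tag).count()
}
```

Because the query takes `&Tree`, a single `Tree::from(&document)` call can serve any number of queries, which is the usage the note about cost recommends.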

Getting Text

Text nodes are a type of TreeNode. You can either match on the item or use the convenient as_[variant] functions (for example, as_node and as_tree_node).

Other changes:

  • The function to retrieve text was renamed from get_text to just text, and get_all_text to all_text.
  • The function now returns a String rather than an Option<String>.
- let text = item.get_text(&html_document).unwrap();
+ let text = item.as_node()?.as_tree_node()?.text(&xpath_item_tree);

Getting Attributes

Attribute nodes are a type of NonTreeNode. You can now either select them directly in the xpath expression using the attribute axis (a new feature), or get them from an ElementNode.

- let attribute = item.get_attributes().unwrap().get("href").unwrap();
+ let element = item.as_node()?.as_tree_node()?.data.as_element_node()?;
+ let attribute = element.get_attribute("href").unwrap();

Alternatively, use xpath to select the attribute node directly:

- let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let items = expr.apply(&html_document)?;
- let attribute = items[0].get_attributes().unwrap().get("href").unwrap();
+ let expr = xpath::parse("//td[@class='something']//span/@href").unwrap();
+ let items = expr.apply(&xpath_item_tree)?;
+ let attribute = items[0].as_node()?.as_non_tree_node()?.as_attribute_node()?.value;

v0.5.1

16 Dec 00:44
e1cb26f

What's Changed

  • Check if search index out of bounds by @masc-it in #21

New Contributors

Full Changelog: v0.5.0...v0.5.1

v0.5.0

22 Jul 17:28
f5439bc

What's Changed

Features

  • feat(html): Handle mismatched tags in #14
  • feat(html): Implements Display on HtmlDocument in #14
  • feat(xpath): Support wildcards in #18

Fixes

  • fix(html): Handle whitespace in tags in #19

Full Changelog: v0.4.0...v0.5.0

v0.4.0

28 Aug 21:36
3967080

Added

  • feat: Add clone to HtmlDocument and Xpath by @James-LG in #11

Full Changelog: v0.3.1...v0.4.0

v0.3.1

21 Jul 01:53
20b9108

Fixed

  • fix(html): Improve triangle bracket handling in text tokenizer by @James-LG in #10

Full Changelog: v0.3.0...v0.3.1

v0.3.0

09 Jul 16:09
40a90c1

Fixes:

  • Apply 1-based indexing by @dyens in #4
  • Allow root nodes to be selected with // searches by @James-LG in #8

Features:

  • Add get_attributes helper methods by @dyens in #7
  • Apply indexes in xpath::search by @James-LG in #8

New Contributors

  • @dyens made their first contribution in #4

Full Changelog: v0.2.1...v0.3.0