Releases: James-LG/Skyscraper
v0.6.4
v0.6.3
v0.6.2
Fixed
- fix(xpath): Adjust text methods to match lxml by @James-LG in #30
  - Adds an itertext() method to XpathItemTreeNode that behaves like lxml's itertext method.
  - Adjusted the XpathItemTreeNode::text() method so that it matches the behaviour of lxml's text() method.
  - XPath results are returned in document order.
  - Text nodes automatically unescape commonly escaped characters such as &gt; -> >
    - Added unescape_characters() and escape_characters() functions to easily move between them
Full Changelog: v0.6.1...v0.6.2
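To illustrate what unescaping and escaping text involves, here is a minimal sketch. The function names unescape_basic_entities and escape_basic_entities are hypothetical and written only for this example; they are not the crate's unescape_characters() and escape_characters(), and the set of entities handled (the five predefined XML entities) is an assumption.

```rust
/// Hypothetical helper, not Skyscraper's implementation: replaces the five
/// predefined XML entities with the characters they stand for.
fn unescape_basic_entities(input: &str) -> String {
    input
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        // `&amp;` is handled last so that `&amp;lt;` becomes `&lt;`, not `<`.
        .replace("&amp;", "&")
}

/// Hypothetical inverse of `unescape_basic_entities`.
fn escape_basic_entities(input: &str) -> String {
    input
        // `&` is handled first so the ampersands added by the later
        // replacements are not escaped a second time.
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
        .replace('\'', "&apos;")
}
```

The order of the replacements matters in both directions: handling the ampersand at the wrong point would unescape or escape some inputs twice.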
v0.6.1
v0.6.0
Changelog
What's Changed
- BREAKING: Complete xpath module rewrite by @James-LG in #24
  - Fixed #17: Allow the selection of text with xpath expressions, e.g. //div/text()
  - Fixed #15: Allow the selection of attributes with xpath expressions, e.g. //a/@href
  - Fixes the behaviour of indexes in xpath expressions, e.g. //div/span[1]
  - New implementation follows the official XPath specification as closely as possible.
Full Changelog: v0.5.1...v0.6.0
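The index fix relates to the fact that XPath positions are 1-based, so //div/span[1] refers to the first span rather than the second. The sketch below shows that mapping onto a 0-based Rust slice. The helper item_at_xpath_position is hypothetical and is not part of the crate.

```rust
/// Hypothetical helper: returns the item at a 1-based XPath position,
/// or `None` when the position does not refer to any item.
fn item_at_xpath_position<T>(items: &[T], position: usize) -> Option<&T> {
    // Position 0 never matches in XPath; `checked_sub` turns it into `None`
    // instead of underflowing.
    let index = position.checked_sub(1)?;
    items.get(index)
}
```

With a slice of two spans, position 1 yields the first span, while positions 0 and 3 yield nothing.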
v0.5.x -> 0.6.0 Migration Guide
A quick guide to upgrading through some of the major breaking changes introduced in v0.6.0.
Item Type
The biggest change is the return type. Previously, it was a list of items that could each be either an HtmlTag or an HtmlText. Now the items are a more complex type that follows the XPath specification.
Below is an overview of the returned item type, XpathItem:
/// https://www.w3.org/TR/xpath-datamodel-31/#dt-item
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum XpathItem<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-node
    Node(Node<'tree>),

    /// A function item.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-function-item
    Function(Function),

    /// An atomic value.
    ///
    /// https://www.w3.org/TR/xpath-datamodel-31/#dt-atomic-value
    AnyAtomicType(AnyAtomicType),
}

/// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
///
/// https://www.w3.org/TR/xpath-datamodel-31/#dt-node
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum Node<'tree> {
    /// A node in the [`XpathItemTree`](crate::xpath::XpathItemTree).
    TreeNode(XpathItemTreeNode<'tree>),

    /// A node that is not part of an [`XpathItemTree`](crate::xpath::XpathItemTree).
    NonTreeNode(NonTreeXpathNode),
}

/// Nodes that are not part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash, EnumExtract)]
pub enum NonTreeXpathNode {
    /// An attribute node.
    AttributeNode(AttributeNode),

    /// A namespace node.
    NamespaceNode(NamespaceNode),
}

/// A node in the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Clone, Hash)]
pub struct XpathItemTreeNode<'a> {
    id: NodeId,

    /// The data associated with this node.
    pub data: &'a XpathItemTreeNodeData,
}

/// Nodes that are part of the [`XpathItemTree`].
#[derive(PartialEq, PartialOrd, Eq, Ord, Debug, Hash, EnumExtract)]
pub enum XpathItemTreeNodeData {
    /// The root node of the document.
    DocumentNode(XpathDocumentNode),

    /// An element node.
    ///
    /// HTML tags are represented as element nodes.
    ElementNode(ElementNode),

    /// A processing instruction node.
    PINode(PINode),

    /// A comment node.
    CommentNode(CommentNode),

    /// A text node.
    TextNode(TextNode),
}
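One way to work with a nested item type like this is a single match with nested patterns. The sketch below uses simplified stand-in enums (DemoItem, DemoNode, DemoTreeData, DemoNonTree) that only mirror the shape of the types above. They are assumptions made for illustration, not the crate's definitions, and they omit lifetimes, node ids and most variants.

```rust
// Simplified stand-ins that mirror the nesting of the enums above.
// They are NOT the crate's types.
enum DemoItem {
    Node(DemoNode),
    AnyAtomicType(String),
}

enum DemoNode {
    TreeNode(DemoTreeData),
    NonTreeNode(DemoNonTree),
}

enum DemoTreeData {
    Element { name: String },
    Text(String),
}

enum DemoNonTree {
    Attribute { name: String, value: String },
}

/// Describes an item by matching through every level of nesting at once.
fn describe(item: &DemoItem) -> String {
    match item {
        DemoItem::Node(DemoNode::TreeNode(DemoTreeData::Element { name })) => {
            format!("element <{name}>")
        }
        DemoItem::Node(DemoNode::TreeNode(DemoTreeData::Text(text))) => {
            format!("text {text:?}")
        }
        DemoItem::Node(DemoNode::NonTreeNode(DemoNonTree::Attribute { name, value })) => {
            format!("attribute {name}={value:?}")
        }
        DemoItem::AnyAtomicType(value) => format!("atomic value {value}"),
    }
}
```

Because the match is exhaustive, the compiler flags any variant that is added later but not handled.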
Xpath Item Tree
To facilitate the new XpathItem type, xpath expressions must now be passed an XpathItemTree rather than an HtmlDocument.
XpathItemTree implements From<&HtmlDocument>, so you can easily generate an XpathItemTree from a reference to an HtmlDocument. Note that this is a fairly expensive operation, so you probably only want to perform it once per HtmlDocument if possible.
let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let results = expr.apply(&html_document)?;
+ let xpath_item_tree = XpathItemTree::from(&html_document);
+ let results = expr.apply(&xpath_item_tree)?;
Getting Text
Text nodes are a type of TreeNode. You can either match on the item, or use these convenient as_[variant] functions.
Other changes:
- The function to retrieve text was renamed from get_text to just text, and get_all_text to all_text.
- The function now returns a String rather than an Option<String>.
- let text = item.get_text(&html_document).unwrap();
+ let text = item.as_node()?.as_tree_node()?.text(&xpath_item_tree);
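The as_[variant] chain with ? shown above can be pictured with hand-written accessors. This is a sketch under assumptions: MiniItem, MiniNode and WrongVariant are hypothetical stand-ins, and it assumes the accessors return a Result, which the ? in the snippet suggests but this guide does not state. The crate's real accessors come from its EnumExtract derive and may differ.

```rust
/// Hypothetical error returned when an item is not the requested variant.
#[derive(Debug, PartialEq)]
struct WrongVariant(&'static str);

// Stand-in types for illustration only; not the crate's definitions.
enum MiniItem {
    Node(MiniNode),
    Atomic(String),
}

enum MiniNode {
    Text(String),
    Element(String),
}

impl MiniItem {
    /// Borrows the inner node, or reports which variant was expected.
    fn as_node(&self) -> Result<&MiniNode, WrongVariant> {
        match self {
            MiniItem::Node(node) => Ok(node),
            _ => Err(WrongVariant("Node")),
        }
    }
}

impl MiniNode {
    /// Borrows the text content, or reports which variant was expected.
    fn as_text(&self) -> Result<&str, WrongVariant> {
        match self {
            MiniNode::Text(text) => Ok(text.as_str()),
            _ => Err(WrongVariant("Text")),
        }
    }
}

/// Chains the accessors; `?` returns early on the first mismatch.
fn text_of(item: &MiniItem) -> Result<&str, WrongVariant> {
    item.as_node()?.as_text()
}
```

Compared with a match, the accessor chain is shorter when only one variant is of interest, at the cost of handling an error for every other variant.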
Getting Attributes
Attribute nodes are a type of NonTreeNode. You can now either select these directly in the xpath expression using the attribute axis (new feature), or you can get them from an ElementNode.
- let attribute = item.get_attributes().unwrap().get("href").unwrap();
+ let element = item.as_node()?.as_tree_node()?.data.as_element_node()?;
+ let attribute = element.get_attribute("href").unwrap();
Alternatively, use xpath to select the attribute node directly:
- let expr = xpath::parse("//td[@class='something']//span").unwrap();
- let items = expr.apply(&html_document)?;
- let attribute = items[0].get_attributes().unwrap().get("href").unwrap();
+ let expr = xpath::parse("//td[@class='something']//span/@href").unwrap();
+ let items = expr.apply(&xpath_item_tree)?;
+ let attribute = items[0].as_node()?.as_non_tree_node()?.as_attribute_node()?.value;
v0.5.1
v0.5.0
v0.4.0
v0.3.1
v0.3.0
Fixes:
- Apply 1-based indexing by @dyens in #4
- Allow root nodes to be selected with // searches by @James-LG in #8
Features:
New Contributors
Full Changelog: v0.2.1...v0.3.0