From d47adea4beb8031b1dbe3d53881b63827e51744a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?St=C3=A9phaneDucasse?= Date: Thu, 16 Mar 2023 14:12:13 +0100 Subject: [PATCH] fixing the migration to microdown --- Chapters/Scraping.pillar | 238 ------------------ Chapters/Scraping2.pillar | 179 -------------- Chapters/XPath.pillar | 502 -------------------------------------- index.pillar => index.md | 10 +- pillar.conf | 4 +- 5 files changed, 7 insertions(+), 926 deletions(-) delete mode 100644 Chapters/Scraping.pillar delete mode 100644 Chapters/Scraping2.pillar delete mode 100644 Chapters/XPath.pillar rename index.pillar => index.md (83%) diff --git a/Chapters/Scraping.pillar b/Chapters/Scraping.pillar deleted file mode 100644 index f58039f..0000000 --- a/Chapters/Scraping.pillar +++ /dev/null @@ -1,238 +0,0 @@ -!! Scraping HTML - -Internet pages provide a lot of information and often you would like to be able to access and manipulate it in another form than HTML: HTML is just plain verbose. What you would like is to get access to only the information you are interested in and get the results in a form that you can easily build more software. This is the objective of HTML scraping. In Pharo you can scrape web pages using different libraries such as XMLParser and SOUP. -In this chapter we will show you how we can do that using XMLParser to locate and collect the data we need and JSON to format and output the information. - -This chapter has been originally written by Peter Kenny and we thank him for sharing with the community this little tutorial. - -!!! Getting started -You can use the Catalog browser to load XMLParserHTML and NeoJSON just execute the following expressions: - -[[[ -Metacello new - baseline: 'XMLParserHTML'; - repository: 'github://pharo-contributions/XML-XMLParserHTML:1.6.x/src'; - load. -]]] - -[[[ -Metacello new - baseline: 'XPath'; - repository: 'github://pharo-contributions/XML-XPath:2.2.x/src'; - load. 
-]]] - -[[[ -Metacello new - repository: 'github://svenvc/NeoJSON/repository'; - baseline: 'NeoJSON'; - load. -]]] - -%[[[ -%Gofer it -% smalltalkhubUser: 'PharoExtras' project: 'XMLParserHTML'; -% configurationOf: 'XMLParserHTML'; -% loadStable. -%]]] - -%[[[ -%Gofer it -% smalltalkhubUser: 'PharoExtras' project: 'XPath'; -% configurationOf: 'XPath'; -% loadStable. -%]]] - -%[[[ -%Gofer it -% smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo'; -% configurationOf: 'NeoJSON'; -% loadStable. -%]]] - - - - -!!! Define the Problem -This tutorial is based on a real-life problem. We need to consult a database published by the US Department of Agriculture, extract data for over 8000 food ingredients and their nutrient contents, and output the results as a JSON file. The main list of ingredients can be found at the following url: *https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference* (as shown in Figure *@figfood*). - -The web site has disappeared since we wrote this tutorial, so we suggest trying the Wayback Machine archive of the site (not tested): -*https://web.archive.org/web/20150324141455/http://ndb.nal.usda.gov/ndb/foods?format=&count=&max=35&sort=&fgcd=&manu=&lfacet=&qlookup=&offset=140&order=desc* - -In addition, we archived a limited set of files. You can find the HTML version of the file in the github repository of this book *https://github.com/SquareBracketAssociates/Booklet-Scraping/* under the folder resources (*https://github.com/SquareBracketAssociates/Booklet-Scraping/tree/master/resources*). - -+Food list.>file://figures/food.png|width=100|label=figfood+ - -This table shows the first 50 rows, each corresponding to an ingredient. The table shows the NDB number, description and food group for each ingredient. Clicking on the number or description leads to a detailed table for the ingredient. This table comes in two forms, basic details and full details, and the information we want is in the full details. 
The full detailed table for the first ingredient can be found at the url: -*https://ndb.nal.usda.gov/ndb/foods/show/1?format=Full* (as shown in Figure *@figfood2*). - - -+Food details - Salted Butter.>file://figures/food2.png|width=100|label=figfood2+ - - -There are two areas of information that need to be extracted from this detailed table: -- There is a row of special factors, in this case beginning with 'Carbohydrate Factor: 3.87'. This is to be extracted as a set of (name, value) pairs. The number of factors can vary; some ingredients do not have any. -- There is a table of data for various nutrients, which are arranged in groups - proximates, vitamins, lipids etc. The number of columns in the table varies from one ingredient to another, but in every case the first three columns are nutrient name, unit of measurement and quantity; we have to extract these columns for every listed nutrient. - -The requirement is to extract all this information for each ingredient, and then output it as a JSON file: -- NDB number, description and food group from the main list; -- Factor names and values from the detailed table; -- Nutrient details from the detailed table. - - -!!! First find the required data -To start, we have to locate the required data in the HTML file. The general rule about this is that there are no rules. Web site designers are concerned only with the visual effect of the page, and they can use any of the many tricks of HTML to produce the desired effects. We use the XML HTML parser to convert text into an XML tree (a tree whose nodes are XML objects). We then explore this tree to find the elements we want, and for each one we have to find signposts showing a route through the tree to uniquely locate the element, using a combination of XPath and Smalltalk programming as required. 
We may use the names or attributes of the HTML tags, each of which becomes an instance of XMLElement in the tree, or we may match against the text content of a node. - -First read in the table of ingredients (first 50 rows only) as in the url. - -[[[ -| ingredientsXML | -ingredientsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov/ndb/search/list?sort=ndb&ds=Standard+Reference'. -ingredientsXML inspect -]]] - - -You can execute the expression and inspect its result. You will obtain an inspector on the tree and you can navigate this tree as shown in Figure *@inspector1*. - -+Navigating the XML document inside the inspector.>file://figures/InspectorXML.png|width=100|label=inspector1+ - -Since you may want to work on files that you saved on your disc you can also parse a file and get an XML tree as follows: - -[[[ -| ingredientsXML | -ingredientsXML := (XMLHTMLParser onFileNamed: 'FoodsList.html') parseDocument. -]]] - -The simplest way to explore the tree is starting from the top, i.e. by opening up the ==== node, but this can be tedious and confusing; we often find that there are many levels of nested ==
== nodes before finding what we want. Alternatively, we can use XPath speculatively to look for interesting nodes. In the case of the foods list, we might guess that the list of ingredients will be stored in a ==== node. Having parsed the web page as shown above in a playground, we can then enter: -[[[ -ingredientsXML xPath: '//table' -]]] -and select 'do it and go', which shows an ==XMLNodeList== of all the table nodes - only one in this case. If there were several, we could use the attributes of the node or any of its ancestors to locate the one we want. We find by searching up several generations a node ==
== node, because there is only one ====. Now extract the text content of the four cells in each row; 'strings first' is a convenient way of finding the text in a node while ignoring any descendent nodes, and we routinely trim redundant spaces. -[[[ -ingredientCells := ingredientRows collect: - [:row| (row xPath: 'td') collect: - [ :cell| cell strings first trim]]. -]]] - -To prepare for export to JSON, it is handy to put the three required fields (ignoring the first) in a Dictionary indexed by their field names. Using an OrderedDictionary is not strictly necessary, but it does mean that the JSON output is easier for a human to understand. - -[[[ -ingredientsJSON := ingredientCells collect: - [ :row| { 'nbd_no' -> (row at: 2). - 'full-name' -> (row at: 3). - 'food-group' -> (row at: 4)} -asOrderedDictionary ]. -]]] - -If we 'do it and go' the next line, we can see the JSON layout. For this demo, we do not need to export to a JSON file; it is easier to look at it as text in the playground. - -[[[ -NeoJSONWriter toStringPretty: ingredientsJSON first. -]]] - -We can find the relative url address of the ingredient details from the href in the second cell. Because this is the address of the basic details table, we edit it to discard all the parameters, so that we can edit in the parameters for the full table. -[[[ -ingredientAddress := ingredientRows collect: - [ :row| (row xPath:'td[2]/a/@href') first value copyUpTo: $?]. -]]] - -Up to this point, we have been constructing lists with data for all 50 ingredients in the table. To show how to process the ingredient details, we just process the first ingredient in the file. The production version would have to run through all the rows in the ingredientAddress collection. We read and parse the detail file, after editing the url. - -[[[ -ingredientDetailsXML := XMLHTMLParser parseURL: 'https://ndb.nal.usda.gov', ingredientAddress first, '?format=Full'. -]]] - -The data for the factors are contained in ==== nodes within ==
== nodes. This does not identify them uniquely, so we extract all such nodes with XPath and then use ordinary Smalltalk to find the ones mentioning the word 'Factor'. - -[[[ -factorCells := (ingredientDetailsXML xPath: '//div[@class=''row'']//span') - collect: [:each| each strings first trim]. - -factors := OrderedCollection new. -1 to: factorCells size by: 2 do: [ :index| - ((factorCells at: index) matches: 'Factor') ifTrue: [factors addLast: - {'factor' -> (factorCells at: index). - 'amt' -> ((factorCells at: index + 1) trimRight:[:c|c asInteger = 160])} - asOrderedDictionary]]. -]]] - -Note: it appears that the web designers have used no-break space characters to control the formatting, and these are not removed by 'trim', so we use the 'trimRight:' clause above to remove them. - -The layout of the nutrients table is messy, presumably to achieve the effect of the inserted row with the nutrient group name. This means that we cannot locate the nutrient rows using ==
== nodes, as we did for the main list. Instead we have to get at all the individual table cells in ==== node. - -[[[ -nutrientCells := (ingredientDetailsXML xPath: '//table//td') collect: [:each|each strings first trim]. - -nutRowLength := (ingredientDetailsXML xPath: '//table/tbody/tr') first elements size. - -nutrients := OrderedCollection new. -1 to: nutrientCells size by: nutRowLength do: -[:index|nutrients addLast: - { 'group' -> (nutrientCells at: index). - 'nutrient' -> (nutrientCells at: index + 1). - 'unit' -> (nutrientCells at: index + 2). - 'per100g' -> (nutrientCells at: index + 3) } - asOrderedDictionary ]. -]]] - -Finally assemble all the information for the first ingredient as a JSON file. NeoJSON automatically takes care of embedding dictionaries within a collection within a dictionary. (See specimen in Figure *@jsonspec*) - -[[[ -NeoJSONWriter toStringPretty: - ((ingredientsJSON first) - at: 'factors' put: factors asArray; - at: 'nutrients' put: nutrients asArray; - yourself). -]]] - -+Sample of JSON output.>file://figures/JSON_Sample.png|width=100|label=jsonspec+ - -!!! Turning the pages -The code above will extract the data for one ingredient, and could obviously be repeated for all the 50 items in one page of data. However, the entire database contains 8789 ingredients at the time of writing, which amounts to 176 pages. The database seems to impose a limit of 50 ingredients per page, so to process the entire database we need to read the pages in succession. Each page contains a link which, if clicked, will load the next page. We can do this programmatically, by finding the link after processing the page. The link is contained in node ==
==, so we can use the code: - -[[[ -nextButtons := (ingredientsXML xPath: '//div[@class=''paginateButtons'']//a') - select:[:node| node strings first = 'Next']. - -nextURL := (nextButtons size > 0) - ifTrue:['https://ndb.nal.usda.gov', (nextButtons first attributeAt: 'href')] - ifFalse: [nil]. -]]] - -This is a common requirement in processing large databases on the web, and so we can use a standard pattern: - -[[[ - -nextURL := -[nextURL isNil] whileFalse: -[pageXML := XMLHTMLParser parseURL: nextURL. - - -] -]]] - -!!! Conclusion - -We have presented a way to extract information from a structured document. The methods used are of course particular to the layout of the USDA database, but the general principles should be clear. A mixture of XPath and Smalltalk can be used in order to locate the required data. - -One problem which can arise, if we need to repeat the extraction with updated data, is that the web designers can change the layout of the pages; this did in fact happen with the USDA table in the 15 months between originally tackling the problem and writing this article. The usual result is that the signposts no longer work, and the XPath results are empty. If the update is being run automatically, say on a daily basis, it may be worthwhile inserting defensive code in the processing, which will raise an exception if the results are not as expected. How to do this will depend on the particular application. diff --git a/Chapters/Scraping2.pillar b/Chapters/Scraping2.pillar deleted file mode 100644 index d47cc18..0000000 --- a/Chapters/Scraping2.pillar +++ /dev/null @@ -1,179 +0,0 @@ -!! Scraping Magic - -In this chapter we will scrape the web site of Magic: The Gathering, and in particular its card database. (Yes, I play Magic, not super well, but I have fun.) -Here is one example *http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430* as shown in Figure *@ligthouse2*. 
-Now we will show you how we explore the HTML page using the excellent Pharo inspector: diving into the tree nodes and checking their attributes or children live is simply super cool. - - -+http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430.>file://figures/arcane.png|width=80|label=ligthouse2+ - -!!! Getting a tree - -The first thing was to make sure that we can get a tree from the web page. For this task we used the ==XMLHTMLParser== class and sent it the message ==parseURL:==. How did we find this message? Simply by looking at the class-side methods of the class. -How did we find the class? By looking at the subclasses of ==XMLDOMParser==, because HTML is close to XML or the inverse :). - -[[[ -| tree | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430') -]]] - - - -!!! First the card visual - -First we would like to grab the card visual because this is fun and cool. When we open the card visual in a separate window we see that the url is *http://gatherer.wizards.com/Handlers/Image.ashx?multiverseid=389430&type=card*. Therefore we started to look for Handlers in the nodes as shown in Figure *@image0*. - - -+Exploring images.>file://figures/magic1.png|width=80|label=image0+ - -[[[ -| tree | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). -tree xpath: '//img' -]]] - - - -!!!! Not so cool but working... - -Toying with the inspector, we came up with the following ugly expression to get the name of the JPEG: - -[[[testcase=true -| tree | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). -((tree xpath: '//img') third @ 'src') first value allButFirst: 5 ->>> 'Handlers/Image.ashx?multiverseid=389430&type=card' -]]] - -Ugly, isn't it? This happens often when scraping HTML, but we can do better. 
-By the way, note that we started to enter XPath expressions directly in the XPath pane, using the do it and go facilities of the inspector. -This way we do not have to fetch the page from the internet all the time. - - -!!! Revisiting it - -We could not really show you such ugly expressions so we had to find a better one. - -So first we look at the img elements that have a src attribute, as shown below and in Figure *@image1*. -[[[ -| tree | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). -(tree xpath: '//img[@src]') -]]] - -+Exploring images.>file://figures/magic2.png|width=80|label=image1+ - -Then as shown in Figure *@image2* we inspected the right node. - -+Narrowing the node.>file://figures/magic3.png|width=80|label=image2+ - -Finally since we were on this exact node, we looked in its class to see if we could get an API to get the attribute in a nice way as shown in Figure *@image3*. - -+Exploring the class API on the spot: looking to see if there is an attribute something method.>file://figures/magic4.png|width=80|label=image3+ - -[[[ -| tree | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). -(tree xpath: '//img[@src]') third attributeAt: 'src' -]]] - -Now that we have the visual path, we can use the HTTP client of Pharo to get the image as shown in Figure *@zinc*. - -[[[ -| tree path | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). -path := ((tree xpath: '//img[@src]') third attributeAt: 'src') allButFirst: 5. -(ZnEasy getJpeg: 'http://gatherer.wizards.com/',path) asMorph openInWorld -]]] - -+Getting the card visual inside Pharo.>file://figures/magic5.png|width=100|label=zinc+ - -!!! 
Getting data - -Since this web page is probably generated, we look for example for the artist string in the source and we found the following matches: - -[[[ -ClientIDs.artistRow = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_artistRow'; -]]] - -This one is more interesting: - -[[[ -
-
- Artist:
-
-]]] - -We can build queries to identify node elements having this id. -To avoid performing an internet request each time, we typed the XPath expressions directly in the XPath pane of the inspector as shown in Figure *@row*. -To go faster, we then looked at all the elements with class="row". - -[[[ -//div[@class='row'] -]]] - -+Getting the card information.>file://figures/magic6.png|width=80|label=row+ - -The following expression returns the label and value pairs, for example the card name label and its value. - -[[[ -//div[@class='row']/div[@class='label']| //div[@class='row']/div[@class='value'] -]]] - -So we can now query all the fields: - -[[[testcase=true -| tree container | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). -container := tree xpath: '//div[@class=''row'']/div[@class=''label'']| //div[@class=''row'']/div[@class=''value'']'. -container collect: [ :each | each contentString trimBoth ]. ->>> a XMLOrderedList('Card Name:' 'Arcane Lighthouse' 'Types:' 'Land' 'Card Text:' -': Add to your mana pool. , : Until end of turn, creatures your opponents control -lose hexproof and shroud and can''t have hexproof or shroud.' -'Expansion:' 'Commander 2014' 'Rarity:' 'Uncommon' 'Card Number:' '59' 'Artist:' 'Igor Kieryluk') -]]] - -Now we can convert this into a dictionary: - -[[[ -| tree container | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). -container := tree xpath: '//div[@class=''row'']/div[@class=''label'']| //div[@class=''row'']/div[@class=''value'']'. -((container collect: [ :each | each contentString trimBoth ]) - asOrderedCollection groupsOf: 2 atATimeCollect: [ :x :y | x -> y]) asDictionary -]]] - - -And convert it into JSON for fun: - -[[[testcase=true -| tree dict container | -tree := (XMLHTMLParser parseURL: 'http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430'). 
-container := tree xpath: '//div[@class=''row'']/div[@class=''label'']| //div[@class=''row'']/div[@class=''value'']'. -dict := ((container collect: [ :each | each contentString trimBoth ]) - asOrderedCollection groupsOf: 2 atATimeCollect: [ :x :y | x -> y]) asDictionary. - -NeoJSONWriter toStringPretty:dict ->>> - -'{ - "Card Number:" : "59", - "Card Name:" : "Arcane Lighthouse", - "Artist:" : "Igor Kieryluk", - "Types:" : "Land", - "Card Text:" : ": Add to your mana pool. , : Until end of turn, creatures your opponents control lose - hexproof and shroud and can''t have hexproof or shroud.", - "Expansion:" : "Commander 2014", - "Rarity:" : "Uncommon" -}' -]]] - -Now we can apply the same technique to access all the cards, and also to visit different pages to extract all the unique card ids and query the database. -But this is left as an exercise. - -!!! Conclusion - -We showed you how to access a page and navigate through it interactively using XPath and the live programming features of Pharo. -This chapter should show the great value of being able to tweak a document live and navigate it to find the information you really want. diff --git a/Chapters/XPath.pillar b/Chapters/XPath.pillar deleted file mode 100644 index 085407c..0000000 --- a/Chapters/XPath.pillar +++ /dev/null @@ -1,502 +0,0 @@ -!! Little Journey into XPath - -XPath is the de facto standard language for navigating an XML document and selecting nodes from it. XPath expressions act as queries that identify nodes. In this chapter we will go through the main concepts and show some of the ways we can access nodes in an XML document. All the expressions can be executed on the spot, so do not hesitate to experiment with them. - -!!! Getting started - -You should load the XML parser and XPath library as follows: -[[[ -Metacello new - baseline: 'XMLParserHTML'; - repository: 'github://pharo-contributions/XML-XMLParserHTML:1.6.x/src'; - load. 
-]]] - -[[[ -Metacello new - baseline: 'XPath'; - repository: 'github://pharo-contributions/XML-XPath:2.2.x/src'; - load. -]]] - - -!!! An example - -As an example we will take the possible representation of Magic cards, starting with the - Arcane Lighthouse that you can view at *http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430* -and is shown in Figure *@ligthouse*. - -+http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=389430.>file://figures/lighthouse.png|width=100|label=ligthouse+ - -[[[ - - - - - Arcane Lighthouse - Land - 2014 - Uncommon - Commander 2014 - Tap: Add 1 uncolor to you mana pool. - 1 uncolor + Tap: Until end of turn, creatures your opponents - control lose hexproof and shroud and can't have - hexproof or shroud. - - -]]] - -!!! Creating a tree of objects - -In Pharo it is always powerful to get an object and interact with it. -So let us do that now using the ==XMLDOMParser== to convert our data in a tree of objects (as shown in Figure *@inspectorx*). -Note that the escaped the =='== with an extra quote as in ==can\'\'t==. - -[[[ - | tree | - tree := (XMLDOMParser on: - ' - - - - Arcane Lighthouse - Land - 2014 - Uncommon - Commander 2014 - Tap: Add 1 uncolor to you mana pool. - 1 uncolor + Tap: Until end of turn, creatures your opponents - control lose hexproof and shroud and can''t have - hexproof or shroud. - - ') parseDocument -]]] - - -+Grabbing and playing with a tree.>file://figures/xpath1.png|width=100|label=inspectorx+ - -!!! Nodes, node sets and atomic values -We will be working with three kinds of XPath constructs: nodes, node sets, and atomic values. - -Node sets are sets (duplicate-free collections) of nodes. All node sets produced by XPath location path expressions are sorted in document order, the order in the document source that they appear in. 
- -The following elements are nodes: -[[[ - (root element node) - -Arcane Lighthouse (element node) - -lang="en" (attribute node) -]]] - -Atomic values are strings, numbers, and booleans. Here are some examples of atomic values: - -[[[ -Arcane Lighthouse - -"en" - -2.5 - --1 - -true - -false -]]] - - -!!! Basic tree relationships - -Since we are talking about trees, nodes can have multiple relationships with each other: parent, child and siblings. -Let us set some simple vocabulary. - -- ""Parent."" Each node can have at most one parent. The root node of the tree, usually a document, has no parent. In the Arcane Lighthouse example, the card element is the parent of the cardname, types, year, rarity, expansion and cardtext elements. In XPath, attribute and namespace nodes treat the element they belong to as their parent. - -- ""Children."" Document and element nodes may have zero, one or more children, which can be elements, text nodes, comments or processing instructions. The cardname, types, year, rarity, expansion and cardtext elements are all children of the card element. Confusingly, even though attribute and namespace nodes can have element parents in XPath, they aren't children of their parent elements. - -- ""Siblings."" Siblings are child nodes that have the same parent. The cardname, types, year, rarity, expansion and cardtext elements are all siblings. Attribute and namespace nodes have no siblings. - -- ""Ancestors."" A node's parent, parent's parent, etc. Ancestors of the cardname element are the card and cardset elements. - -- ""Descendants."" A node's children, children's children, etc. Descendants of the cardset element are the card, cardname, types, year, rarity, expansion and cardtext elements. - - -!!! A large example - -Let us expand our example to cover more cases. - -[[[ - - | tree | - tree := (XMLDOMParser on: - ' - - - - Arcane Lighthouse - Land - 2014 - Uncommon - Commander 2014 - Tap: Add 1 uncolor to you mana pool. 
- 1 uncolor + Tap: Until end of turn, creatures your opponents - control lose hexproof and shroud and can''t have - hexproof or shroud. - - - Desolate Lighthouse - Land - 2013 - Rare - Avacyn Restored - Tap: Add Colorless to your mana pool. - 1BlueRed, Tap: Draw a card, then discard a card. - - ') parseDocument -]]] - -+Select the raw tab and click on self in the inspector.>file://figures/xpath2.png|width=100|label=inspector2+ - -Select the raw tab and click on self in the inspector (as shown in Figure *@inspector2*). Now we are ready to learn XPath. - -!!! Node selection - -The following table shows the basic XPath expressions. The current node is often also called the context. - -| ""Expression"" |""Description""| -| nodename | Selects all child nodes with the name "nodename" | -|/ |Selects the root node| -|// | Selects nodes anywhere below the current node that | -| | match the selection| -| . | Selects the context (current) node | -|..|Selects the parent of the context (current) node | -|@ |Selects attributes of the context node | - - -In the following we expect that the variable ==tree== is bound to the full document tree we previously created by parsing the XML string. -Location path expressions return node sets, which are empty if no nodes match. Now let us play with the system to really see how it works. - -!!!! Node tag name selection - -There are several ways to test and select nodes. 
- -| ""nodename"" | Selects all child nodes with the name "nodename" | -| card | Selects all child nodes with the name "card" | -| ""prefix:localName"" | Selects all child nodes with the qualified | -| | name "prefix:localName" or if at least one prefix -| | or namespace URI pair was declared in the | -| | XPathContext, the child nodes with the local name | -| | "localName" and the namespace URI bound to "prefix"| - -In standard XPath, qualified name tests like prefix:localName select nodes with the same local name and the namespace URI of the prefix, which must be declared in the controlling XPath context prior to evaluation. The selected nodes from the document can have different prefixes (or none at all), because matching is based on local name and namespace URI. - -To simplify things, the Pharo XPath library (unlike others) by default matches qualified name tests against the literal qualified names of nodes, ignoring namespace URIs completely, and does not require you to pre-declare namespace prefix/URI pairs in the XPathContext object before evaluation. Declaring at least one namespace prefix/URI pair will trigger standard behavior, where all prefixes used in qualified name tests must be pre-declared, and matching will be done based on local names and namespace URIs. - -!!!! Context and parent - -| . | Selects the current context node | -|..|Selects the parent of the current context node | - -The following expression shows that ==.== (period) selects the context node, initially the node XPath evaluation begins in. - -[[[testcase=true -(tree xpath: '.') first == tree ->>> true -]]] - - - -!!!! Matching path-based child nodes - -The operator ==/== selects from the root node. - -| ""/"" | ""Selects from the root node""| -| /cardset | Selects the root element cardset | -| cardset/card | Selects all the card grandchildren| -| | from the cardset children of the context node | - -The following expression selects all the card nodes under cardset node. 
- -[[[ -path := XPath for: '/cardset/card'. -path in: tree. -]]] - -==XPath== objects lazily compile their source to an executable form the first time they're evaluated, and the compiled form and its source are cached globally, so caching the ==XPath== object itself in a variable is normally unnecessary to avoid recompilation and is only slightly faster. The previous expression is equivalent to the following expression using the ==xpath:== message. - -[[[ -tree xpath: '/cardset/card' -]]] - - - -!!!! Matching deep nodes - -The ==//== operator selects all the nodes matching the selection. - - -| ""//"" | Selects from the context (current) node and all descendants | -| //year |Selects all year node children of the context node and | -| | of its descendants | -| cardset//year | Selects all year node children of the cardset context | -| | node children and their descendants | - -Let us try with another element such as the expansion of a card. -[[[testcase=true -tree xpath: '//expansion' ->>> -a XPathNodeSet(Commander 2014 Avacyn Restored) -]]] - -The XPath library extends ==XMLNode== classes with binary selectors to encode certain XPath expressions directly in Pharo. So the previous expression can be expressed as follows using the message ==//==: - -[[[testcase=true -tree // 'expansion' ->>> -a XPathNodeSet(Commander 2014 Avacyn Restored) -]]] - - -!!!! Identifying attributes - -==@== matches attributes. - -| ""Expression"" |""Description""| -|@ |Selects attributes| -|//@lang | Selects all attributes that are named lang| - - - -The following expression returns all the attributes whose name is ==lang==. -[[[testcase=true -(tree xpath: '//@lang') ->>> a XPathNodeSet(lang=""en"" lang=""en"") -]]] - - - -!!! Predicates - -Predicates are used to find a specific node or a node that contains a specific value. Predicates are always embedded in square brackets. - -Let us study some examples: - -!!!! 
First element - -The following expression selects the first card child of the cardset element. -[[[testcase=true -tree xpath: '/cardset/card[1]' ->>> -a XPathNodeSet( - Arcane Lighthouse - Land - 2014 - Uncommon - Commander 2014 - Tap: Add 1 uncolor to you mana pool. - 1 uncolor + Tap: Until end of turn, creatures your opponents - control lose hexproof and shroud and can't have - hexproof or shroud. - ) -]]] - -In the XPath Pharo implementation, the message ==??== can be used for position or block predicates. - -The previous expression is equivalent to the following one: - -[[[ -tree / 'cardset' / ('card' ?? 1) . -]]] - - -Block or position predicates can be applied with ==??== to axis node test arguments or to result node sets. - -The following expression returns the first element of each 'card' descendant: - -[[[testcase=true -tree // 'card' / ('*' ?? 1) ->>> "a XPathNodeSet(Arcane Lighthouse Desolate Lighthouse)" -]]] - -!!!! Other position functions - -The following expression selects the last card node that is the child of the cardset node. - -[[[ -tree xpath: '/cardset/card[last()]'. -]]] - -The following selects the second to last node. In our case, since we only have two elements, we get the first. - -[[[testcase=true -tree xpath: '/cardset/card[last()-1]'. ->>> -a XPathNodeSet( - Arcane Lighthouse - Land - 2014 - Uncommon - Commander 2014 - Tap: Add 1 uncolor to you mana pool. - 1 uncolor + Tap: Until end of turn, creatures your opponents - control lose hexproof and shroud and can't have - hexproof or shroud. - ) -]]] - -We can also use the position function and use it to identify nodes. The following selects the first two card nodes that are children of the cardset node. - -[[[testcase=true -(tree xpath: '/cardset/card[position()<3]') size = 2 ->>> true -]]] - - -!!!! Selecting based on node value - -In addition we can select nodes based on the value of a node. The following query selects all the card nodes (of the cardset) that have a year greater than 2013. 
- -[[[ -tree xpath: '/cardset/card[year>2013]'. -]]] - -The following query selects all the cardname nodes of the card children of cardset that have a year greater than 2013. - -[[[testcase=true -tree xpath: '/cardset/card[year>2013]/cardname' ->>> a XPathNodeSet(Arcane Lighthouse) -]]] - -!!!! Selecting nodes based on attribute value - -We can also select nodes based on the existence or value of an attribute. -The following expressions return the cardname nodes that have a lang attribute, first regardless of its value and then only when its value is 'en'. -[[[testcase=true -tree xpath: '//cardname[@lang]' ->>> a XPathNodeSet(Arcane Lighthouse Desolate Lighthouse) -tree xpath: '//cardname[@lang="en"]' -]]] - -Note that we can simply get the card from the name using '..'. - -[[[testcase=true -tree xpath: '//cardname[@lang="en"]/..' ->>> -]]] - -!!! Selecting Unknown Nodes - -In addition, we can use wildcards to select any node. - -| ""Wildcard""| ""Description"" -|* |Matches any element node| -|@* |Matches any attribute node| -|node() | Matches any node of any kind| - -For example, ==//*== selects all elements in a document. - -[[[testcase=true -(tree xpath: '//*') size ->>> 15 -]]] - -Similarly, ==//@*== selects all the attributes of any node. - -[[[testcase=true -tree xpath: '//@*' ->>> a XPathNodeSet(lang=""en"" lang=""en"") - -]]] - -For example, ==//cardname[@*]== selects all cardname elements which have at least one attribute of any kind. -[[[testcase=true -tree xpath: '//cardname[@*]' ->>> a XPathNodeSet(Arcane Lighthouse Desolate Lighthouse) -]]] -The following expression selects all child nodes of cardset. -[[[ -tree xpath: '/cardset/*'. -]]] - -The following expression selects the cardname of every child node of cardset. -[[[ -tree xpath: '/cardset/*/cardname'. -]]] - -!!! Handling multiple queries - -By using the ==|== union operator in an XPath expression you can select several paths. -The following expression selects both the cardname and year of card nodes located anywhere in the document.
- -[[[testcase=true -tree xpath: '//card/cardname | //card//year' ->>> a XPathNodeSet(Arcane Lighthouse 2014 -Desolate Lighthouse 2013) -]]] - -!!! XPath axes - -XPath introduces another way to select nodes using ''location steps'', which follow the syntax: ==axisname::nodetest[predicate]==. -Such expressions can be used in the steps of location paths (see below). - -An axis defines a node-set relative to the context (current) node. Here is a table of the available axes. -Except for the namespace axis, all of these have binary selector equivalents. - -|""AxisName""| ""Result"" -|ancestor | Selects all context (current) node ancestors -|ancestor-or-self | Selects all context node ancestors and the context node itself -|attribute |Selects all context (current) node attributes -|child |Selects all context (current) node children -|descendant | Selects all context node descendants -|descendant-or-self | Selects all context node descendants and the context node itself -|following |Selects everything after the context node closing tag -|following-sibling |Selects all siblings after the context node -|namespace | Selects all context node namespace nodes -|parent | Selects context node parent -|preceding |Selects all nodes that appear before the context node -| | except ancestors, attribute nodes and namespace nodes -|preceding-sibling | Selects all siblings before the context node -|self | Selects the context node - - -!!!! Paths - -A location path can be absolute or relative. An absolute location path starts with a slash ( / ) (/step/step/...) and a relative location path does not (step/step/...). In both cases the location path consists of one or more location steps, each separated by a slash. - -Each step is evaluated against the nodes in the context node-set.
-A location step, ==axisname::nodetest[predicate]==, consists of: -- an axis (defines the tree-relationship between the selected nodes and the context node) -- a node-test (identifies a node within an axis) -- zero or more predicates (to further refine the selected node-set) - -The following example accesses the year node of all the children of the cardset. -[[[testcase=true -tree xpath: '/cardset/child::node()/year'. ->>> a XPathNodeSet(2014 2013) -]]] - -The following expression takes the first year node, gets its card ancestor and selects the cardname. -[[[testcase=true -(tree xpath: '/cardset/card/year') first xpath: 'ancestor::card/cardname' ->>> "a XPathNodeSet(Arcane Lighthouse)" -]]] - -The previous expression could be rewritten using a position predicate. Parentheses are needed so that the predicate applies to the entire node set produced by the absolute location path rather than just to the last step. Without them, the expression would select the first year of each card instead of the first year overall: -[[[testcase=true -tree xpath: '(/cardset/card/year)[1]/ancestor::card/cardname' ->>> "a XPathNodeSet(Arcane Lighthouse)" -]]] - - - - - - -!!! Conclusion - -XPath is a powerful language. The Pharo XPath library developed and maintained by Monty van OS and the Pharo Extras Team implements the full XPath 1.0 standard. Coupled with the live programming capabilities of Pharo, it offers a powerful way to explore structured XML data. - - - - diff --git a/index.pillar b/index.md similarity index 83% rename from index.pillar rename to index.md index 310cbaf..a7873d0 100644 --- a/index.pillar +++ b/index.md @@ -13,10 +13,10 @@ We revised a new version for Pharo 70 and 80. The libraries are now hosted on gi Special thanks to Torsten Bergman for the migration of the libraries on github. A final point, the website originally used does not exist anymore.
At that time we archived some files under -*https://github.com/SquareBracketAssociates/Booklet-Scraping/tree/master/resources* +[https://github.com/SquareBracketAssociates/Booklet-Scraping/tree/master/resources]() Now you may try to use the web archive. Here is a reference that seems to work. -*https://web.archive.org/web/20150324141455/http://ndb.nal.usda.gov/ndb/foods?format=&count=&max=35&sort=&fgcd=&manu=&lfacet=&qlookup=&offset=140&order=desc*. +[https://web.archive.org/web/20150324141455/http://ndb.nal.usda.gov/ndb/foods?format=&count=&max=35&sort=&fgcd=&manu=&lfacet=&qlookup=&offset=140&order=desc](). We are sorry to see all our efforts impacted by such changes but we cannot do magic. @@ -24,6 +24,6 @@ We are sorry to see all our efforts impacted by such changes but we cannot do ma Stef -${inputFile:path=Chapters/XPath.pillar}$ -${inputFile:path=Chapters/Scraping.pillar}$ -${inputFile:path=Chapters/Scraping2.pillar}$ + + + diff --git a/pillar.conf b/pillar.conf index 147c084..1d01463 100644 --- a/pillar.conf +++ b/pillar.conf @@ -5,7 +5,7 @@ "series": "The Pharo Technology Booklet Collection — edited by S. Ducasse", "keywords": "HTML, XML, Scrapping, XPath", "newLine": #unix, - "tocFile":"scrapingbook.pillar", - "latexWriter": #'latex:sbabook', + "tocFile":"index.md", + "latexWriter": #'miclatex:sbabook', "htmlWriter": #html } \ No newline at end of file
== nodes, and then count them in groups equal to the row length. Since the row length is not a constant, we have to determine it by examining the data for one row that *is* in a ==