Steve edited this page Sep 21, 2017 · 38 revisions

Here are some examples of how wptools is being used.

Get an article extract

The get_query() method fetches two extracts of the article lead: light HTML (extract) and Markdown text (extext).

>>> page = wptools.page('Ella Fitzgerald')

>>> page.get_query()
en.wikipedia.org (query) Ella Fitzgerald
en.wikipedia.org (imageinfo) File:Ella Fitzgerald (Gottlieb 02871...
Ella Fitzgerald (en) data
{
  extext: <str(2002)> **Ella Jane Fitzgerald** (April 25, 1917 – J...
  extract: <str(2067)> <p><b>Ella Jane Fitzgerald</b> (April 25, 1...
  ...
}

Compare to RESTBase extracts:

>>> page.get_restbase('summary')
en.wikipedia.org (restbase) /page/summary/Ella Fitzgerald
Ella Fitzgerald (en) data
{
  exhtml: <str(1455)> <p><b>Ella Jane Fitzgerald</b> (April 25, 19...
  exrest: <str(1424)> Ella Jane Fitzgerald (April 25, 1917 – June ...
  ...
}
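
The four extract fields shown above (extext and extract from get_query(), exrest and exhtml from get_restbase('summary')) can be chosen between with a small helper. What follows is a minimal sketch, not part of wptools: first_extract is a hypothetical name, the preference order is an assumption, and the sample dict is a hand-shortened stand-in for page.data.

```python
def first_extract(data, prefer_html=False):
    """Return (key, value) for the first non-empty extract in a dict.

    The key names follow the output shown above. The order in which
    they are tried is an assumption made for this sketch.
    """
    text_keys = ['extext', 'exrest']
    html_keys = ['extract', 'exhtml']
    keys = html_keys + text_keys if prefer_html else text_keys + html_keys
    for key in keys:
        value = data.get(key)
        if value:
            return key, value
    return None, None


# Stand-in for page.data; values are shortened by hand for illustration.
sample = {
    'extract': '<p><b>Ella Jane Fitzgerald</b></p>',
    'extext': '**Ella Jane Fitzgerald**',
}

print(first_extract(sample))
print(first_extract(sample, prefer_html=True))
```

With a real page, the same function could be called as first_extract(page.data) after get_query() or get_restbase('summary').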

Get a representative image

A representative image for a page can come from the MediaWiki API, from an Infobox, from Wikidata property P18, or from RESTBase. See the Images wiki page for details.

>>> page = wptools.page('Frida Kahlo')

>>> page.get_query()
en.wikipedia.org (query) Frida Kahlo
en.wikipedia.org (imageinfo) File:Frida Kahlo, by Guillermo Kahlo...
Frida Kahlo (en) data
{
  image: <list(2)> {'kind': 'query-pageimage', u'descriptionshortu...
  ...
}

>>> page.pageimage()
['query-pageimage', 'query-thumbnail']

>>> page.pageimage('page')['url']
u'https://upload.wikimedia.org/wikipedia/commons/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'

>>> page.pageimage('thumb')['url']
u'https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
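
Selecting an image by kind, as pageimage('page') and pageimage('thumb') do above, amounts to a lookup over the image list. The sketch below illustrates that idea only. find_image is a hypothetical helper rather than the wptools implementation, and matching on a substring of 'kind' is an assumption. The sample list keeps just the 'kind' and 'url' fields seen in the output.

```python
def find_image(images, token):
    """Return the first image dict whose 'kind' contains token, else None.

    Substring matching is an assumption made for this sketch.
    """
    for image in images:
        if token in image.get('kind', ''):
            return image
    return None


# Stand-in for page.data['image'], reduced to two fields per entry.
images = [
    {'kind': 'query-pageimage',
     'url': 'https://upload.wikimedia.org/wikipedia/commons/0/06/'
            'Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'},
    {'kind': 'query-thumbnail',
     'url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/'
            'Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/'
            '160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'},
]

thumb = find_image(images, 'thumb')
print(thumb['url'] if thumb else 'no thumbnail found')
```

Returning None for a missing kind lets the caller fall back to another source (Infobox, Wikidata, RESTBase) instead of raising an error.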


Get page HTML

The most performant way to get article HTML is via RESTBase.

>>> page = wptools.page('Buddha')

>>> page.get_restbase('html')
en.wikipedia.org (restbase) /page/html/Buddha
Buddha (en) data
{
  html: <str(628054)> <!DOCTYPE html><html prefix="dc: http://purl...
}

Get Infobox data

Getting data from Infoboxes may be unavoidable, but getting it from Wikidata (via get_wikidata()) is preferred. Wikidata is structured but sometimes data-poor, while Infoboxes are unstructured and frequently data-rich. If the information you want is only available in a MediaWiki instance, please consider adding it to Wikidata so that others may benefit from open linked data.

>>> page = wptools.page('Fela Kuti')

>>> page.get_parse()
en.wikipedia.org (parse) Fela Kuti
en.wikipedia.org (imageinfo) File:Fela Kuti.jpg
Fela Kuti (en) data
{
  infobox: <dict(17)> website, associated_acts, death_place, image...
  ...
}

>>> page.data['infobox']['instrument']
'Saxophone, vocals, keyboards, trumpet, guitar, drums'
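
Because infobox values are unstructured strings, a list such as the one above has to be split by the caller. The following is a minimal sketch under the assumption that the value is plain comma-separated text. split_infobox_list is a hypothetical helper, and real infobox values may contain wiki markup or other separators that it does not handle.

```python
def split_infobox_list(value):
    """Split a comma-separated infobox value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(',') if item.strip()]


# The string is the 'instrument' value shown in the session above.
instruments = split_infobox_list(
    'Saxophone, vocals, keyboards, trumpet, guitar, drums')
print(instruments)
```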

Get cover images

Most media (album, book, film, etc.) cover images on Wikipedia appear in an Infobox. For convenience, we put "cover" files from infoboxes in the image attribute.

>>> page = wptools.page('Blue Train (album)')

>>> page.get_parse()
en.wikipedia.org (parse) Blue Train (album)
en.wikipedia.org (imageinfo) File:John Coltrane - Blue Train.jpg
Blue Train (album) (en) data
{
  image: <list(1)> {'kind': 'parse-cover', u'descriptionshorturl':...
  infobox: <dict(16)> Name, Language, Artist, Cover, Recorded, Lab...
  ...
}

>>> page.pageimage()
['parse-cover']

>>> page.pageimage('cover')['url']
u'https://upload.wikimedia.org/wikipedia/en/6/68/John_Coltrane_-_Blue_Train.jpg'


Get Wikidata

Resolved properties and claims are stored in the wikidata attribute. Wikidata properties are selected by wptools.wikidata.LABELS. Properties (e.g. P17 "country") are stored in properties, and those properties that have Wikidata items for values (e.g. Q142 "France") are stored in claims and resolved by another Wikidata API call. See the Wikidata page in our wiki for more details.

>>> page = wptools.page('Stephen Fry')

>>> page.get_wikidata()
www.wikidata.org (wikidata) Stephen Fry
www.wikidata.org (claims) Q8817795|Q5|Q7066|Q145
en.wikipedia.org (imageinfo) File:Stephen Fry cropped.jpg
Stephen Fry (en) data
{
  aliases: <list(1)> Stephen John Fry
  claims: <dict(4)> Q8817795, Q5, Q7066, Q145
  description: English comedian, actor, writer, presenter, and activist
  image: <list(1)> {'kind': 'wikidata-image', u'descriptionshortur...
  label: Stephen Fry
  modified: <dict(1)> wikidata
  pageid: 191035
  properties: <dict(8)> P135, P345, P910, P27, P856, P569, P18, P31
  title: Stephen_Fry
  what: human
  wikibase: Q192912
  wikidata: <dict(8)> website, category, citizenship, image, insta...
  wikidata_url: https://www.wikidata.org/wiki/Q192912
}
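
The split between properties and claims can be pictured as collecting every property value that is itself a Wikidata item ID and joining those IDs for a second request, as in the "(claims) Q8817795|Q5|Q7066|Q145" line above. The sketch below is only an illustration of that step. item_ids is a hypothetical helper, and the shape of the sample properties dict (plain strings keyed by property ID) is an assumption that may not match page.data['properties'].

```python
import re

# Wikidata item IDs have the form 'Q' followed by digits.
QID = re.compile(r'^Q\d+$')


def item_ids(properties):
    """Collect, in order and without duplicates, values that look like item IDs."""
    found = []
    for value in properties.values():
        values = value if isinstance(value, list) else [value]
        for candidate in values:
            if (isinstance(candidate, str) and QID.match(candidate)
                    and candidate not in found):
                found.append(candidate)
    return found


# Hand-written stand-in; the real structure may differ.
properties = {
    'P31': 'Q5',                       # item value, needs resolving
    'P27': 'Q145',                     # item value, needs resolving
    'P18': 'Stephen Fry cropped.jpg',  # literal value, used as-is
}

ids = item_ids(properties)
print('|'.join(ids))
```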

Extend Wikidata claims

If the predefined wptools.wikidata.LABELS do not include something you want resolved from a claim, you can simply add your property labels via update_labels():

>>> page = wptools.page('Simone de Beauvoir')

>>> page.update_labels({'P21': 'gender'})

>>> page.get_wikidata()
www.wikidata.org (wikidata) Simone de Beauvoir
www.wikidata.org (claims) Q142|Q5|Q3411417|Q859773|Q38066|Q151578...
en.wikipedia.org (imageinfo) File:Simone de Beauvoir.jpg
Simone de Beauvoir (en) data
{
  wikidata: <dict(10)> category, death, citizenship, gender, image...
  ...
}

>>> page.data['wikidata']['gender']
u'female'
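
The effect of adding a label can be sketched as a dictionary lookup: only properties whose ID has a label end up under a readable name. This is an illustration, not the wptools code. apply_labels is a hypothetical helper, default_labels is a two-entry stand-in for wptools.wikidata.LABELS, and the sample values are already-resolved strings chosen for the example.

```python
def apply_labels(properties, labels):
    """Rename property IDs to their labels, skipping IDs that have no label."""
    return {labels[pid]: value
            for pid, value in properties.items()
            if pid in labels}


# Stand-in for the predefined labels (assumed subset, not the real mapping).
default_labels = {'P27': 'citizenship', 'P18': 'image'}

# Equivalent in spirit to page.update_labels({'P21': 'gender'}).
labels = dict(default_labels)
labels.update({'P21': 'gender'})

# Hand-written sample of resolved property values.
properties = {'P21': 'female', 'P27': 'France'}

print(apply_labels(properties, default_labels))
print(apply_labels(properties, labels))
```

Without the extra label, P21 is dropped. With it, the value appears under 'gender', matching the lookup in the session above.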

Get all the page info

Simply calling get() on a page will automagically fetch extracts, images, infobox data, Wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.

>>> page = wptools.page('Gandhi').get()
en.wikipedia.org (query) Gandhi
en.wikipedia.org (parse) 19379
www.wikidata.org (wikidata) Q1001
www.wikidata.org (claims) Q6581097|Q5|Q129286|Q6512732|Q668
en.wikipedia.org (restbase) /page/summary/Mahatma_Gandhi
en.wikipedia.org (imageinfo) File:Portrait Gandhi.jpg|File:MKGandhi.jpg
Mahatma Gandhi (en) data
{
  aliases: <list(10)> M K Gandhi, Mohandas Gandhi, Bapu, Gandhi, M...
  claims: <dict(5)> Q6581097, Q5, Q129286, Q6512732, Q668
  description: <str(67)> pre-eminent leader of Indian nationalism ...
  exhtml: <str(1064)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b>...
  exrest: <str(896)> Mahātmā Mohandas Karamchand Gandhi (; Hindust...
  extext: <str(2985)> Mahātmā **Mohandas Karamchand Gandhi** (; Hi...
  extract: <str(3212)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b...
  image: <list(6)> {'kind': 'query-pageimage', u'descriptionshortu...
  infobox: <dict(25)> known_for, other_names, image, signature, bi...
  label: Mahatma Gandhi
  length: 264,127
  links: <list(10)> https://biblio.wiki/wiki/Mohandas_K._Gandhi, h...
  modified: <dict(2)> wikidata, page
  pageid: 19379
  parsetree: <str(333405)> <root><template><title>Redirect</title>...
  properties: <dict(8)> P345, P910, P27, P21, P569, P18, P31, P570
  random: Pukara (Moquegua)
  title: Mahatma_Gandhi
  url: https://en.wikipedia.org/wiki/Mahatma_Gandhi
  url_raw: https://en.wikipedia.org/wiki/Mahatma_Gandhi?action=raw
  watchers: 1,733
  what: human
  wikibase: Q1001
  wikidata: <dict(8)> category, death, citizenship, gender, image,...
  wikidata_url: https://www.wikidata.org/wiki/Q1001
  wikitext: <str(262663)> {{Redirect|Gandhi}}{{pp-move-indef}}{{pp...
}
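
The listings on this page abbreviate long values as <str(N)>, <list(N)>, or <dict(N)> followed by a truncated preview. The sketch below approximates that style for your own data. It is an assumption-laden imitation, not the formatting code wptools uses: summarize is a hypothetical helper and the width of 60 characters is arbitrary.

```python
def summarize(value, width=60):
    """Render a value as a short one-line preview with a type/size tag."""
    if isinstance(value, dict):
        text = '<dict(%d)> %s' % (len(value), ', '.join(str(k) for k in value))
    elif isinstance(value, list):
        text = '<list(%d)> %s' % (len(value), ', '.join(str(v) for v in value))
    elif isinstance(value, str) and len(value) > width:
        text = '<str(%d)> %s' % (len(value), value)
    else:
        text = str(value)
    # Truncate the preview so every entry fits on one line.
    if len(text) > width:
        text = text[:width] + '...'
    return text


# Hand-written stand-in for a few page.data entries.
sample = {
    'what': 'human',
    'modified': {'wikidata': None, 'page': None},
    'aliases': ['M K Gandhi', 'Mohandas Gandhi', 'Bapu'],
}

for key in sorted(sample):
    print('  %s: %s' % (key, summarize(sample[key])))
```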

You can also call get_more() to get further page data, such as page files, categories, languages, contributors, and average daily views. This requires a more expensive (slower) query:

>>> page.get_more()
en.wikipedia.org (querymore) Gandhi
Mahatma Gandhi (en) data
{
  categories: <list(67)> Category:1869 births, Category:1948 death...
  contributors: 2,608
  files: <list(52)> File:Aum Om red.svg, File:Commons-logo.svg, Fi...
  languages: <list(167)> {u'lang': u'af', u'title': u'Mahatma Gand...
  modified: <dict(1)> page
  pageid: 19379
  title: Mahatma Gandhi
  views: 21,603
}
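
The categories and files lists carry their namespace prefix ("Category:", "File:"). If bare names are wanted, the prefix can be stripped after the fact. This is a minimal sketch. strip_namespace is a hypothetical helper, and the sample lists contain only titles that appear in full in the output above.

```python
def strip_namespace(titles, namespace):
    """Return titles in the given namespace with the 'Namespace:' prefix removed."""
    prefix = namespace + ':'
    return [title[len(prefix):] for title in titles if title.startswith(prefix)]


# Stand-ins for page.data['categories'] and page.data['files'].
categories = ['Category:1869 births']
files = ['File:Aum Om red.svg', 'File:Commons-logo.svg']

print(strip_namespace(categories, 'Category'))
print(strip_namespace(files, 'File'))
```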