Examples
- Get an article extract
- Get a representative image
- Get page HTML
- Get Infobox data
- Get cover images
- Get Wikidata
- Minimize Wikidata requests
- Get all the page info
- Get category members
- Get site info
- List most popular articles
The get_query() method gets (light) HTML and (Markdown) text extracts.
>>> page = wptools.page('Ella Fitzgerald')
>>> page.get_query()
en.wikipedia.org (query) Ella Fitzgerald
en.wikipedia.org (imageinfo) File:Ella Fitzgerald (Gottlieb 02871...
Ella Fitzgerald (en) data
{
extext: <str(2002)> **Ella Jane Fitzgerald** (April 25, 1917 – J...
extract: <str(2067)> <p><b>Ella Jane Fitzgerald</b> (April 25, 1...
...
}
Compare to RESTBase extracts:
>>> page.get_restbase('summary')
en.wikipedia.org (restbase) /page/summary/Ella Fitzgerald
Ella Fitzgerald (en) data
{
exhtml: <str(1455)> <p><b>Ella Jane Fitzgerald</b> (April 25, 19...
exrest: <str(1424)> Ella Jane Fitzgerald (April 25, 1917 – June ...
...
}
A representative image for a page can come from the MediaWiki API, from an Infobox, from Wikidata Property:P18, or from RESTBase. See the Images wiki page for details.
>>> page = wptools.page('Frida Kahlo')
>>> page.get_query()
en.wikipedia.org (query) Frida Kahlo
en.wikipedia.org (imageinfo) File:Frida Kahlo, by Guillermo Kahlo...
Frida Kahlo (en) data
{
image: <list(2)> {'kind': 'query-pageimage', u'descriptionshortu...
...
}
>>> page.pageimage()
['query-pageimage', 'query-thumbnail']
>>> page.pageimage('page')['url']
u'https://upload.wikimedia.org/wikipedia/commons/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
>>> page.pageimage('thumb')['url']
u'https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
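The same kind of lookup can be approximated directly on the raw image list. The sketch below uses a hypothetical helper, find_image(), which is not part of wptools. It assumes each entry is a dict with 'kind' and 'url' keys, as in the output above, and it runs on sample data with placeholder URLs so that no network access is needed.

```python
def find_image(images, token):
    """Return the first image dict whose 'kind' contains token, else None.

    Hypothetical helper (not a wptools function). Assumes every entry is a
    dict that may carry a 'kind' string such as 'query-pageimage'.
    """
    for image in images:
        if token in image.get('kind', ''):
            return image
    return None


# Sample entries shaped like page.data['image']; the URLs are placeholders.
images = [
    {'kind': 'query-pageimage', 'url': 'https://example.org/full.jpg'},
    {'kind': 'query-thumbnail', 'url': 'https://example.org/thumb.jpg'},
]

thumb = find_image(images, 'thumb')
print(thumb['url'])
```

Returning None for a missing kind lets the caller decide whether to fall back to another source, such as an Infobox or Wikidata image.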
The most performant way to get article HTML is via RESTBase.
>>> page = wptools.page('Buddha')
>>> page.get_restbase('html')
en.wikipedia.org (restbase) /page/html/Buddha
Buddha (en) data
{
html: <str(628054)> <!DOCTYPE html><html prefix="dc: http://purl...
}
Getting data from Infoboxes may be unavoidable, but getting Wikidata (via get_wikidata()) is preferred. Wikidata is structured but sometimes data poor, while Infoboxes are unstructured and frequently data rich. Please consider updating Wikidata if the information you want is only available in a MediaWiki instance so that others may benefit from open linked data.
>>> page = wptools.page('Fela Kuti')
>>> page.get_parse()
en.wikipedia.org (parse) Fela Kuti
en.wikipedia.org (imageinfo) File:Fela Kuti.jpg
Fela Kuti (en) data
{
infobox: <dict(17)> website, associated_acts, death_place, image...
...
}
>>> page.data['infobox']['instrument']
'Saxophone, vocals, keyboards, trumpet, guitar, drums'
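Because Infobox values are unstructured strings, some post-processing is usually needed before they can be used as data. A minimal sketch, assuming the value is a plain comma-separated string like the one above (split_infobox_list() is a hypothetical helper, not part of wptools, and it will not handle values that contain wiki templates or links):

```python
def split_infobox_list(value):
    """Split a comma-separated infobox value into stripped, non-empty items.

    Hypothetical helper. Assumes plain text with commas as the only
    separator, which many infobox fields do not guarantee.
    """
    return [item.strip() for item in value.split(',') if item.strip()]


# The string is the 'instrument' value shown in the example above.
instruments = split_infobox_list(
    'Saxophone, vocals, keyboards, trumpet, guitar, drums')
print(instruments)
```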
Most media (album, book, film, etc.) cover images on Wikipedia appear in an Infobox. For convenience, we put "cover" files from infoboxes in the image attribute.
>>> page = wptools.page('Blue Train (album)')
>>> page.get_parse()
en.wikipedia.org (parse) Blue Train (album)
en.wikipedia.org (imageinfo) File:John Coltrane - Blue Train.jpg
Blue Train (album) (en) data
{
image: <list(1)> {'kind': 'parse-cover', u'descriptionshorturl':...
infobox: <dict(16)> Name, Language, Artist, Cover, Recorded, Lab...
...
}
>>> page.pageimage()
['parse-cover']
>>> page.pageimage('cover')['url']
u'https://upload.wikimedia.org/wikipedia/en/6/68/John_Coltrane_-_Blue_Train.jpg'
We put Wikidata page claims in data['claims']. We fetch entity labels into data['labels'] and put it all together in data['wikidata']. See the Wikidata page in our wiki for more details.
>>> page = wptools.page('Stephen Fry')
>>> page.get_wikidata()
www.wikidata.org (wikidata) Stephen Fry
www.wikidata.org (labels) P1220|Q6625963|P2387|P434|Q1860|P2469|P...
www.wikidata.org (labels) P106|P268|P269|P27|P26|P21|Q4927100|P86...
www.wikidata.org (labels) P1050|P1969|Q765642
en.wikipedia.org (imageinfo) File:Stephen Fry cropped.jpg
Stephen Fry (en) data
{
aliases: <list(1)> Stephen John Fry
claims: <dict(74)> P646, P1220, P2387, P434, P648, P3192, P1050,...
description: English comedian, actor, writer, presenter, and activist
image: <list(1)> {'kind': 'wikidata-image', u'descriptionshortur...
label: Stephen Fry
labels: <dict(103)> P1220, Q6625963, P2387, P434, Q1860, P2469, ...
modified: <dict(1)> wikidata
pageid: 191035
requests: <list(5)> wikidata, labels, labels, labels, imageinfo
title: Stephen_Fry
what: human
wikibase: Q192912
wikidata: <dict(74)> Tumblr ID (P3943), MovieMeter director ID (...
wikidata_url: https://www.wikidata.org/wiki/Q192912
}
>>> page.data['wikidata']
{u'AllMovie artist ID (P2019)': u'p25206',
u'AlloCin\xe9 person ID (P1266)': u'11671',
u'BIBSYS ID (P1015)': u'90862409',
u'BNE ID (P950)': u'XX1358358',
u'BnF ID (P268)': u'13191060q',
u'CONOR ID (P1280)': u'39805539',
u'Commons category (P373)': u'Stephen Fry',
u'DNF person ID (P2626)': u'66163',
u'Discogs artist ID (P1953)': u'289153',
u'Elonet person ID (P2387)': u'241363',
u'Encyclop\xe6dia Britannica Online ID (P1417)': u'biography/Stephen-Fry',
u'FAST ID (P2163)': u'313699',
u'Filmportal ID (P2639)': u'8844ffd4f8964001a39a1c136dceea04',
u'Freebase ID (P646)': u'/m/0h0yt',
u'GND ID (P227)': u'115765646',
u'IMDb ID (P345)': u'nm0000410',
u'ISFDB author ID (P1233)': u'3347',
u'ISNI (P213)': [u'0000 0001 2129 064X', u'0000 0004 2241 3148'],
u'Instagram username (P2003)': u'stephenfryactually',
u'Internet Broadway Database person ID (P1220)': u'84850',
u'Kinopoisk person ID (P2604)': u'16465',
u'Last.fm music ID (P3192)': u'Stephen+Fry',
u'Library of Congress authority ID (P244)': u'n92115518',
u'MovieMeter director ID (P1969)': u'12195',
u'Munzinger IBA (P1284)': u'00000022844',
u'MusicBrainz artist ID (P434)': u'fad46635-5d90-484e-bcf9-5a8e3c1f8830',
u'NE.se ID (P3222)': u'stephen-fry',
u'NKCR AUT ID (P691)': u'jn19981001266',
u'NLR (Romania) ID (P1003)': u'RUNLRAUTH7766127',
u'NNDB people ID (P1263)': u'345/000055180',
u'NUKAT (WarsawU) authorities (P1207)': u'n96100436',
u'NYT topic ID (P3221)': u'person/stephen-fry',
u'National Portrait Gallery (London) person ID (P1816)': u'mp06527',
u'National Thesaurus for Author Names ID (P1006)': u'074121065',
u'Open Library ID (P648)': u'OL231965A',
u'PORT person ID (P2435)': u'11384',
u'PTBNP ID (P1005)': u'1469418',
u'Perlentaucher ID (P866)': u'stephen-fry',
u'Quora topic ID (P3417)': u'Stephen-Fry-actor',
u'SFDb person ID (P2168)': u'186797',
u'SUDOC authorities (P269)': u'035462418',
u'Scope.dk person ID (P2519)': u'4776',
u'Songkick artist ID (P3478)': u'81644',
u'Theatricalia person ID (P2469)': u'13fc',
u'Tumblr ID (P3943)': u'stephen-fry-me',
u'Twitter username (P2002)': u'stephenfry',
u'VIAF ID (P214)': [u'39518907', u'305718028'],
u'WikiTree ID (P2949)': u'Fry-2606',
u"audio recording of the subject's spoken voice (P990)": u'Stephen Fry voice.flac',
u'country of citizenship (P27)': u'United Kingdom (Q145)',
u'date of birth (P569)': u'+1957-08-24T00:00:00Z',
u'educated at (P69)': u"Queen's College (Q765642)",
u'employer (P108)': u'BBC (Q9531)',
u'given name (P735)': u'Stephen (Q4927100)',
u'image (P18)': u'Stephen Fry cropped.jpg',
u'instance of (P31)': u'human (Q5)',
u'languages spoken, written or signed (P1412)': u'English (Q1860)',
u'medical condition (P1050)': u'bipolar disorder (Q131755)',
u'movement (P135)': u'atheism (Q7066)',
u'name in native language (P1559)': u'Stephen John Fry',
u'nominated for (P1411)': [u'British Academy Television Award for Best Entertainment Performance (Q4969372)',
u'Tony Award for Best Featured Actor in a Play (Q1474410)',
u'Kentucky colonel (Q632482)'],
u'occupation (P106)': [u'actor (Q33999)',
u'comedian (Q245068)',
u'television presenter (Q947873)',
u'screenwriter (Q28389)',
u'autobiographer (Q18814623)',
u'writer (Q36180)',
u'director (Q3455803)',
u'television actor (Q10798782)',
u'novelist (Q6625963)',
u'stage actor (Q2259451)',
u'science fiction writer (Q18844224)',
u'film actor (Q10800557)'],
u'official website (P856)': u'http://www.stephenfry.com',
u'page banner (P948)': u'StephenFryWorldPride.jpg',
u'place of birth (P19)': u'Hampstead (Q25610)',
u'religion (P140)': u'atheism (Q7066)',
u'sex or gender (P21)': u'male (Q6581097)',
u'sexual orientation (P91)': u'homosexuality (Q6636)',
u'signature (P109)': u'Stephen Fry signature.svg',
u'spouse (P26)': u'Elliott Spencer (Q22808271)',
u"topic's main category (P910)": u'Category:Stephen Fry (Q8817795)',
u'website account on (P553)': u'Quora (Q51711)',
u'work period (start) (P2031)': u'+1982-00-00T00:00:00Z',
u'\u010cSFD person ID (P2605)': u'5127'}
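The keys and many of the values in this dict combine a human-readable label with a Wikidata ID in trailing parentheses, for example 'occupation (P106)' or 'human (Q5)'. A minimal sketch of separating the two follows. split_label() is a hypothetical helper, not part of wptools, and it assumes the ID is always the last parenthesized group and starts with P or Q:

```python
import re

# Label, one space, then a P/Q identifier in parentheses at the very end.
LABEL_ID = re.compile(r'^(?P<label>.*) \((?P<id>[PQ]\d+)\)$')


def split_label(text):
    """Split 'label (P123)' into ('label', 'P123').

    Returns (text, None) when no trailing ID is present. Hypothetical
    helper based only on the string format shown in the output above.
    """
    match = LABEL_ID.match(text)
    if match is None:
        return text, None
    return match.group('label'), match.group('id')


print(split_label('occupation (P106)'))
print(split_label('NLR (Romania) ID (P1003)'))
```

Anchoring the pattern at the end of the string keeps earlier parentheses, as in 'NLR (Romania) ID (P1003)', as part of the label.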
You can minimize the number of Wikidata (labels) requests by specifying only the labels you want with wanted_labels(). In the example below, we would normally make three or more calls for Wikidata labels, but let's assume we only want the gender property and the corresponding label ('sex or gender (P21)': 'female (Q6581072)'):
>>> page = wptools.page('Simone de Beauvoir')
>>> page.wanted_labels(['P21', 'Q6581072'])
>>> page.get_wikidata()
www.wikidata.org (wikidata) Simone de Beauvoir
www.wikidata.org (labels) P21|P31|Q5|Q6581072
Simone de Beauvoir (en) data
{
aliases: <list(10)> Simone-Lucie-Ernestine-Marie Bertrand de Bea...
claims: <dict(83)> P646, P723, P535, P800, P373, P648, P1273, P2...
description: <str(106)> French writer, intellectual, existential...
label: Simone de Beauvoir
labels: <dict(4)> P21, P31, Q5, Q6581072
modified: <dict(1)> wikidata
pageid: 8373
requests: <list(2)> wikidata, labels
title: Simone_de_Beauvoir
what: human
wikibase: Q7197
wikidata: <dict(2)> instance of (P31), sex or gender (P21)
wikidata_url: https://www.wikidata.org/wiki/Q7197
}
All the original claims are still there, but we've resolved fewer labels and therefore fewer Wikidata entries. We always get 'instance of (P31)' so that we know what we're looking at.
>>> page.data['wikidata']
{u'instance of (P31)': u'human (Q5)',
u'sex or gender (P21)': u'female (Q6581072)'}
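One way to see the effect on request counts is to tally the requests list that appears in the page data. The sketch below assumes that list holds one string per API call, as the outputs in these examples suggest. count_requests() is a hypothetical helper, and the two sample lists are copied from the Stephen Fry and Simone de Beauvoir outputs, so they compare different pages and are only indicative:

```python
from collections import Counter


def count_requests(requests):
    """Tally request types from a list like page.data['requests']."""
    return Counter(requests)


# Copied from the outputs above (different pages, shown for illustration).
without_wanted = count_requests(
    ['wikidata', 'labels', 'labels', 'labels', 'imageinfo'])
with_wanted = count_requests(['wikidata', 'labels'])

print(without_wanted['labels'], with_wanted['labels'])
```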
Simply calling get() on a page will automagically fetch extracts, images, infobox data, Wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.
>>> page = wptools.page('Gandhi').get()
en.wikipedia.org (query) Gandhi
en.wikipedia.org (parse) 19379
www.wikidata.org (wikidata) Q1001
www.wikidata.org (claims) Q6581097|Q5|Q129286|Q6512732|Q668
en.wikipedia.org (restbase) /page/summary/Mahatma_Gandhi
en.wikipedia.org (imageinfo) File:Portrait Gandhi.jpg|File:MKGandhi.jpg
Mahatma Gandhi (en) data
{
aliases: <list(10)> M K Gandhi, Mohandas Gandhi, Bapu, Gandhi, M...
claims: <dict(5)> Q6581097, Q5, Q129286, Q6512732, Q668
description: <str(67)> pre-eminent leader of Indian nationalism ...
exhtml: <str(1064)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b>...
exrest: <str(896)> Mahātmā Mohandas Karamchand Gandhi (; Hindust...
extext: <str(2985)> Mahātmā **Mohandas Karamchand Gandhi** (; Hi...
extract: <str(3212)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b...
image: <list(6)> {'kind': 'query-pageimage', u'descriptionshortu...
infobox: <dict(25)> known_for, other_names, image, signature, bi...
label: Mahatma Gandhi
length: 264,127
links: <list(10)> https://biblio.wiki/wiki/Mohandas_K._Gandhi, h...
modified: <dict(2)> wikidata, page
pageid: 19379
parsetree: <str(333405)> <root><template><title>Redirect</title>...
properties: <dict(8)> P345, P910, P27, P21, P569, P18, P31, P570
random: Pukara (Moquegua)
title: Mahatma_Gandhi
url: https://en.wikipedia.org/wiki/Mahatma_Gandhi
url_raw: https://en.wikipedia.org/wiki/Mahatma_Gandhi?action=raw
watchers: 1,733
what: human
wikibase: Q1001
wikidata: <dict(8)> category, death, citizenship, gender, image,...
wikidata_url: https://www.wikidata.org/wiki/Q1001
wikitext: <str(262663)> {{Redirect|Gandhi}}{{pp-move-indef}}{{pp...
}
You can also call get_more() to get further page data, such as page files, categories, languages, contributors, and average daily views. This results in a more expensive (slower) query:
>>> page.get_more()
en.wikipedia.org (querymore) Gandhi
Mahatma Gandhi (en) data
{
categories: <list(67)> Category:1869 births, Category:1948 death...
contributors: 2,608
files: <list(52)> File:Aum Om red.svg, File:Commons-logo.svg, Fi...
languages: <list(167)> {u'lang': u'af', u'title': u'Mahatma Gand...
modified: <dict(1)> page
pageid: 19379
title: Mahatma Gandhi
views: 21,603
}