A brief introduction and demo
Rodrigo Palacios edited this page Apr 17, 2015 · 3 revisions
Greetings! This is @rodricios, author of eatiht, which this repo is based on.
But that's not entirely true. This repo also has an even better algorithm, nicknamed libextract.strategies.TABULAR. Why is it better? Because with the following script, you can download, parse, and visualize data from the web.
Note: the example requires numpy, pandas, and matplotlib, plus requests and libextract itself. Download Anaconda's easy installer for the whole shebang.
# for those using ipython notebook
%matplotlib inline
import re # regex
import numpy as np
import pandas as pd
from requests import get
from libextract import extract # our standard extraction method
from libextract.strategies import TABULAR # for when dealing with tables
# currently found in 'fuzzy-table-formatter' branch
from libextract import prototypes
url = "http://en.wikipedia.org/wiki/Human_height"
human_heights = get(url)
strat = TABULAR + (prototypes.convert_table,)
tabs = list(extract(human_heights.content, strategy=strat))
table = tabs[0]
df = pd.DataFrame.from_dict(table)
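def convert_num(elem):
    # Keep only the text before "cm", e.g. a cell like "164.7 cm (5 ft 5 in)" -> "164.7"
    elem = re.split("^(.*)cm", elem)
    elem = "".join(elem[:2]).strip()
    try:
        return float(elem)
    except ValueError:
        # Empty cells, "N/A", or anything else that isn't a plain number
        return float('NaN')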
df['Average female height'] = df['Average female height'].apply(convert_num)
df = df.set_index('Country/Region')
df = df[np.isfinite(df['Average female height'])]
# sort_values replaces df.sort(columns=...), which newer pandas releases no longer provide
df = df.sort_values(by='Average female height', ascending=True)
#s10 = s['Average female height'][:10]
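# pd.options.display.mpl_style is gone from newer pandas; set a matplotlib style instead
# ('ggplot' is just one choice of style)
import matplotlib.pyplot as pyplot
pyplot.style.use('ggplot')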
# df.plot returns a matplotlib Axes object, so call it `ax` rather than `plt`
ax = df.plot(kind='barh',
             figsize=[20, 80],
             legend=True,
             title="Countries Ranked by Average Female Height in cm.",
             xticks=list(range(0, 201, 5)),
             xlim=(140, 175))
And here's the output:
If you're wondering why there are multiple measurements for the same country, visit the Wikipedia page libextract analyzed: Human height.
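If you'd rather see one bar per country, one crude option is to average the repeated measurements. The sketch below shows the idea in plain Python so it runs without any extra packages; the rows are made up for illustration and are not taken from the Wikipedia table.

```python
from collections import defaultdict
from statistics import mean

# Made-up (country, height in cm) rows, for illustration only.
rows = [
    ("Country A", 160.0),
    ("Country A", 162.0),
    ("Country B", 158.5),
]

# Group every measurement under its country name.
by_country = defaultdict(list)
for country, height_cm in rows:
    by_country[country].append(height_cm)

# Collapse each group to a single value: the mean of its measurements.
collapsed = {country: mean(values) for country, values in by_country.items()}
```

With the DataFrame from the script, once 'Country/Region' is the index, the analogous pandas call would be df['Average female height'].groupby(level=0).mean(). Keep in mind that the repeated rows tend to come from different surveys, so a plain average is only a rough summary.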
What I'm presenting with this library (with the generous help of @Eeo Jun) is a tool that will hopefully help you work more efficiently.
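If you adapt the script to another table, the cell-cleaning step (convert_num in the script) is the part you'll most likely need to touch. Here is a minimal standalone sketch of that step, assuming cells that start with a number followed by "cm". The name parse_cm and the sample strings are hypothetical; neither comes from libextract or from the page itself.

```python
import math
import re

# Hypothetical helper, not part of libextract: match a number at the
# start of the cell that is followed by "cm".
_CM_VALUE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*cm")

def parse_cm(cell):
    """Return the centimetre value at the start of `cell`, or NaN if there is none."""
    match = _CM_VALUE.match(cell)
    return float(match.group(1)) if match else float("nan")

# Illustrative cell contents in the style of a height table.
samples = ["164.7 cm (5 ft 5 in)", "N/A", "", "158 cm"]
parsed = [parse_cm(s) for s in samples]

# Rows without a usable value come back as NaN and can be filtered out,
# much like the np.isfinite step in the script.
usable = [value for value in parsed if not math.isnan(value)]
```

Anchoring the pattern at the start of the cell means a trailing imperial conversion or footnote is ignored rather than parsed by accident.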