
A brief introduction and demo

Rodrigo Palacios edited this page Apr 17, 2015 · 3 revisions

Greetings! This is @rodricios, author of eatiht, on which this repo is based.

But that's not entirely true. This repo also has an even better algorithm, nicknamed libextract.strategies.TABULAR. Why is it better? Because it targets tables, so with the following script you can download, parse, and visualize tabular data from the web.

Note: the example requires numpy, pandas, matplotlib, and requests. Download Anaconda's easy installer for the whole shebang.

# for those using ipython notebook only; drop this line in a plain Python script
%matplotlib inline
import re  # regex
import numpy as np
import pandas as pd
from requests import get

from libextract import extract # our standard extraction method
from libextract.strategies import TABULAR # for when dealing with tables
# currently found in 'fuzzy-table-formatter' branch
from libextract import prototypes

url = "http://en.wikipedia.org/wiki/Human_height"
human_heights = get(url)

strat = TABULAR + (prototypes.convert_table,)
tabs = list(extract(human_heights.content, strategy=strat))
table = tabs[0]
df = pd.DataFrame.from_dict(table)

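def convert_num(elem):
    # keep the number in front of "cm", e.g. "164.7 cm (5 ft 5 in)" -> 164.7
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*cm", elem)
    if match:
        return float(match.group(1))
    else:
        # empty cells, "N/A", and anything without a leading cm value
        return float('NaN')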

df['Average female height'] = df['Average female height'].apply(convert_num)

df = df.set_index('Country/Region')
df = df[np.isfinite(df['Average female height'])]
# DataFrame.sort was replaced by sort_values in later pandas releases
df = df.sort_values(by='Average female height', ascending=True)
# pd.options.display.mpl_style no longer exists in later pandas releases;
# call matplotlib.style.use('ggplot') beforehand for a similar look
ax = df.plot(kind='barh',
             figsize=[20, 80],
             legend=True,
             title="Countries Ranked by Average Female Height in cm.",
             xticks=list(range(0, 201, 5)),
             xlim=(140, 175))

And here's the output:

[Image: female heights sorted]

If you're wondering why there are multiple measurements for the same country, visit the Wikipedia page that libextract analyzed: Human height.
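If you want to try the cleaning steps from the script without downloading anything, here is a minimal standard-library sketch of the same idea: pull the centimetre value out of each cell, drop the rows that have none, and sort what is left. The name `parse_cm` and the sample rows are made up for illustration and are not real measurements or part of libextract.

```python
import math
import re

# Matches a leading number (integer or decimal) followed by "cm".
CM_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*cm")


def parse_cm(cell):
    """Return the centimetre value at the start of `cell`, or NaN if there is none."""
    match = CM_PATTERN.match(cell)
    return float(match.group(1)) if match else float("nan")


# Hypothetical rows shaped like (label, raw cell text); not real data.
rows = [
    ("Region A", "158.0 cm (5 ft 2 in)"),
    ("Region B", "N/A"),
    ("Region C", "165.5 cm (5 ft 5 in)"),
    ("Region D", ""),
    ("Region E", "152 cm"),
]

# Convert, drop rows without a usable number, then sort ascending by height.
parsed = [(label, parse_cm(text)) for label, text in rows]
finite = [(label, value) for label, value in parsed if math.isfinite(value)]
ranked = sorted(finite, key=lambda pair: pair[1])

for label, value in ranked:
    print(f"{label}: {value} cm")
```

Returning NaN instead of raising keeps one odd cell from stopping the whole run, which is the same reason the script filters with `np.isfinite` before sorting.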

What I am presenting with this library (with the generous help of @Eeo Jun) is a tool that I hope helps you work more efficiently.
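A closing aside on the `strat = TABULAR + (prototypes.convert_table,)` line in the script: the tuple concatenation suggests that a strategy is a tuple of steps that can be extended by appending more. The sketch below assumes each step is a callable applied in order to the previous step's result. That is an assumption about how such a pipeline could work, not libextract's actual implementation, and every function name in it is a hypothetical stand-in.

```python
from functools import reduce


def run_strategy(data, strategy):
    """Feed `data` through each callable in `strategy`, left to right."""
    return reduce(lambda value, step: step(value), strategy, data)


# Hypothetical stand-ins for extraction steps; these are not libextract functions.
def split_rows(text):
    return [line for line in text.splitlines() if line.strip()]


def split_cells(rows):
    return [[cell.strip() for cell in row.split("|")] for row in rows]


def to_columns(table):
    """Turn [header, row, row, ...] into {header_name: [values, ...]}."""
    header, *body = table
    return {name: [row[i] for row in body] for i, name in enumerate(header)}


BASE = (split_rows, split_cells)  # plays the role of TABULAR in this sketch
strat = BASE + (to_columns,)      # same shape as TABULAR + (convert_table,)

raw = "Country | Height\nA | 158 cm\nB | 165 cm\n"
columns = run_strategy(raw, strat)
print(columns)
```

The final step here returns a dict of column lists, which is one shape that `pd.DataFrame.from_dict` accepts. Keeping the steps in a plain tuple means a new stage can be added with `+` without touching the existing ones.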
