script.txt


Session 1
=========

You should do install if you have not already.
We have Internet!  But, I have USB keys too.
Format: lesson, exercises, solutions.

Open Exercise-1
Point out .from_csv functions

VIEWING

len(titles)
titles.head() and .head(20)
titles.tail() and .tail(10)
titles

FILTERING

h = titles.head()
h['year'] or h.year

h.year + 1000
h.year - 2000

h.year > 1960
h[h.year > 1960]
h[h.year > 1960 & h.year < 1970]
h[(h.year > 1960) & (h.year < 1970)]
t.year // 10 * 10
h[h.title == '...']

SORTING

titles.sort(['title'])
titles.sort(['year'])
titles.sort(['year', 'title'])

Session 2
=========

STRING METHODS

h.str.len()
h.str.startswith(s)
h.str.extract(RE)

AGGREGATION

titles.year.value_counts()
titles.year.value_counts().plot()  whoops!
titles.year.index
titles.year.value_counts().sort_index().plot()
titles.year.value_counts().sort_index().plot(kind='bar')

c = cast
c = c[c.character == 'Kermit the Frog]
c.plot(x='year', y='n', kind='scatter')

COLUMNS

Can be hard to see data

c = cast
c = c[c.character == 'Kermit the Frog']
c = c[['year', 'n']]
c

Can also:

c[['year']]

Session 3
=========

INDEXES - SPEED

%%time cast[cast.title == 'Sleuth']
c = cast.set_index(['title'])
%%time c.loc['Sleuth']
c = cast.set_index(['title']).sort_index()
%%time c.loc['Sleuth']

c = cast.set_index(['title', 'year']).sort_index()
c.loc['Sleuth']
c.loc['Sleuth',1996]
c.loc[('Sleuth',1996),'character']
c.loc[('Sleuth',1996),('character','n')]

.reset_index('title')
.reset_index('year')
.reset_index(['title', 'year'])
.reset_index()

INDEXES - GROUP BY

c = cast
c = c[c.name == 'George Clooney']
c.groupby(['title', 'year', 'character']).size()

c = cast
c = c[c.name == 'George Clooney']
c.groupby(['character', 'title', 'year']).size()

c = cast
c = c[c.name == 'George Clooney']
c.groupby(['character']).size()

# How many times has he had two roles in the same film?

c = cast
c = c[c.name == 'George Clooney']
c = c.groupby(['year', 'title']).size()
c[c > 1]

c = cast
c = c[c.name == 'George Clooney']
c.groupby([c.year // 10 * 10, 'character']).size()

c = cast
c = c[c.name == 'George Clooney']
c.groupby(['character', c.year // 10 * 10]).size()

TODO: mean min max!

Session 4
=========

UNSTACK

c = cast
c = c[(c.character == 'Kermit the Frog') | (c.character == 'Oscar the Grouch')]
g = c.groupby(['character', c.year // 10 * 10]).size()
g

How can we compare years?  Unstack!

g.unstack('year')
g.unstack('character')

u = g.unstack('character')
u['difference'] = u['Kermit the Frog'] - u['Oscar the Grouch']
u

But, NaN.

u = g.unstack('character').fillna(0)
u['difference'] = u['Kermit the Frog'] - u['Oscar the Grouch']
u

THE DANGERS OF UNSTACK

Do it again?  Oh no, we get a series!

.stack() again to repair damage, BUT can devolve to series again.

PLOTTING

Ratio?

u = g.unstack('character')
total = u['Oscar the Grouch'] + u['Kermit the Frog']
u['difference'] = u['Oscar the Grouch'] / total
u.difference.plot(ylim=[0,1])

Indexing and grouping has been moving our data LEFT.
"Unstacking" moves it UP, to columns!  Stacking, DOWN.

Session 5
=========

r = release_dates
r = r[r.title == 'Inception']
r.date.dt.year

year month date dayofweek dayofyear

MERGE

What if we were interested in fetching release dates,
NOT by information in that table itself,
but by information over in "cast"?

c = cast
c = c[c.name == 'Ellen Page']
c = c.merge(release_dates)
c

Session 6
=========

c = cast
c = c[c.n <= 2]
c = c[c.name == 'Cary Grant']

c = c.merge(cast, on=['title', 'year'])
c = c[c.n_y <= 2]
c = c[c.name_y != 'Cary Grant']
c = c[['title', 'year', 'name_x', 'name_y']]
c

c.groupby('name_y').size().order(ascending=False)

reindex? or what? yeah.
.dropna()
.info()

Pivot

r = release_dates
r = r[r.title.str.startswith('Star Wars: Episode')]
r = r[r.country.str.startswith('U')]
r.pivot('title', 'country', 'date')

which is the same as

r.set_index(['title', 'country'])[['date']].unstack()

.rename(columns={...})
.concat(df)

Thoughts for later
==================

(who had which co-stars how often)
(what pairs of co-stars have appeared the most often together)
Can you use merge to find who was in movies with each other?

Fix later: second exercise s/hamlet/batman/