sidetable

sidetable started as a supercharged combination of pandas value_counts plus crosstab that builds simple but useful summary tables of your pandas DataFrame. It has since expanded to provide support for common and useful pandas tasks such as adding subtotals to your DataFrame or flattening hierarchical columns.

Usage is straightforward. Install and import sidetable. Then access it through the new .stb accessor on your DataFrame.

For the Titanic data: df.stb.freq(['class']) will build a frequency table like this:

	class	count	percent	cumulative_count	cumulative_percent
0	Third	491	55.1066	491	55.1066
1	First	216	24.2424	707	79.349
2	Second	184	20.651	891	100

You can also summarize missing values with df.stb.missing():

	missing	total	percent
deck	688	891	77.2166
age	177	891	19.8653
embarked	2	891	0.224467
embark_town	2	891	0.224467
survived	0	891	0
pclass	0	891	0
sex	0	891	0
sibsp	0	891	0
parch	0	891	0
fare	0	891	0
class	0	891	0
who	0	891	0
adult_male	0	891	0
alive	0	891	0
alone	0	891	0

You can group the data and add subtotals and grand totals with stb.subtotal():

df.groupby(['sex', 'class']).agg({'fare': ['sum']}).stb.subtotal()

		fare
		sum
sex	class
female	First	9975.8250
	Second	1669.7292
	Third	2321.1086
	female - subtotal	13966.6628
male	First	8201.5875
	Second	2132.1125
	Third	4393.5865
	male - subtotal	14727.2865
grand_total		28693.9493

You can also turn a hierarchical column structure into this:

titanic.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum'], 'age': ['mean']}).unstack().stb.flatten()

	embark_town	class	fare_sum_female	fare_sum_male	age_mean_female	age_mean_male
0	Cherbourg	First	4972.53	3928.54	36.0526	40.1111
1	Cherbourg	Second	176.879	254.212	19.1429	25.9375
2	Cherbourg	Third	337.983	402.146	14.0625	25.0168
3	Queenstown	First	90	90	33	44
4	Queenstown	Second	24.7	12.35	30	57
5	Queenstown	Third	340.159	465.046	22.85	28.1429
6	Southampton	First	4753.29	4183.05	32.7045	41.8972
7	Southampton	Second	1468.15	1865.55	29.7197	30.8759
8	Southampton	Third	1642.97	3526.39	23.2237	26.5748

sidetable has several useful features:

See total counts and their relative percentages in one table. This is roughly equivalent to combining the output of value_counts() and value_counts(normalize=True) into one table.
Include cumulative totals and percentages to better understand your thresholds. The Pareto principle applies to many different scenarios and this function makes it easy to see how your data is cumulatively distributed.
Aggregate multiple columns together to see frequency counts for grouped data.
Provide a threshold point above which all data is grouped into a single bucket. This is useful for quickly identifying the areas to focus your analysis.
Get a count of the missing values in your data.
Count the number of unique values for each column.
Add grand totals on any DataFrame and subtotals to any grouped DataFrame.
Pretty print columns

Quickstart

For the impatient:

$ python -m pip install sidetable

import sidetable
import pandas as pd

# Create your DataFrame
df = pd.read_csv(myfile.csv)

# Build a frequency table for one or more columns
df.stb.freq(['column1', 'column2'])

# See what data is missing
df.stb.missing()

# Group data and add a subtotal
df.groupby(['column1', 'column2'])['col3'].sum().stb.subtotal()

That's it.

Read on for more details and more examples of what you can do sidetable.

Rationale

The idea behind sidetable is that there are a handful of useful data analysis tasks that you might run on any data set early in the data analysis process. While each of these tasks can be done in a handful of lines of pandas code, it is a lot of typing and difficult to remember.

In addition to providing useful functionality, this project is also a test to see how to build custom accessors using some of pandas relatively new API. I am hopeful this can serve as a model for other projects whether open source or just for your own usage. Please check out the release announcement for more information about the usage and how to use this as a model for your own projects.

The solutions in sidetable are heavily based on three sources:

This tweet thread by Peter Baumgartner
An excellent article by Steve Miller that lays out many of the code concepts incorporated into sidetable.
Ted Petrou's post on finding the percentage of missing values in a DataFrame.

I very much appreciate the work that all three authors did to point me in this direction.

Installation

$  python -m pip install -U sidetable

This is the preferred method to install sidetable, as it will always install the most recent stable release. sidetable requires pandas 1.0 or higher and no additional dependencies. It should run anywhere that pandas runs.

If you prefer to use conda, sidetable is available on conda-forge:

$ conda install -c conda-forge sidetable

Usage

import pandas as pd
import sidetable
import seaborn as sns

df = sns.load_dataset('titanic')

sidetable uses the pandas DataFrame accessor api to add a .stb accessor to all of your DataFrames. Once you import sidetable you are ready to go. In these examples, I will be using seaborn's Titanic dataset as an example but seaborn is not a direct dependency.

freq

If you have used value_counts() before, you have probably wished it were easier to combine the values with percentage distribution.

df['class'].value_counts()

Third     491
First     216
Second    184
Name: class, dtype: int64

df['class'].value_counts(normalize=True)

Third     0.551066
First     0.242424
Second    0.206510
Name: class, dtype: float64

Which can be done, but is messy and a lot of typing and remembering:

pd.concat([df['class'].value_counts().rename('count'), 
        df['class'].value_counts(normalize=True).mul(100).rename('percentage')], axis=1)

	count	percentage
Third	491	55.1066
First	216	24.2424
Second	184	20.651

Using sidetable is much simpler and you get cumulative totals, percents and more flexibility:

df.stb.freq(['class'])

	class	count	percent	cumulative_count	cumulative_percent
0	Third	491	55.1066	491	55.1066
1	First	216	24.2424	707	79.349
2	Second	184	20.651	891	100

If you want to style the results so percentages and large numbers are easier to read, use style=True:

df.stb.freq(['class'], style=True)

	class	count	percent	cumulative_count	cumulative_percent
0	Third	491	55.11%	491	55.11%
1	First	216	24.24%	707	79.35%
2	Second	184	20.65%	891	100.00%

In addition, you can group columns together. If we want to see the breakdown among class and sex:

df.stb.freq(['sex', 'class'])

	sex	class	count	percent	cumulative_count	cumulative_percent
0	male	Third	347	38.945	347	38.945
1	female	Third	144	16.1616	491	55.1066
2	male	First	122	13.6925	613	68.7991
3	male	Second	108	12.1212	721	80.9203
4	female	First	94	10.5499	815	91.4703
5	female	Second	76	8.52974	891	100

You can use as many groupings as you would like.

By default, sidetable counts the data. However, you can specify a value argument to indicate that the data should be summed based on the data in another column. For this data set, we can see how the fares are distributed by class:

df.stb.freq(['class'], value='fare')

	class	fare	percent	cumulative_fare	cumulative_percent
0	First	18177.4	63.3493	18177.4	63.3493
1	Third	6714.7	23.4011	24892.1	86.7504
2	Second	3801.84	13.2496	28693.9	100

Another feature of sidetable is that you can specify a threshold. For many data analysis, you may want to break down into large groupings to focus on and ignore others. You can use the thresh argument to define a threshold and group all entries above that threshold into an "other" grouping:

df.stb.freq(['class', 'who'], value='fare', thresh=80)

	class	who	fare	percent	cumulative_fare	cumulative_percent
0	First	woman	9492.94	33.0834	9492.94	33.0834
1	First	man	7848.18	27.3513	17341.1	60.4348
2	Third	man	3617.53	12.6073	20958.6	73.042
3	Second	man	1886.36	6.57406	22845	79.6161
4	others	others	5848.95	20.3839	28693.9	100

You can further customize by specifying the label to use for all the others:

df.stb.freq(['class', 'who'], value='fare', thresh=80, other_label='All others')

	class	who	fare	percent	cumulative_fare	cumulative_percent
0	First	woman	9492.94	33.0834	9492.94	33.0834
1	First	man	7848.18	27.3513	17341.1	60.4348
2	Third	man	3617.53	12.6073	20958.6	73.042
3	Second	man	1886.36	6.57406	22845	79.6161
4	All others	All others	5848.95	20.3839	28693.9	100

counts

The counts() function shows how many unique values are in each column as well as the most and least frequent values & their total counts. This summary view can help you determine if you need to convert data to a categorical value. It can also help you understand the high level structure of your data.

df.stb.counts()

	count	unique	most_freq	most_freq_count	least_freq	least_freq_count
survived	891	2	0	549	1	342
sex	891	2	male	577	female	314
adult_male	891	2	True	537	False	354
alive	891	2	no	549	yes	342
alone	891	2	True	537	False	354
pclass	891	3	3	491	2	184
embarked	889	3	S	644	Q	77
class	891	3	Third	491	Second	184
who	891	3	man	537	child	83
embark_town	889	3	Southampton	644	Queenstown	77
sibsp	891	7	0	608	5	5
parch	891	7	0	678	6	1
deck	203	7	C	59	G	4
age	714	88	24.0	30	20.5	1
fare	891	248	8.05	43	63.3583	1

By default, all data types are included but you may use the exclude and include parameters to select specific types of columns. The syntax is the same as pandas select_dtypes

For example,

df.stb.counts(exclude='number')

	count	unique	most_freq	most_freq_count	least_freq	least_freq_count
sex	891	2	male	577	female	314
adult_male	891	2	True	537	False	354
alive	891	2	no	549	yes	342
alone	891	2	True	537	False	354
embarked	889	3	S	644	Q	77
class	891	3	Third	491	Second	184
who	891	3	man	537	child	83
embark_town	889	3	Southampton	644	Queenstown	77
deck	203	7	C	59	G	4

missing

sidetable also includes a summary table that shows the missing values in your data by count and percentage of total missing values in a column.

df.stb.missing()

	missing	total	percent
deck	688	891	77.2166
age	177	891	19.8653
embarked	2	891	0.224467
embark_town	2	891	0.224467
survived	0	891	0
pclass	0	891	0
sex	0	891	0
sibsp	0	891	0
parch	0	891	0
fare	0	891	0
class	0	891	0
who	0	891	0
adult_male	0	891	0
alive	0	891	0
alone	0	891	0

If you wish to see the results with styles applied to the Percent and Total column, use:

df.stb.missing(style=True)

	missing	total	percent
deck	688	891	77.22%
age	177	891	19.87%
embarked	2	891	0.22%
embark_town	2	891	0.22%
survived	0	891	0
pclass	0	891	0
sex	0	891	0
sibsp	0	891	0
parch	0	891	0
fare	0	891	0
class	0	891	0
who	0	891	0
adult_male	0	891	0
alive	0	891	0
alone	0	891	0

Finally, you can exclude the columns that have 0 missing values using the clip_0=True parameter:

df.stb.missing(clip_0=True, style=True)

	missing	total	percent
deck	688	891	77.22%
age	177	891	19.87%
embarked	2	891	0.22%
embark_town	2	891	0.22%

subtotal

Another useful function is the subtotal function. Trying to add a subtotal to grouped pandas data is not easy. sidetable adds a subtotal() function that makes adds a subtotal at one or more levels of a DataFrame.

The subtotal function can be applied to a simple DataFrame in order to add a Grand Total label:

df.stb.subtotal()

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
887	1	1	female	19	0	0	30	S	First	woman	0	B	Southampton	yes	1
888	0	3	female	nan	1	2	23.45	S	Third	woman	0	nan	Southampton	no	0
889	1	1	male	26	0	0	30	C	First	man	1	C	Cherbourg	yes	1
890	0	3	male	32	0	0	7.75	Q	Third	man	1	nan	Queenstown	no	1
grand_total	342	2057	nan	21205.2	466	340	28693.9	nan	nan	nan	537	nan	nan	nan	537

The real power of subtotal is being able to add it to one or more levels of your grouped data. For example, you can group the data and add a subtotal at each level:

df.groupby(['sex', 'class', 'embark_town']).agg({'fare': ['sum']}).stb.subtotal()

Which yields this view (truncated for simplicity):

			fare
			sum
sex	class	embark_town
female	First	Cherbourg	4972.5333
		Queenstown	90.0000
		Southampton	4753.2917
		female \| First - subtotal	9815.8250
	Second	Cherbourg	176.8792
		Queenstown	24.7000
		Southampton	1468.1500
		female \| Second - subtotal	1669.7292
	Third	Cherbourg	337.9833
		Queenstown	340.1585
		Southampton	1642.9668
		female \| Third - subtotal	2321.1086
	female - subtotal		13806.6628
male	First	Cherbourg	3928.5417
male	First	Queenstown	90.0000

By default, every level in the DataFrame will be subtotaled but you can control this behavior by using the sub_level argument. For instance, you can subtotal on sex and class by passing the argument sub_level=[1,2]

summary_table = df.groupby(['sex', 'class', 'embark_town']).agg({'fare': ['sum']})
summary_table.stb.subtotal(sub_level=[1, 2])

The subtotal function also allows the user to configure the labels and separators used in the subtotal and Grand Total by using the grand_label, sub_label, show_sep and sep arguments.

flatten

When grouping and pivoting data, you can end up with a DataFrame that has a multiindex. Often times, you want a simple flat representation of the data.

For example, we can build a table using a groupby() plus unstack() that looks like this:

df.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum'], 'age': ['mean']}).unstack()

		fare		age
		sum		mean
	sex	female	male	female	male
embark_town	class
Cherbourg	First	4972.5333	3928.5417	36.052632	40.111111
	Second	176.8792	254.2125	19.142857	25.937500
	Third	337.9833	402.1462	14.062500	25.016800
Queenstown	First	90.0000	90.0000	33.000000	44.000000
	Second	24.7000	12.3500	30.000000	57.000000
	Third	340.1585	465.0458	22.850000	28.142857
Southampton	First	4753.2917	4183.0458	32.704545	41.897188
	Second	1468.1500	1865.5500	29.719697	30.875889
	Third	1642.9668	3526.3945	23.223684	26.574766

If you wish to flatten it, use stb.flatten():

df.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum'], 'age': ['mean']}).unstack().stb.flatten()

	embark_town	class	fare_sum_female	fare_sum_male	age_mean_female	age_mean_male
0	Cherbourg	First	4972.53	3928.54	36.0526	40.1111
1	Cherbourg	Second	176.879	254.212	19.1429	25.9375
2	Cherbourg	Third	337.983	402.146	14.0625	25.0168
3	Queenstown	First	90	90	33	44
4	Queenstown	Second	24.7	12.35	30	57
5	Queenstown	Third	340.159	465.046	22.85	28.1429
6	Southampton	First	4753.29	4183.05	32.7045	41.8972
7	Southampton	Second	1468.15	1865.55	29.7197	30.8759
8	Southampton	Third	1642.97	3526.39	23.2237	26.5748

flatten will also take additional arguments:

Add a custom separator using the sep argument - stb.flatten(sep='|')
Control whether or not to reset the index using reset argument - stb.flatten(reset=False)
Reorganize the output levels using levels argument levels=2
- levels can also take a list of valid levels if you want to reorganize the display levels=[0,2]

fares = df.groupby(['embark_town', 'class', 'sex']).agg({'fare': ['sum'], 'age': ['mean']}).unstack()
fares.stb.flatten(sep='|', reset=False, levels=[0,2])

		fare\|female	fare\|male	fare\|female	fare\|male	age\|female	age\|male
embark_town	class
Cherbourg	First	4972.5333	3928.5417	115.640309	93.536707	36.052632	40.111111
	Second	176.8792	254.2125	25.268457	25.421250	19.142857	25.937500
	Third	337.9833	402.1462	14.694926	9.352237	14.062500	25.016800
Queenstown	First	90.0000	90.0000	90.000000	90.000000	33.000000	44.000000
	Second	24.7000	12.3500	12.350000	12.350000	30.000000	57.000000
	Third	340.1585	465.0458	10.307833	11.924251	22.850000	28.142857
Southampton	First	4753.2917	4183.0458	99.026910	52.949947	32.704545	41.897188
	Second	1468.1500	1865.5500	21.912687	19.232474	29.719697	30.875889
	Third	1642.9668	3526.3945	18.670077	13.307149	23.223684	26.574766

prettyprint

This function interprets the magnitude of your numeric results and returns a nicely formatted version of all the numbers. This can be used on a full DataFrame or during your analysis of aggregated data.

For instance, if you are summarizing data, you may get something that looks like this:

df.groupby(['pclass', 'sex']).agg({'fare': 'sum'})

		fare
pclass	sex
1	female	9975.8250
1	male	8201.5875
2	female	1669.7292
2	male	2132.1125
3	female	2321.1086
3	male	4393.5865

Use stb.pretty() to format it nicely so you can have the same order or magnitude for all numbers:

df.groupby(['pclass', 'sex']).agg({'fare': 'sum'}).div(df['fare'].sum()).stb.pretty()

		fare
pclass	sex
1	female	9.98k
1	male	8.20k
2	female	1.67k
2	male	2.13k
3	female	2.32k
3	male	4.39k

Here's an example of a percentage format:

df.groupby(['pclass', 'sex']).agg({'fare': 'sum'}).div(df['fare'].sum()).stb.pretty(precision=0, caption="Fare Percentage")

Fare Percentage

		fare
pclass	sex
1	female	35%
1	male	29%
2	female	6%
2	male	7%
3	female	8%
3	male	15%

Behind the scenes, pretty will attempt to normalize the values. You can control the precision, rows add a caption.

Caveats

sidetable supports grouping on any data type in a pandas DataFrame. This means that you could try something like:

df.stb.freq(['fare'])

In some cases where there are a fairly small discrete number of this may be useful. However, if you have a lot of unique values, you should bin the data first. In the example, above the data would include 248 rows and not be terribly useful.

One alternative could be:

df['fare_bin'] = pd.qcut(df['fare'], q=4, labels=['low', 'medium', 'high', 'x-high'])
df.stb.freq(['fare_bin'])

	fare_bin	count	percent	cumulative_count	cumulative_percent
0	medium	224	25.1403	224	25.1403
1	low	223	25.0281	447	50.1684
2	x-high	222	24.9158	669	75.0842
3	high	222	24.9158	891	100

The other caveat is that null or missing values can cause data to drop out while aggregating. For instance, if we look at the deck variable, there are a lot of missing values.

df.stb.freq(['deck'])

	deck	count	percent	cumulative_count	cumulative_percent
0	C	59	29.064	59	29.064
1	B	47	23.1527	106	52.2167
2	D	33	16.2562	139	68.4729
3	E	32	15.7635	171	84.2365
4	A	15	7.38916	186	91.6256
5	F	13	6.40394	199	98.0296
6	G	4	1.97044	203	100

The total cumulative count only goes up to 203 not the 891 we have seen in other examples. Future versions of sidetable may handle this differently. For now, it is up to you to decide how best to handle unknowns. For example, this version of the Titanic data set has a categorical value for deck so using fillna requires an extra step:

df['deck_fillna'] = df['deck'].cat.add_categories('UNK').fillna('UNK')
df.stb.freq(['deck_fillna'])

	deck_fillna	count	percent	cumulative_count	cumulative_percent
0	UNK	688	77.2166	688	77.2166
1	C	59	6.62177	747	83.8384
2	B	47	5.27497	794	89.1134
3	D	33	3.7037	827	92.8171
4	E	32	3.59147	859	96.4085
5	A	15	1.6835	874	98.092
6	F	13	1.45903	887	99.5511
7	G	4	0.448934	891	100

Another variant is that there might be certain groupings where there are no valid counts.

For instance, if we look at the deck and class:

df.stb.freq(['deck', 'class'])

	deck	class	count	percent	cumulative_count	cumulative_percent
0	C	First	59	29.064	59	29.064
1	B	First	47	23.1527	106	52.2167
2	D	First	29	14.2857	135	66.5025
3	E	First	25	12.3153	160	78.8177
4	A	First	15	7.38916	175	86.2069
5	F	Second	8	3.94089	183	90.1478
6	F	Third	5	2.46305	188	92.6108
7	G	Third	4	1.97044	192	94.5813
8	E	Second	4	1.97044	196	96.5517
9	D	Second	4	1.97044	200	98.5222
10	E	Third	3	1.47783	203	100

There are only 11 combinations. If we want to see all - even if there are not any passengers fitting that criteria, use clip_0=False

df.stb.freq(['deck', 'class'], clip_0=False)

	deck	class	count	percent	cumulative_count	cumulative_percent
0	C	First	59	29.064	59	29.064
1	B	First	47	23.1527	106	52.2167
2	D	First	29	14.2857	135	66.5025
3	E	First	25	12.3153	160	78.8177
4	A	First	15	7.38916	175	86.2069
5	F	Second	8	3.94089	183	90.1478
6	F	Third	5	2.46305	188	92.6108
7	G	Third	4	1.97044	192	94.5813
8	E	Second	4	1.97044	196	96.5517
9	D	Second	4	1.97044	200	98.5222
10	E	Third	3	1.47783	203	100
11	G	Second	0	0	203	100
12	G	First	0	0	203	100
13	F	First	0	0	203	100
14	D	Third	0	0	203	100
15	C	Third	0	0	203	100
16	C	Second	0	0	203	100
17	B	Third	0	0	203	100
18	B	Second	0	0	203	100
19	A	Third	0	0	203	100
20	A	Second	0	0	203	100

In many cases this might be too much data, but sometimes the fact that a combination is missing could be insightful.

The final caveat relates to subtotal. When working with the subtotal function, sidetable convert a Categorical MultiIndex to a plain index in order to easily add the subtotal labels.

TODO

Handle NaN values more effectively
Offer binning options for continuous variables
Offer more options, maybe plotting?

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given. If you have a new idea for a simple table that we should add, please submit a ticket.

For more info please click here

Credits

This package was created with Cookiecutter and the oldani/cookiecutter-simple-pypackage project template. The code used in this package is heavily based on the posts from Peter Baumgartner, Steve Miller and Ted Petrou. Thank you!

Cookiecutter
oldani/cookiecutter-simple-pypackage
Peter Baumgartner - tweet thread
Steve Miller - article
Ted Petrou - post

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.github		.github
sidetable		sidetable
tests		tests
.gitignore		.gitignore
AUTHORS.md		AUTHORS.md
CONTRIBUTING.md		CONTRIBUTING.md
HISTORY.md		HISTORY.md
LICENSE		LICENSE
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sidetable

Table of Contents:

Quickstart

Rationale

Installation

Usage

freq

counts

missing

subtotal

flatten

prettyprint

Caveats

TODO

Contributing

Credits

About

Releases 9

Sponsor this project

Contributors 3

Languages

License

chris1610/sidetable

Folders and files

Latest commit

History

Repository files navigation

sidetable

Table of Contents:

Quickstart

Rationale

Installation

Usage

freq

counts

missing

subtotal

flatten

prettyprint

Caveats

TODO

Contributing

Credits

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 9

Sponsor this project

Contributors 3

Languages