Exploratory Data Analysis

Exploratory Data Analysis with Python and Seaborn.

Data Science Process

(Image Credit: Wikimedia Commons, CC)

Learning objectives

Describe main characteristics of dataset: number of rows/columns, missing data, data types, preview.
Clean corrupted data: handle missing data, invalid data types, and incorrect values.
Visualize data distributions using the Seaborn Library: bar plots, count plots, histograms, box plots, violin plots, and more
Calculate and visualize correlations (relationships) between variables with the help of a heat map.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis is a statistical approach to analyzing data sets in order to quickly summarize their main characteristics. It may be supported with simple data visualizations such as box plots, histograms, scatter plots, cumulative distribution functions, and quantile-quantile (Q-Q) plots, among others.

John W. Tukey wrote the book Exploratory Data Analysis in 1977, where he held that too much emphasis in statistics was placed on statistical hypothesis testing and more emphasis needed to be placed on using data to suggest hypotheses to test. Exploratory Data Analysis does not need any previous assumption on the statistical distribution of the underlying data.

Tukey suggested computing the five-number summary of numerical data: the two extremes (minimum and maximum), the median, and the two quartiles, since they are defined for all empirical distributions.
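As a minimal sketch, the five-number summary can be computed with NumPy percentiles. The sample values below are hypothetical, and NumPy's default linear interpolation is only one of several quartile conventions, so results may differ slightly from Tukey's hinges.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([2, 4, 4, 5, 7, 9, 10, 12, 15, 40])

# Minimum, first quartile, median, third quartile, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

five_number_summary = {
    "min": float(minimum),
    "Q1": float(q1),
    "median": float(median),
    "Q3": float(q3),
    "max": float(maximum),
}
print(five_number_summary)
```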

Tukey also gives a criterion for identifying outliers. If Q₁ and Q₃ are the first and third quartiles and the interquartile range is IQR = Q₃ - Q₁, then an outlier is a value that falls below Q₁ - 1.5 IQR or above Q₃ + 1.5 IQR.

Tukey Outlier Criteria

(Image credit: UF Biostatistics Open Learning Textbook, CC)
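The outlier rule can be sketched in a few lines of NumPy. The data are hypothetical, and the quartiles use NumPy's default interpolation, which is an assumption rather than part of Tukey's definition.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([2, 4, 4, 5, 7, 9, 10, 12, 15, 40])

# First and third quartiles and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey's fences: anything outside them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence, outliers)
```

Flagged values are candidates for inspection, not automatically errors to delete.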

Common plots.

Histograms summarize the distribution of the data, by placing observations into intervals (bins) and counting the number of observations in each interval.
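To make the binning step concrete, here is a small sketch with `np.histogram`, which returns the counts and the bin edges without drawing anything. The values and the choice of four bins are arbitrary assumptions.

```python
import numpy as np

# Hypothetical observations.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range [min, max] into 4 equal-width bins and count observations.
counts, bin_edges = np.histogram(values, bins=4)

# Each bin is half-open [left, right), except the last, which includes its right edge.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:.0f}, {right:.0f}): {count}")
```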

A boxplot (box-and-whisker plot) provides a compact summary of the distribution of a variable. A standard boxplot consists of:

a box defined by the 25th and 75th percentiles,
a horizontal line or point on the box at the median, and
vertical lines (whiskers) drawn from each hinge (quartile) to the extreme value.

The cumulative distribution function (CDF) is a function F(x) that gives the probability that an observation of a variable X is not larger than a specified value x, that is, F(x) = P(X ≤ x).
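For observed data we usually work with the empirical CDF, the fraction of observations not larger than a given value. A minimal sketch follows; the helper name `ecdf` is our own, not a library function.

```python
import numpy as np

def ecdf(sample, x):
    """Return the fraction of observations in `sample` that are <= x."""
    sample = np.asarray(sample)
    return float(np.mean(sample <= x))

# Hypothetical observations.
values = [1, 2, 2, 3, 5]

print(ecdf(values, 2))  # 3 of the 5 observations are <= 2
```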

A quantile-quantile (Q-Q) plot, or probability plot, is a graphical means of comparing a variable with a particular theoretical distribution or with the distribution of another variable. One common application of the Q-Q plot is to check whether a variable is normally distributed.
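The points of a normal Q-Q plot can be computed by hand, as sketched below. The simulated sample and the plotting positions (i - 0.5)/n are assumptions made for illustration; other conventions for the positions exist.

```python
from statistics import NormalDist

import numpy as np

# Simulated sample (an assumption for illustration): normal with mean 10, sd 2.
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Sample quantiles are the sorted observations.
sample_quantiles = np.sort(sample)
n = len(sample_quantiles)

# Plotting positions (i - 0.5) / n, mapped to standard normal quantiles.
probs = (np.arange(1, n + 1) - 0.5) / n
std_normal = NormalDist()
theoretical_quantiles = np.array([std_normal.inv_cdf(float(p)) for p in probs])

# If the sample is roughly normal, the points lie close to a straight line,
# so the correlation between the two sets of quantiles is close to 1.
r = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(round(r, 3))
```

Plotting `theoretical_quantiles` against `sample_quantiles` as a scatterplot gives the Q-Q plot itself.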

Scatterplots are graphical displays of matched data plotted with one variable on the horizontal axis and the other variable on the vertical axis.

Data analysis with Pandas

Below we list some functions that are useful in an EDA. Some of them have been used previously.

Function	Description
df.columns	Prints column names of dataframe
df.compare()	Compare one dataframe with another and show differences
df.corr()	Compute pairwise correlation between columns excluding NaN/Null values
df.describe()	Generate descriptive statistics of numerical values
df.dropna()	Removes rows or columns with missing values
df.fillna()	Fill NaN/Null values using a specified method
df.head()	Prints first n=5 rows of a dataframe
df.info()	Print summary of dataframe
df.interpolate()	Fill NaN values using an interpolation method
df.isnull().sum()	Counts the missing values in each column
df.query()	Query the columns of a dataframe with a boolean expression
df.sample()	Return a random sample of rows from a dataframe
df.shape	Prints the dimensions of a dataframe (rows, columns)
df.tail()	Returns the last n=5 rows of a dataframe
df.dtypes	Prints data types of each column
pd.Series.unique()	Returns unique values from the series
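A short sketch of how several of these functions fit together on a first look at a dataset. The toy dataframe is hypothetical, and filling missing ages with the median is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "sex": ["male", "female", "female", "female", "male"],
})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # missing values per column
print(df.describe())      # descriptive statistics of numeric columns

# Fill the missing age with the median age, then filter rows with query().
df_filled = df.fillna({"age": df["age"].median()})
adults = df_filled.query("age >= 30")
print(adults)
```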

Additional Pandas Tools	Situations
merge, join, concatenate and compare	Forms of combining different data frames
Working with missing data	Available options for handling missing data
Group by - split, apply, combine	Pandas objects can be split on any of their axes
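As a brief illustration of combining dataframes and of split-apply-combine, the sketch below merges two hypothetical tables and then computes a grouped mean. All table and column names are invented for the example.

```python
import pandas as pd

# Hypothetical tables: passengers and a lookup of class names.
passengers = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4],
    "pclass": [1, 3, 3, 2],
    "fare": [71.0, 8.0, 7.0, 13.0],
})
classes = pd.DataFrame({
    "pclass": [1, 2, 3],
    "class_name": ["First", "Second", "Third"],
})

# Combine: a left merge keeps every passenger and adds the class name.
merged = passengers.merge(classes, on="pclass", how="left")

# Split by class, apply the mean to fare, combine into one Series.
mean_fare = merged.groupby("class_name")["fare"].mean()
print(mean_fare)
```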

Sidetable Library for Pandas

Optional

Another library we can use for Exploratory Data Analysis is the Sidetable library, written by Chris Moffitt.

To install it from a Jupyter Notebook we can enter the pip command:

!pip install sidetable

or if we are using conda, from a terminal run

conda install -c conda-forge sidetable

After sidetable is installed, we import it together with Pandas:

import pandas as pd
import sidetable

The functions we will cover are:

Freq function
Counts function
Missing function
Subtotal function

Freq function

The freq function returns a dataframe that conveys three pieces of information:

The number of observations (i.e. rows) for each category (value_counts()).
The percentage of each category in the entire column (value_counts(normalize=True)).
The cumulative versions of the two above.
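To see what these three pieces look like, here is a rough plain-Pandas equivalent built from `value_counts()`. It is a sketch of the idea, not sidetable's own implementation, and the column names of the result are our own.

```python
import pandas as pd

# Hypothetical categorical column.
classes = pd.Series(
    ["Third", "First", "Third", "Second", "Third", "First"], name="class"
)

# Count per category, sorted from most to least frequent.
counts = classes.value_counts()

freq = pd.DataFrame({
    "count": counts,
    "percent": counts / counts.sum() * 100,
})

# Cumulative versions of the two columns above.
freq["cumulative_count"] = freq["count"].cumsum()
freq["cumulative_percent"] = freq["percent"].cumsum()
print(freq)
```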

Counts function

Another useful function of sidetable is the counts function. It returns the number of unique values in each column along with some other measures:

The number of non-missing values in each column
The number of unique categories in each column
The most and least frequent categories in each column
The number of values that belong to the most and least frequent categories

Missing function

The missing function is pretty simple. It returns the count and percentage of missing values in each column.
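The same summary can be approximated in plain Pandas, which helps to see what is being reported. This is a sketch under our own column naming, not sidetable's implementation, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two columns.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Count of missing values per column, next to the total number of rows.
missing = pd.DataFrame({"missing": df.isnull().sum(), "total": len(df)})
missing["percent"] = missing["missing"] / missing["total"] * 100

# Show the columns with the most missing data first.
missing = missing.sort_values("missing", ascending=False)
print(missing)
```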

Subtotal function

The subtotal function is best used with the groupby function of Pandas. It adds a subtotal for levels of the grouping.

This is an example of how to use sidetable, which is called through the Pandas accessor df.stb.

import sidetable
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Compute the frequency of one variable
df.stb.freq(['class'], style=True)

# Build a frequency table for more than one column
df.stb.freq(['sex', 'class'], style=True)

# See what data is missing
df.stb.missing()

# Group data and add a subtotal (agg returns a dataframe, which exposes .stb)
df.groupby(['sex', 'class']).agg({'fare': ['sum']}).stb.subtotal()

We continue with the Titanic dataset loaded into the dataframe df.

Function	Description
`df.stb.freq(['class'], style=True)`	Similar to Pandas `df['class'].value_counts(normalize=True)`, with counts and cumulative totals added.
`df.stb.freq(['sex', 'class'], style=True)`	You can group more than one column together.
`df.stb.freq(['class'], value='fare')`	With a `value` argument, the totals are summed from another column (here `fare`) instead of counting rows.
`df.stb.freq(['class', 'who'], value='fare', thresh=80)`	Use `thresh` to set a cumulative-percentage cutoff; categories beyond the cutoff are grouped into a single row.
`df.stb.freq(['class', 'who'], value='fare', thresh=80, other_label='All others')`	Specify the label to be used for the grouped remaining categories.
`df.stb.counts()`	Shows how many unique values, most and least frequent values and total count.
`df.stb.counts(exclude='number')`	Excludes numeric columns (same syntax as DataFrame.select_dtypes).
`df.stb.missing(style=True)`	Summary of missing values.
`df.stb.missing(clip_0=True, style=True)`	Exclude variables with `0` missing values.
`df.stb.subtotal()`	Adds a Grand Total row.

(Please see more details in the sidetable documentation)

Low code EDA libraries

Automated EDA packages can perform EDA in a few lines of Python code.

Here is a small list of them:

See Low-code EDA Tools

The Seaborn Visualization Library

Please see Slides.

The Seaborn Library is built on top of the general visualization library Matplotlib. Seaborn makes it easier to visualize the statistical properties of a dataset.

There are several types of graphics that we can produce with Seaborn. We will only show a small set of them that can be used in performing an Exploratory Data Analysis.

Seaborn standard plotting functions

Function	Description
Relational Plots
sns.scatterplot()	Basic relational plot between variables
sns.lineplot()	Plot lines between values
Distribution Plots
sns.histplot()	Basic frequency distribution plot
sns.kdeplot()	The kernel density estimation plot
Categorical Plots
sns.stripplot()	Basic distribution categorical plot
sns.swarmplot()	Categorical plot without overlapping points
sns.boxplot()	Categorical box plots
sns.violinplot()	Categorical violin plots
sns.boxenplot()	Enhanced boxplot for larger datasets
sns.pointplot()	Point estimates and confidence intervals using scatter plot glyphs
sns.barplot()	Point estimates and confidence intervals as rectangular bars
sns.countplot()	Counts of observations in each categorical bin using bars
Regression Plots
sns.lmplot()	Plot data and regression model fits
Matrix Plots
sns.heatmap()	Plot rectangular data as a color-encoded matrix
Multiplot grids
sns.FacetGrid()	Multi-plot grid for plotting conditional relationships
sns.pairplot()	Plot pairwise relationships in a dataset
sns.jointplot()	Draw a plot of two variables with bivariate and univariate graphs

Seaborn objects interface

Optional

The seaborn.objects namespace is a new interface for making Seaborn plots. It offers a more consistent and flexible API, comprising a collection of composable classes for transforming and plotting data.

The objects interface should be imported with the following convention:

import seaborn.objects as so

The seaborn.objects interface is composed of classes, Plot being the most important. You specify plots by instantiating a Plot object and calling its methods.

Object	Description
so.Plot()	An interface for declaratively specifying statistical graphics.
so.Dot()	A mark suitable for dot plots or less-dense scatterplots.
so.Dots()	A dot mark defined by strokes to better handle overplotting.
so.Line()	A mark connecting data points with sorting along the orientation axis.
so.Lines()	A faster but less-flexible mark for drawing many lines.
so.Path()	A mark connecting data points in the order they appear.
so.Paths()	A faster but less-flexible mark for drawing many paths.
so.Dash()	A line mark drawn as an oriented segment for each datapoint.
so.Range()	An oriented line mark drawn between min/max values.
so.Bar()	A bar mark drawn between baseline and data values.
so.Bars()	A faster bar mark with defaults more suitable for histograms.
so.Area()	A fill mark drawn from a baseline to data values.
so.Band()	A fill mark representing an interval between values.
so.Text()	A textual mark to annotate or represent data values.
so.Agg(func='mean')	Aggregate data along the value axis using given method.
so.Est()	Calculate a point estimate and error bar interval.
so.Count()	Count distinct observations within groups.
so.Hist()	Bin observations, count them, and optionally normalize or cumulate.
so.Perc(k=5, method='linear')	Replace observations with percentile values.
so.PolyFit(order=2, gridsize=100)	Fit a polynomial of the given order and resample data onto predicted curve.
so.Dodge(empty='keep', gap=0, by=None)	Displacement and narrowing of overlapping marks along orientation axis.
so.Norm(func='max', where=None, by=None, percent=False)	Divisive scaling on the value axis after aggregating within groups.
so.Stack()	Displacement of overlapping bar or area marks along the value axis.

Jupyter Notebook Examples

General References