Brought to you by Lesley Cordero.
- 0.0 Setup
- 1.0 Introduction
- 2.0 Basic Plots
- 3.0 Matplotlib
- 4.0 Seaborn
- 5.0 ggplot
- 1.0 Introduction
- 2.0 Seaborn
- 3.0 Bokeh
- 4.0 Advanced Graphs
This guide was written in Python 3.5.
We'll be needing different modules. Let's install the ones we'll need for this tutorial. Open up your terminal and enter the following commands to install the needed python modules:
pip3 install ggplot
pip3 install seaborn
pip3 install requests
pip3 install bokeh
pip3 install seaborn
pip3 install requests
pip3 install bokeh
pip3 install Image
pip3 install wordcloud
If you'd like to work in a virtual environment, you can set it up as follows:
pip3 install virtualenv
virtualenv your_env
And then launch it with:
source your_env/bin/activate
To execute the visualizations in matplotlib, do the following:
cd ~/.matplotlib
vim matplotlibrc
And then, write backend: TkAgg
in the file. Now you should be set up with your virtual environment!
Cool, now we're ready to start!
Data Visualization is an important and exciting aspect of data science. It reveals information we otherwise wouldn't have noticed. It allows us to showcase the work we've done through visualizations, which can be stagnant or interactive.
matplotlib is a 2D python plotting library which allows you to generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc, with just a few lines of code.
bokeh is an interactive visualization library for modern web browsers presentation.
seaborn is a library for making statistical graphics in Python. It's built on top of matplotlib and is tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
ggplot is a plotting system built for making profressional-looking plots quickly with minimal code.
matplotlib.pyplot
is a collection of functions that make matplotlib work similar to matlab. Each pyplot function makes some change to a figure. In matplotlib.pyplot
various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes.
In this section, we'll overview the basic plot types: line plots, scatter plots, and histograms.
Line graphs are plots where a line is drawn to indicate a relationship between a particular set of x and y values.
import matplotlib.pyplot as plt
To be able to plot anything, we need to provide the data points, so we declare those as follows:
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Using matplotlib
, we can plot a line between plot x and y.
plt.plot(x, y)
And as always, we use the show()
method to have the visualizations pop up.
plt.show()
Alternatively, you might want to plot quantities with 2 positions as data points. To do this, you first have to import the needed libraries, as always. We'll be using the same data from before:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
Next, we plot it with the plt.plot()
method. Note that the o
denotes the type of graph we're plotting.
plt.plot(x, y, 'o')
As always, let's look at what we made:
plt.show()
Histograms are very often used in science applications and it's highly likely that you will need to plot them at some point. They are very useful to plot distributions. As before, we'll use numpy
and matplotlib
.
import numpy as np
import matplotlib.pyplot as plt
First, we'll make the data to plot. We're going to make a normal distribution with 1000 points.
data = np.random.normal(5.0, 3.0, 1000)
Now, we actually make that histogram of the data array and attach a label:
plt.hist(data)
Lastly, let's look at what we've made:
plt.show()
The default customization for matplotlib isn't very appealing or even helpful in data visualization tasks.
When there are multiple data points or objects, they have to be able to be differentiated between one another. An easy way to do that is by using different marker styles and colors. You can do this by editing the third parameter to include a letter that indicates the color, such as:
plt.plot(x, y, "r")
This will give you the same line as before, but now it'll be red.
You can also customize the style of the your lines and markers. With line graphs, you can change the line to be dotted, dashed, etc, for example the following should give you a dashed line now:
plt.plot(x,y, "--")
You can find other linestyles you can use can be found on the [Matplotlib webpage](http:// matplotlib.sourceforge.net/api/pyplot)
With Scatter Plots, you can customize the dots to be squares, pentagons, etc. This will get you the a scatter plot with star markers:
plt.plot(x,y, "*")
We want to always label the axes of plots to tell users what they're looking at. You can do this in matplotlib, very easily:
If we want to attach a label on the x-axis, we can do that with the xlabel()
function:
plt.xlabel("X Axis")
With a quick modification, we can also do that for the y-axis:
plt.ylabel("Y Axis")
What good is a visualization without a title to let us know what the visualization is showing? Luckily, matplotlib has a built in function for that:
plt.title("Title Here")
Lastly, we can even customize the range for the x and y axes:
plt.xlim(0, 10)
plt.ylim(0, 10)
seaborn is a library for making statistical graphics in Python. It's built on top of matplotlib and is tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
import matplotlib.pyplot as plt
import numpy as np
from urllib.request import urlretrieve
import pandas as pd
url = "https://gist.githubusercontent.com/jhamrick/cfa18fcd3032ba435ec78a194b1447be/raw/4a4052c56161df8e454a61ab5286a769799c64b8/task_data.csv"
urlretrieve(url, "task_data.csv")
task_data = pd.read_csv("task_data.csv")
task_data.head()
fig, ax = plt.subplots()
Plot the bars:
tasks = task_data.groupby(['robot', 'inference'])['robot_tasks'].mean()
ax.bar(np.arange(len(tasks)), tasks, align='center')
Show the 50% mark, which would indicate an equal number of tasks being completed by the robot and the human. There are 39 tasks total, so 50% is 19.5
ax.hlines(19.5, -0.5, 5.5, linestyle='--', linewidth=1)
Set a reasonable y-axis limit
ax.set_ylim(0, 40)
Apply labels to the bars so you know which is which
ax.set_xticks(np.arange(len(tasks)))
ax.set_xticklabels(["\n".join(x) for x in tasks.index])
Now let's see our work!
plt.show()
And we get:
Now, let's import seaborn and see what happens.
import seaborn as sns
Now, if we re-run the code, we'll see the different default graph for seaborn:
fig, ax = plt.subplots()
ax.bar(np.arange(len(tasks)), tasks, align='center')
ax.hlines(19.5, -0.5, 5.5, linestyle='--', linewidth=1)
ax.set_ylim(0, 40)
ax.set_xticks(np.arange(len(tasks)))
ax.set_xticklabels(["\n".join(x) for x in tasks.index])
plt.show()
And we get:
The barplot function gives us a legend telling us which color corresponds to which inference type. But, for this plot, I'd actually like to put the inference types under the bars as ticklabels, and then label each group of bars with the robot type. I can accomplish this by splitting the plot into three subplots, which is quite easy to do using Seaborn's FacetGrid:
g = sns.FacetGrid(
task_data,
col="robot",
col_order=["fixed", "reactive", "predictive"],
sharex=False)
# Create the bar plot on each subplot
g.map(
sns.barplot,
"robot", "robot_tasks", "inference",
hue_order=["oracle", "bayesian"])
# Now I need to draw the 50% lines on each subplot
# separately
axes = np.array(g.axes.flat)
for ax in axes:
ax.hlines(19.5, -0.5, 0.5, linestyle='--', linewidth=1)
ax.set_ylim(0, 40)
plt.gcf()
plt.show()
So we get:
Ggplot is yet another module we can use for data visualizations in Python.
from ggplot import *
ggplot(aes(x='date', y='beef'), data=meat) +\
geom_line() +\
stat_smooth(colour='blue', span=0.2)
Brought to you by Lesley Cordero, Byte Academy, and ADI.
This guide was written in Python 3.5.
We'll be needing different modules. Let's install the ones we'll need for this tutorial. Open up your terminal and enter the following commands to install the needed python modules:
pip3 install ggplot
pip3 install seaborn
pip3 install requests
pip3 install bokeh
pip3 install Image
pip3 install wordcloud
If you'd like to work in a virtual environment, you can set it up as follows:
pip3 install virtualenv
virtualenv your_env
And then launch it with:
source your_env/bin/activate
To execute the visualizations in matplotlib, do the following:
cd ~/.matplotlib
vim matplotlibrc
And then, write backend: TkAgg
in the file. Now you should be set up with your virtual environment!
Cool, now we're ready to start!
Unique data visualizations are more memorable. Most of the visualizations we've seen so far are kind of boring, especially if you have multiple of them. You're very quickly going to lose the attention of your audience. In this tutorial, we'll get into more specific and unique approaches.
It's important to note that while these approaches definitely have applications, they're not as commonly used as bar graphs, line graphs, scatter plots, or even histograms. These basic visualization types can still be the best choice for certain, straightforward data stories. However, communicating complex topics — hierarchies, longitudinal data, and multi-variable comparisons, and so on — often involves more advanced visualizations. That's our focus for this tutorial.
But before we go into code, let's overview some strategies for effective data visualization.
Very often, when it comes to plotting, less is more. Data usually looks better "naked". What does this mean?
-
Remove backgrounds
-
Remove redundant labels
-
Lighten your lines, or remove them!
-
Reduce unnecessary colors
-
Remove special effects
The default library, matplotlib, is fairly ugly, as we've reviewed before. This is why using modules like seaborn, bokeh, etc, are so popular - their default color schemes are much more appealing.
Also, keep in mind that many people are color blind, so choosing your color schemes is incredibly important so that your work is accessible to everyone.
Recall, seaborn is a Python data visualization library with an emphasis on statistical plots. The library is an excellent resource for common regression and distribution plots, but where Seaborn really shines is in its ability to visualize many different features at once.
To showcase Seaborn, we'll use the UCI "Auto MPG" data set. First, we'll need to do a bit of pre-processing using pandas. We import the needed modules, as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('darkgrid')
In names, we store the names of the columns we'll be working with:
names = [
'mpg'
, 'cylinders'
, 'displacement'
, 'horsepower'
, 'weight'
, 'acceleration'
, 'model_year'
, 'origin'
, 'car_name'
]
Next, we read in the information from the CSV by feeding it in the link of the dataset and the column names:
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", sep='\s+', names=names)
We make a 'maker' column by taking the car_name column and taking the first word:
df['maker'] = df.car_name.map(lambda x: x.split()[0])
Next, we transform the numerical values in the origin column to categorical values:
df.origin = df.origin.map({1: 'America', 2: 'Europe', 3: 'Asia'})
Then, we drop the rows with missing values. Notice that now we have 392 rows instead of 398.
df = df.applymap(lambda x: np.nan if x == '?' else x).dropna()
And lastly, we convert the numbers in horsepower to floats and show what we've got:
df['horsepower'] = df.horsepower.astype(float)
df.head()
Now, we can begin our visualizations!
In seaborn
, you can easily build conditional plots, which allows us to see what the data looks like when segmented by one or more variables We can do this with factorplots. If we wanted to see how cars' MPG has varied over time, we could see this in aggregate with:
sns.factorplot(data=df, x="model_year", y="mpg")
plt.show()
But we can also segment by, say, region of origin:
sns.factorplot(data=df, x="model_year", y="mpg", col="origin")
plt.show()
What's so great factorplot is that rather than having to segment the data ourselves and make the conditional plots individually, Seaborn provides a convenient API for doing it all at once.
The FacetGrid object is a slightly more complex, but also more powerful, take on the same idea. Let's say that we wanted to see KDE plots of the MPG distributions, separated by country of origin:
g = sns.FacetGrid(df, col="origin")
g.map(sns.distplot, "mpg")
plt.show()
Or let's say that we wanted to see scatter plots of MPG against horsepower with the same origin segmentation:
g = sns.FacetGrid(df, col="origin")
g.map(plt.scatter, "horsepower", "mpg")
plt.show()
Using FacetGrid, we can map any plotting function onto each segment of our data. For example, above we gave plt.scatter to g.map, which tells Seaborn to apply the matplotlib plt.scatter function to each of segments in our data. We don't need to use plt.scatter, though; we can use any function that understands the input data. For example, we could draw regression plots instead:
g = sns.FacetGrid(df, col="origin")
g.map(sns.regplot, "horsepower", "mpg")
plt.xlim(0, 250)
plt.ylim(0, 60)
plt.show()
We can even segment by multiple variables at once, spreading some along the rows and some along the columns. This is very useful for producing comparing conditional distributions across interacting segmentations:
df['tons'] = (df.weight/2000).astype(int)
g = sns.FacetGrid(df, col="origin", row="tons")
g.map(sns.kdeplot, "horsepower", "mpg")
plt.xlim(0, 250)
plt.ylim(0, 60)
plt.show()
While factorplot and FacetGrid are for drawing conditional plots of segmented data, pairplot and PairGrid are for showing the interactions between variables. For our car data set, we know that MPG, horsepower, and weight are probably going to be related; we also know that both these variable values and their relationships with one another, might vary by country of origin. Let's visualize all of that at once:
g = sns.pairplot(df[["mpg", "horsepower", "weight", "origin"]], hue="origin", diag_kind="hist")
for ax in g.axes.flat:
plt.setp(ax.get_xticklabels(), rotation=45)
plt.show()
As FacetGrid was a fuller version of factorplot, so PairGrid gives a bit more freedom on the same idea as pairplot by letting you control the individual plot types separately. Let's say, for example, that we're building regression plots, and we'd like to see both the original data and the residuals at once. PairGrid makes it easy:
g = sns.PairGrid(df[["mpg", "horsepower", "weight", "origin"]], hue="origin")
g.map_upper(sns.regplot)
g.map_lower(sns.residplot)
g.map_diag(plt.hist)
for ax in g.axes.flat:
plt.setp(ax.get_xticklabels(), rotation=45)
g.add_legend()
g.set(alpha=0.5)
plt.show()
We were able to control three regions (the diagonal, the lower-left triangle, and the upper-right triangle) separately. Again, you can pipe in any plotting function that understands the data it's given.
The final Seaborn objects we'll talk about are jointplot and JointGrid; these features let you easily view both a joint distribution and its marginals at once. Let's say, for example, that aside from being interested in how MPG and horsepower are distributed individually, we're also interested in their joint distribution:
sns.jointplot("mpg", "horsepower", data=df, kind='kde')
plt.show()
As before, JointGrid gives you a bit more control by letting you map the marginal and joint data separately. For example:
g = sns.JointGrid(x="horsepower", y="mpg", data=df)
g.plot_joint(sns.regplot, order=2)
g.plot_marginals(sns.distplot)
plt.show()
Bokeh is a python interactive visualization library that targets modern web browsers for presentation. We'll begin with a simple visualization using the built-in iris data:
from bokeh.sampledata.iris import flowers
from bokeh.plotting import figure, show, output_file
This configures the default output state to generate output saved to a file when show()
is called.
output_file("iris.html", title ="iris.py example")
This attaches color labels to the species:
colormap = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}
colors = [colormap[x] for x in flowers['species']]
This attaches a title to the visualization we're working on.
p = figure(title = "Iris Morphology")
Finally, we're adding the data!
p.circle(flowers["petal_length"], flowers["petal_width"],
color=colors, fill_alpha=0.2, size=10)
Finally, let's take a look:
show(p)
This is what we get! An awesome visualization:
Let's take it up a notch though.
Slopegraphs are a special type of line chart where two or more sets of values are compared by connecting each group's values on one scales to their values on the second scale. To make this more clear, let's look at an example:
For reference, you typically use these to compare a groups' rate of change or rank-order between categories.
Circle packing diagrams show groups as tightly-organized circles, and are often used to show hierarchies where smaller groups are either colored similarly to others in the same category, or nested within larger groups.
The packcircles
package provides an experimental version of the algorithm via the function circleGraphLayout. To use it, first encode the graph of tangencies as a list of vectors:
library(packcircles)
library(ggplot2)
internal <- list(
c(9, 4, 5, 6, 10, 11),
c(5, 4, 8, 6, 9),
c(8, 3, 2, 7, 6, 5, 4),
c(7, 8, 2, 1, 6)
)
Next, we specify the sizes of all external circles:
external <- data.frame(id = c(1, 2, 3, 4, 6, 10, 11), radius=10.0)
Now, pass these two objects to the circleGraphLayout function which will attempt to find a corresponding arrangement of circles which we can then be plotted:
layout <- circleGraphLayout(internal, external)
Here, we generate circle vertices from the layout for plotting:
plotdat <- circlePlotData(layout, xyr.cols = 2:4, id.col = 1)
And lastly, we draw the circles annotated with their IDs.
ggplot() +
geom_polygon(data=plotdat, aes(x, y, group=id), fill=NA, colour="brown") +
geom_text(data=layout, aes(x, y, label=id)) +
coord_equal() +
theme_bw()
This gets us:
Not the best looking thing ever, but it gets the general idea.
Sunbursts show a hierarchical structure in a circular layout, with each ring outward representing a deeper level of the hierarchy.
While sunbursts share some of the disadvantages of pie charts and are not well-suited to precise size comparisons, they do allow notable segments of a complex, multi-layered hierarchy to be quickly identified to guide further action.
A word cloud is a visual representation of text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text.
We begin by importing the needed python modules:
import os
from wordcloud import WordCloud
Recall that we can use the os
module to read the text to generate the word cloud:
d = os.getcwd()
text = open(os.path.join(d, 'constitution.txt')).read()
Now, we actually generate a word cloud with the generate()
method:
wordcloud = WordCloud().generate(text)
Using matplotlib, we'll now generate the actual image:
import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
You can customize the font size:
wordcloud = WordCloud(max_font_size=40).generate(text)
And finally, let's see what we've made:
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
This is what it looks like:
And that wraps up the Advanced Data Visualization portion of this course! As you can see, there are plenty of methods you can choose to visualize your data. This process definitely requires technical skills, but also a bit of creativity and intuition.
Next up, we'll go into Exploratory Analysis with Data Visualization.
Obviously this tutorial doesn't encompass everything there is to learn about Data Visualization. Here are some resources I recommend for learning more: