Tables
Tables show up in most data science programming languages; R is no exception. It’s useful to visualize a table as a single-page spreadsheet, with columns of data and a header row of column names.
Here we load an example and print it (by simply typing its name).
iris = read_iris()
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
You’ll note that it prints that it is a “tibble”. That’s a pun, I guess, on saying “table” 10 times fast.
There’s so much you can do with tables (like tibbles) - learn more here
Anyway, let’s identify the essential parts of a table.
Below you can see that each row is considered a “record” or “observation” while each column is considered a “variable” or “field”. In addition we can see the data types. Row numbers are not actually part of the data, but they are printed as a visual convenience.
Figure: Essential table parts
Other data types you may encounter include…
- dbl
A floating point number like 7.1 or 3.14
- int
A whole integer like -14, 0 and 1001
- chr
A string-, text- or character-type. They all mean the same thing.
- fct
A factor, which is like a group membership
- Date
A date to the nearest whole day
- dttm
A date and time to the nearest, oh, geez, maybe microsecond?
- list
A list is a special container type - it can hold any old mix of things.
You may encounter other data types along the way.
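To see these types side by side, here is a minimal sketch that builds a tiny table by hand. The column names and values are made up for illustration (they are not from the iris data), and we assume the tibble package is installed.

```r
library(tibble)

# A small made-up table with one column per data type
types <- tibble(
  a_dbl  = c(7.1, 3.14),                             # floating point
  an_int = c(-14L, 1001L),                           # integer, note the L suffix
  a_chr  = c("apple", "banana"),                     # character
  a_fct  = factor(c("setosa", "virginica")),         # factor
  a_date = as.Date(c("2024-01-15", "2024-02-29")),   # date
  a_dttm = as.POSIXct(c("2024-01-15 08:30:00",
                        "2024-02-29 17:45:10"),
                      tz = "UTC"),                   # date and time
  a_list = list(1:3, "any old thing")                # list, a mix of things
)

# Printing shows the type abbreviation under each column name
types

# The underlying R class of each column
sapply(types, function(column) class(column)[1])
```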
For more detailed info see ?select.
Often you only need a few of the variables (columns) in a table. To pick out the desired columns you select them.
Think of selection as what you do at a buffet: you choose the wings and the rice, but you don’t choose the kale and broccoli.
Two common ways to select are to (a) simply identify the columns by name or (b) provide a matching function that identifies the desired columns.
This is the first way - just name the columns you want.
iris |> select(Sepal.Width, Petal.Width)
## # A tibble: 150 × 2
## Sepal.Width Petal.Width
## <dbl> <dbl>
## 1 3.5 0.2
## 2 3 0.2
## 3 3.2 0.2
## 4 3.1 0.2
## 5 3.6 0.2
## 6 3.9 0.4
## 7 3.4 0.3
## 8 3.4 0.2
## 9 2.9 0.2
## 10 3.1 0.1
## # ℹ 140 more rows
And here is the way to do it with a matching function. Here the matching function is looking at the end of each column name - looking for the pattern “Width”.
iris |> select( ends_with("Width") )
## # A tibble: 150 × 2
## Sepal.Width Petal.Width
## <dbl> <dbl>
## 1 3.5 0.2
## 2 3 0.2
## 3 3.2 0.2
## 4 3.1 0.2
## 5 3.6 0.2
## 6 3.9 0.4
## 7 3.4 0.3
## 8 3.4 0.2
## 9 2.9 0.2
## 10 3.1 0.1
## # ℹ 140 more rows
You can even mix-and-match these tricks. Here we simply rearrange the order of the columns by listing a series of selections inside one select.
iris |> select(Species,               # one per line
               ends_with("Width"),    # to improve
               ends_with("Length"))   # readability
## # A tibble: 150 × 5
## Species Sepal.Width Petal.Width Sepal.Length Petal.Length
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 3.5 0.2 5.1 1.4
## 2 setosa 3 0.2 4.9 1.4
## 3 setosa 3.2 0.2 4.7 1.3
## 4 setosa 3.1 0.2 4.6 1.5
## 5 setosa 3.6 0.2 5 1.4
## 6 setosa 3.9 0.4 5.4 1.7
## 7 setosa 3.4 0.3 4.6 1.4
## 8 setosa 3.4 0.2 5 1.5
## 9 setosa 2.9 0.2 4.4 1.4
## 10 setosa 3.1 0.1 4.9 1.5
## # ℹ 140 more rows
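There are other matching functions besides ends_with(). The sketch below tries a few of them. It assumes the dplyr package is installed and uses R's built-in iris data as a stand-in for the table returned by read_iris(); the variable names are ours, chosen for illustration.

```r
library(dplyr)

# Assumption: the built-in iris data stands in for read_iris()
iris <- as_tibble(datasets::iris)

petal_cols  <- iris |> select(starts_with("Petal"))   # names that begin with "Petal"
length_cols <- iris |> select(contains("Length"))     # names that contain "Length"
number_cols <- iris |> select(where(is.numeric))      # chosen by type, not by name
no_species  <- iris |> select(-Species)               # everything except Species

names(petal_cols)
names(length_cols)
names(number_cols)
names(no_species)
```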
Wooooo!
For more info see ?filter.
Filtering a table means you are choosing certain rows based upon one or more restrictions you establish. These restrictions are one or more TRUE/FALSE vectors that choose (TRUE) or skip (FALSE) a given row.
Below we filter the iris table so that we have only the rows where Sepal.Length is less than 5 (the measurements are in centimeters).
iris |> filter(Sepal.Length < 5)
## # A tibble: 22 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.9 3 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 4.6 3.4 1.4 0.3 setosa
## 5 4.4 2.9 1.4 0.2 setosa
## 6 4.9 3.1 1.5 0.1 setosa
## 7 4.8 3.4 1.6 0.2 setosa
## 8 4.8 3 1.4 0.1 setosa
## 9 4.3 3 1.1 0.1 setosa
## 10 4.6 3.6 1 0.2 setosa
## # ℹ 12 more rows
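It can help to look at the TRUE/FALSE vector itself. This is a small sketch, assuming dplyr is installed and again using the built-in iris data as a stand-in for read_iris(); the names keep and short are ours.

```r
library(dplyr)

# Assumption: the built-in iris data stands in for read_iris()
iris <- as_tibble(datasets::iris)

# The restriction is just a vector with one TRUE or FALSE per row
keep <- iris$Sepal.Length < 5
head(keep)

# TRUE counts as 1, so the sum is the number of rows that will be kept
sum(keep)

# filter() keeps exactly the rows where the restriction is TRUE
short <- iris |> filter(Sepal.Length < 5)
nrow(short) == sum(keep)
```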
We can make compound filters, say by also excluding the setosa species using the “not equal” operator, which in R is !=. Fun fact, the ! on its own is the “not” operator and is often pronounced “bang”.
iris |> filter(Sepal.Length < 5, Species != "setosa")
## # A tibble: 2 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.9 2.4 3.3 1 versicolor
## 2 4.9 2.5 4.5 1.7 virginica
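Separating restrictions with a comma means a row must pass all of them. Here is a sketch of that, plus two other ways to combine restrictions. It assumes dplyr is installed and uses the built-in iris data as a stand-in for read_iris(); the cutoff values are arbitrary picks for illustration.

```r
library(dplyr)

# Assumption: the built-in iris data stands in for read_iris()
iris <- as_tibble(datasets::iris)

# A comma between restrictions behaves like "and" (&)
with_comma <- iris |> filter(Sepal.Length < 5, Species != "setosa")
with_and   <- iris |> filter(Sepal.Length < 5 & Species != "setosa")
nrow(with_comma) == nrow(with_and)

# Use | for "or": keep rows that pass either restriction
extremes <- iris |> filter(Sepal.Length < 4.5 | Sepal.Length > 7.5)

# Use %in% to keep rows that match any value in a set
two_species <- iris |> filter(Species %in% c("versicolor", "virginica"))

extremes
two_species
```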
For more info see ?slice.
One other common row-choosing tool is “slicing”. Slicing doesn’t use logical restrictions to find the rows you want. Instead, slice is less elegant (but still very handy!) because you must provide the row numbers to slice out.
Here we slice the 3rd, 5th and 7th rows.
iris |> slice(c(3,5,7))
## # A tibble: 3 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.7 3.2 1.3 0.2 setosa
## 2 5 3.6 1.4 0.2 setosa
## 3 4.6 3.4 1.4 0.3 setosa
But we could also slice out a contiguous block of rows.
iris |> slice(15:20)
## # A tibble: 6 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.8 4 1.2 0.2 setosa
## 2 5.7 4.4 1.5 0.4 setosa
## 3 5.4 3.9 1.3 0.4 setosa
## 4 5.1 3.5 1.4 0.3 setosa
## 5 5.7 3.8 1.7 0.3 setosa
## 6 5.1 3.8 1.5 0.3 setosa
Be sure to check out the siblings of slice including slice_min(), slice_max(), slice_sample(), slice_head() and slice_tail().
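Here is a quick sketch of the siblings in action. It assumes dplyr is installed and uses the built-in iris data as a stand-in for read_iris(); the row counts and the seed are arbitrary choices for illustration.

```r
library(dplyr)

# Assumption: the built-in iris data stands in for read_iris()
iris <- as_tibble(datasets::iris)

first_three <- iris |> slice_head(n = 3)   # the first 3 rows
last_three  <- iris |> slice_tail(n = 3)   # the last 3 rows

# Rows with the largest or smallest value of a column.
# Ties are kept by default, so you can get back more than n rows.
longest  <- iris |> slice_max(Petal.Length, n = 1)
shortest <- iris |> slice_min(Petal.Length, n = 1)

# A random handful of rows; set.seed() makes the draw repeatable
set.seed(1)
random_five <- iris |> slice_sample(n = 5)

first_three
longest
random_five
```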
For more info see ?ggplot2 and ggplot2 resources.
There are a number of ways to make graphics (aka plots) with tabular data. We prefer to use ggplot2. ggplot2 uses a system of layering graphical elements to build up a plot. Every ggplot2 construct follows the same building pattern you see below. The + between layers literally adds the next layer to what you have already created.
a_base_layer() +
  some_layer_of_points() +
  maybe_a_layer_of_lines() +
  oh_how_about_pretty_theme_controls() +
  extra_annotations_I_want_to_add() +
  fiddle_with_the_axes()
You usually (but not always!) provide the base layer with the core data and some clues as to how the data variables get “mapped” to the elements of the plot. Mapping provides a road map to answer questions like “who goes on the x-axis?”, “who is on the y-axis?” or “is there any grouping of data to establish colors or point shapes?”.
Let’s start with a simple scatter plot. First the base layer. You’ll note that we specify the mapping using an intermediary function called aes() which is short for aesthetics.
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width))
OK! Something happened! But there’s no data. That’s because there is a wide range of plot types and ggplot2 expects us to decide which one we want. So let’s add a layer of plot symbols. When adding shapes to a plot (points, lines, text, etc) we add geometries. So to add points, we add geom_point().
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
Ah, there we are. Now it would be nice if the points were colored by Species. We edit the mapping to add a color aesthetic.
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
Aha! We not only gain coloration by Species, but also a legend (sometimes called a key). In fact, we can specify more aesthetic mappings (which may or may not change the legend).
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width,
                     color = Species, shape = Species)) +
  geom_point()
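Mapping an aesthetic to a variable is different from setting it to one fixed value. The sketch below contrasts the two. It assumes ggplot2 is installed and uses the built-in iris data as a stand-in for read_iris(); the color "steelblue" and the names mapped and fixed are arbitrary picks for illustration.

```r
library(ggplot2)

# Assumption: the built-in iris data stands in for read_iris()
iris <- datasets::iris

# Mapped: color sits inside aes(), so it follows the Species variable
# and a legend is drawn
mapped <- ggplot(data = iris,
                 mapping = aes(x = Sepal.Length, y = Sepal.Width,
                               color = Species)) +
  geom_point()

# Set: color sits outside aes(), so every point gets the same color
# and no legend is needed
fixed <- ggplot(data = iris,
                mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "steelblue")

mapped
fixed
```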
We can also split the three groups into separate plots - this is called faceting, like looking at a gem through its different faces (facets).
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width,
                     color = Species, shape = Species)) +
  geom_point() +
  facet_wrap(~Species)
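facet_wrap() takes a few arguments that control how the panels are arranged. Here is a minimal sketch of two of them, assuming ggplot2 is installed and using the built-in iris data as a stand-in for read_iris(); the names stacked and free_axes are ours.

```r
library(ggplot2)

# Assumption: the built-in iris data stands in for read_iris()
iris <- datasets::iris

points <- ggplot(data = iris,
                 mapping = aes(x = Sepal.Length, y = Sepal.Width,
                               color = Species)) +
  geom_point()

# Stack the three panels in a single column
stacked <- points + facet_wrap(~Species, ncol = 1)

# Let each panel choose its own axis ranges
free_axes <- points + facet_wrap(~Species, scales = "free")

stacked
free_axes
```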
Finally, we could add a caption for the data and a title.
ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width,
                     color = Species, shape = Species)) +
  geom_point() +
  facet_wrap(~Species) +
  labs(caption = "Attribution: E. Anderson 1935",
       title = "Fun with Iris Plants")
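Because + literally adds a layer to what you already have, a plot can be kept in a variable and built up one step at a time. The sketch below shows that, along with the "not always" case where the base layer is empty and the data goes to the point layer instead. It assumes ggplot2 is installed and uses the built-in iris data as a stand-in for read_iris(); the names p and q are ours.

```r
library(ggplot2)

# Assumption: the built-in iris data stands in for read_iris()
iris <- datasets::iris

# Start with the base layer and keep it in a variable
p <- ggplot(data = iris,
            mapping = aes(x = Sepal.Length, y = Sepal.Width))

# Each + returns a new plot object, so layers can be added step by step
p <- p + geom_point()
p <- p + labs(title = "Fun with Iris Plants")
p   # printing the object draws the plot

# An empty base layer, with the data and mapping handed to the point layer
q <- ggplot() +
  geom_point(data = iris,
             mapping = aes(x = Sepal.Length, y = Sepal.Width))
q
```

Both p and q draw the same cloud of points. Putting the data in the base layer is handy when every layer shares it, while putting it in a layer is handy when different layers draw from different tables.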