Tables

Tables show up in most data science programming languages, and R is no exception. It’s useful to visualize a table as a single-page spreadsheet, with columns of data and a header row of column names.

Here we load an example and print it (by simply typing its name).

iris = read_iris()
iris

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

You’ll note that it prints that it is a “tibble”. That’s a pun, I guess, on saying “table” 10 times fast.
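The read_iris() helper used above looks like it belongs to this tutorial rather than to a standard package. If you don't have it, here is a minimal sketch for following along, assuming the tibble package is installed. It converts R's built-in iris data frame; the name iris_tbl is just an illustrative choice.

```r
library(tibble)

# Assumption: start from the data frame that ships with R (datasets::iris)
# instead of the tutorial's own read_iris() helper.
iris_tbl <- as_tibble(datasets::iris)

# Typing the name prints the first 10 rows plus the column types.
iris_tbl
```

The built-in data frame has the same 150 rows and 5 columns shown in the printout above.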

There’s so much you can do with tables (like tibbles) - learn more here

Anyway, let’s identify the essential parts of a table.

Table parts

Below you can see that each row is considered a “record” or “observation” while each column is considered a “variable” or “field”. In addition we can see the data types. Row numbers are not actually part of the data, but they are printed as a visual convenience.

Figure: Essential table parts
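The same parts can be poked at in code. This is a small sketch using base R functions on a tibble built from the built-in iris data (iris_tbl is a stand-in name chosen for illustration, not something defined earlier).

```r
library(tibble)

iris_tbl <- as_tibble(datasets::iris)  # stand-in for the tutorial's table

nrow(iris_tbl)     # number of records (observations)
ncol(iris_tbl)     # number of variables (fields)
names(iris_tbl)    # the header row of column names

# One record: the first row, all columns
iris_tbl[1, ]

# One variable: a single column pulled out as a plain vector
head(iris_tbl$Sepal.Length)
```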

Other data types you may encounter include…

dbl A floating point number like 7.1 or 3.14
int A whole number (integer) like -14, 0 and 1001
chr A string-, text- or character-type. They all mean the same thing.
fct A factor, which is like a group membership
Date A date to the nearest whole day
dttm A date and time to the nearest, oh, geez, maybe microsecond?
list A list is a special container type - it can hold any old mix of things.

You may encounter other data types along the way.
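To see these types side by side, here is a hedged sketch that builds a tiny one-row table with one column per type. The table and its column names (types_tbl, a_dbl and so on) are made up for illustration.

```r
library(tibble)

# Hypothetical example table: one column per data type listed above
types_tbl <- tibble(
  a_dbl  = 3.14,
  an_int = 1001L,                        # the L suffix makes an integer
  a_chr  = "some text",
  a_fct  = factor("setosa"),
  a_date = as.Date("2020-01-31"),
  a_dttm = as.POSIXct("2020-01-31 12:30:00", tz = "UTC"),
  a_list = list(c(1, 2, 3))              # a list column holding a vector
)

# Printing shows the type abbreviation under each column name
types_tbl

# The (first) class of each column
sapply(types_tbl, function(column) class(column)[1])
```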

Selecting table columns

For more detailed info see ?select.

Often you only need a few of the variables (columns) in a table. To pick out the desired columns you select them.

Think of selection as what you do at a buffet: you choose the wings and the rice, but you don’t choose the kale and broccoli.

Two common ways to select are to (a) simply identify the columns by name or (b) provide a matching function that identifies the desired columns.

This is the first way - just name the columns you want.

iris |> select(Sepal.Width, Petal.Width)

## # A tibble: 150 × 2
##    Sepal.Width Petal.Width
##          <dbl>       <dbl>
##  1         3.5         0.2
##  2         3           0.2
##  3         3.2         0.2
##  4         3.1         0.2
##  5         3.6         0.2
##  6         3.9         0.4
##  7         3.4         0.3
##  8         3.4         0.2
##  9         2.9         0.2
## 10         3.1         0.1
## # ℹ 140 more rows

And here is the way to do it with a matching function. Here the matching function is looking at the end of each column name - looking for the pattern “Width”.

iris |> select( ends_with("Width") )

## # A tibble: 150 × 2
##    Sepal.Width Petal.Width
##          <dbl>       <dbl>
##  1         3.5         0.2
##  2         3           0.2
##  3         3.2         0.2
##  4         3.1         0.2
##  5         3.6         0.2
##  6         3.9         0.4
##  7         3.4         0.3
##  8         3.4         0.2
##  9         2.9         0.2
## 10         3.1         0.1
## # ℹ 140 more rows

You can even mix-and-match these tricks. Here we simply rearrange the order of the columns by combining a column name with two matching functions in a single select().

iris |> select(Species,               # one per line
               ends_with("Width"),    # to improve
               ends_with("Length"))   # readability

## # A tibble: 150 × 5
##    Species Sepal.Width Petal.Width Sepal.Length Petal.Length
##    <fct>         <dbl>       <dbl>        <dbl>        <dbl>
##  1 setosa          3.5         0.2          5.1          1.4
##  2 setosa          3           0.2          4.9          1.4
##  3 setosa          3.2         0.2          4.7          1.3
##  4 setosa          3.1         0.2          4.6          1.5
##  5 setosa          3.6         0.2          5            1.4
##  6 setosa          3.9         0.4          5.4          1.7
##  7 setosa          3.4         0.3          4.6          1.4
##  8 setosa          3.4         0.2          5            1.5
##  9 setosa          2.9         0.2          4.4          1.4
## 10 setosa          3.1         0.1          4.9          1.5
## # ℹ 140 more rows

Wooooo!
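A few other selection tricks are worth knowing. The sketch below assumes dplyr is loaded and uses a tibble built from the built-in iris data (iris_tbl and the result names are illustrative); starts_with(), contains() and the minus sign are standard selection helpers.

```r
library(dplyr)

iris_tbl <- tibble::as_tibble(datasets::iris)  # stand-in for the tutorial's table

# Keep columns whose names start with "Petal"
petal_only <- iris_tbl |> select(starts_with("Petal"))

# Keep columns whose names contain "Length" anywhere
lengths_only <- iris_tbl |> select(contains("Length"))

# Drop a column by negating it; everything else is kept
no_species <- iris_tbl |> select(-Species)

names(petal_only)
names(lengths_only)
names(no_species)
```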

Filtering table rows

For more info see ?filter.

Filtering a table means you are choosing certain rows based upon one or more restrictions you establish. These restrictions are one or more TRUE/FALSE (logical) vectors that choose (TRUE) or skip (FALSE) a given row.

Below we filter the iris table so that we have only the rows where Sepal.Length is less than 5 (the iris measurements are in centimeters).

iris |> filter(Sepal.Length < 5)

## # A tibble: 22 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          4.9         3            1.4         0.2 setosa 
##  2          4.7         3.2          1.3         0.2 setosa 
##  3          4.6         3.1          1.5         0.2 setosa 
##  4          4.6         3.4          1.4         0.3 setosa 
##  5          4.4         2.9          1.4         0.2 setosa 
##  6          4.9         3.1          1.5         0.1 setosa 
##  7          4.8         3.4          1.6         0.2 setosa 
##  8          4.8         3            1.4         0.1 setosa 
##  9          4.3         3            1.1         0.1 setosa 
## 10          4.6         3.6          1           0.2 setosa 
## # ℹ 12 more rows
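It can help to look at the TRUE/FALSE vector behind a filter like this one. A minimal sketch, assuming dplyr is loaded and using a tibble built from the built-in iris data (the names iris_tbl, keep and short_sepals are illustrative):

```r
library(dplyr)

iris_tbl <- tibble::as_tibble(datasets::iris)  # stand-in for the tutorial's table

# The restriction is just a vector of TRUE/FALSE, one value per row
keep <- iris_tbl$Sepal.Length < 5
head(keep)
sum(keep)   # number of TRUE values, i.e. rows that will be kept

# filter() keeps exactly the rows where the vector is TRUE
short_sepals <- iris_tbl |> filter(Sepal.Length < 5)
nrow(short_sepals) == sum(keep)
```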

We can make compound filters, say by also excluding the setosa species using the “not equal” operator, which in R is !=. Fun fact, ! (the “not” operator on its own) is often pronounced “bang”.

iris |> filter(Sepal.Length < 5, Species != "setosa")

## # A tibble: 2 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          4.9         2.4          3.3         1   versicolor
## 2          4.9         2.5          4.5         1.7 virginica
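Restrictions can be combined in other ways too. The following sketch (assuming dplyr is loaded; all object names are illustrative) shows that a comma behaves like AND, and how to use OR and set membership.

```r
library(dplyr)

iris_tbl <- tibble::as_tibble(datasets::iris)  # stand-in for the tutorial's table

# Comma-separated conditions are combined with AND; & says the same thing explicitly
with_comma <- iris_tbl |> filter(Sepal.Length < 5, Species != "setosa")
with_and   <- iris_tbl |> filter(Sepal.Length < 5 & Species != "setosa")
identical(with_comma, with_and)

# | means OR: keep rows with very short or very long sepals
extremes <- iris_tbl |> filter(Sepal.Length < 4.5 | Sepal.Length > 7.5)

# %in% tests membership in a set of values
two_species <- iris_tbl |> filter(Species %in% c("versicolor", "virginica"))
```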

Slicing table rows

For more info see ?slice.

One other common row-choosing tool is “slicing”. Slicing doesn’t use a logical restriction to find the rows you want. Instead, slice is less elegant (but still very handy!) because you must provide the row numbers to slice out.

Here we slice the 3rd, 5th and 7th rows.

iris |> slice(c(3,5,7))

## # A tibble: 3 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          4.7         3.2          1.3         0.2 setosa 
## 2          5           3.6          1.4         0.2 setosa 
## 3          4.6         3.4          1.4         0.3 setosa

But we could also slice out a contiguous block of rows.

iris |> slice(15:20)

## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.8         4            1.2         0.2 setosa 
## 2          5.7         4.4          1.5         0.4 setosa 
## 3          5.4         3.9          1.3         0.4 setosa 
## 4          5.1         3.5          1.4         0.3 setosa 
## 5          5.7         3.8          1.7         0.3 setosa 
## 6          5.1         3.8          1.5         0.3 setosa

Be sure to check out the siblings of slice including slice_min(), slice_max(), slice_sample(), slice_head() and slice_tail().
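Here is a quick, hedged tour of those siblings, assuming dplyr is loaded and using a tibble built from the built-in iris data (object names are illustrative). Note that slice_min() and slice_max() keep tied rows by default, so with_ties = FALSE is used to get exactly n rows.

```r
library(dplyr)

iris_tbl <- tibble::as_tibble(datasets::iris)  # stand-in for the tutorial's table

# First and last rows of the table
first_three <- iris_tbl |> slice_head(n = 3)
last_three  <- iris_tbl |> slice_tail(n = 3)

# Rows with the largest / smallest values of a variable
longest  <- iris_tbl |> slice_max(Sepal.Length, n = 3, with_ties = FALSE)
shortest <- iris_tbl |> slice_min(Sepal.Length, n = 3, with_ties = FALSE)

# Random rows; these differ from run to run unless a seed is set
set.seed(1)
random_five <- iris_tbl |> slice_sample(n = 5)
```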

Plotting tables

For more info see ?ggplot2 and ggplot2 resources.

There are a number of ways to make graphics (aka plots) with tabular data. We prefer to use ggplot2. ggplot2 uses a system of layering graphical elements to build up a plot. Every ggplot2 construct follows the same building pattern you see below. The + between layers literally adds the next layer to what you have already created.

a_base_layer() +
  some_layer_of_points() +
  maybe_a_layer_of_lines() +
  oh_how_about_pretty_theme_controls() + 
  extra_annotations_I_want_to_add() +
  fiddle_with_the_axes()
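The function names in that pattern are placeholders. As a rough sketch of how the adding works with real functions (ggplot(), aes(), geom_point() and theme_bw(), which the following sections walk through), you can store a plot in a variable and add to it one step at a time. The names iris_tbl and p are arbitrary choices for this example.

```r
library(ggplot2)

iris_tbl <- tibble::as_tibble(datasets::iris)  # stand-in for the tutorial's table

# Start with a base layer and keep it in a variable
p <- ggplot(data = iris_tbl,
            mapping = aes(x = Sepal.Length, y = Sepal.Width))

# Each + returns a new plot object with one more component added
p <- p + geom_point()
p <- p + theme_bw()

# Printing the object draws the plot
p
```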

The base layer

You usually (but not always!) provide the base layer with the core data and some clues as to how the data variables get “mapped” to the elements of the plot. Mapping provides a road map to answer questions like “who goes on the x-axis?”, “who is on the y-axis?” or “is there any grouping of data to establish colors or point shapes?”.

Let’s start with a simple scatter plot. First the base layer. You’ll note that we specify the mapping using an intermediary function called aes() which is short for aesthetics.

ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width))

OK! Something happened! But there’s no data. That’s because there is a wide range of plot types and ggplot2 expects us to decide which one we want. So let’s add a layer of plot symbols. When adding shapes to a plot (points, lines, text, etc) we add geometries. So to add points, we add geom_point().

ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point()

Ah, there we are. Now it would be nice if the points were colored by Species. We edit the mapping to add a color aesthetic.

ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point()

Aha! We not only gain coloration by Species, but also a legend (sometimes called a key). In fact, we can specify more aesthetic mappings (which may or may not change the legend).

ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width, 
                     color = Species, shape = Species)) + 
  geom_point()

We can also split the three groups into separate plots - this is called faceting, like looking at a gem through its different faces (facets).

ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width, 
                     color = Species, shape = Species)) + 
  geom_point() + 
  facet_wrap(~Species)

Finally, we could add a caption for the data and a title.

ggplot(data = iris,
       mapping = aes(x = Sepal.Length, y = Sepal.Width, 
                     color = Species, shape = Species)) + 
  geom_point() + 
  facet_wrap(~Species) +
  labs(caption = "Attribution: E. Anderson 1935",
       title = "Fun with Iris Plants")
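The building pattern also mentioned theme controls and fiddling with the axes. A minimal sketch of both, under the assumption that you want nicer axis titles and one of ggplot2's built-in themes: labs() accepts x and y alongside title and caption, and theme_bw() swaps the default gray background for a white one. The names iris_tbl and final_plot are illustrative.

```r
library(ggplot2)

iris_tbl <- tibble::as_tibble(datasets::iris)  # stand-in for the tutorial's table

final_plot <- ggplot(data = iris_tbl,
       mapping = aes(x = Sepal.Length, y = Sepal.Width,
                     color = Species, shape = Species)) +
  geom_point() +
  facet_wrap(~Species) +
  labs(caption = "Attribution: E. Anderson 1935",
       title = "Fun with Iris Plants",
       x = "Sepal length (cm)",     # axis titles are set in labs() too
       y = "Sepal width (cm)") +
  theme_bw()                        # one of ggplot2's built-in themes

final_plot
```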