-
Notifications
You must be signed in to change notification settings - Fork 0
/
viz_gg.qmd
309 lines (211 loc) · 11.4 KB
/
viz_gg.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
---
title: "Visualizing Data"
execute:
warning: false
---
## Grammar of Graphics
Both [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} have a plotting library based on [The Grammar of Graphics](https://link.springer.com/book/10.1007/0-387-28695-0) by Leland Wilkinson. The [{{< fa brands python >}} Python]{.py} package (`plotnine`) is actually directly based on the [{{< fa brands r-project >}} R]{.r} package (`ggplot2`) so their internal syntax is _very_ similar. In fact the only serious differences between the two languages' ggplot operations are those that derive from larger syntax and format differences.
Note that in the following examples we **_will not_** namespace [{{< fa brands r-project >}} R]{.r} `ggplot2` functions (e.g., `ggplot2::aes`) for convenience. Any function not namespaced in the R examples producing graphs can be assumed to be exported from `ggplot2`.
## Library & Data Loading
Begin by loading any needed libraries and reading in an external data file for use in downstream examples.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Load the `ggplot2` and `dplyr` libraries as well as our vertebrate data.
```{r r-start, warning = F, message = F}
# Load needed library
library(ggplot2)
library(dplyr)
# Load data
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))
# Keep only rows where species and year are *not* NA
complete_r <- dplyr::filter(.data = vert_r,
!is.na(species) & nchar(species) != 0 &
!is.na(year) & nchar(year) != 0)
# Group data by species and year
grp_r <- dplyr::group_by(.data = complete_r,
year, species)
# Average weight by species and year
avg_r <- dplyr::summarize(.data = grp_r,
mean_wt = mean(weight_g, na.rm = T))
# Check out first few rows
head(avg_r, n = 5)
```
Remember that we use an `!` in [{{< fa brands r-project >}} R]{.r} to negate a conditional masking function like `is.na`.
Note that the `summarize` function drops all columns that either it doesn't create or that are not used as grouping variables.
## [{{< fa brands python >}} Python]{.py}
Load the `plotnine`, `os`, and `pandas` libraries as well as our vertebrate data.
```{python py-libs}
# Load needed library
import os
import plotnine as p9
import pandas as pd
# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))
# Keep only rows where species and year are *not* NA
complete_py = vert_py[(~pd.isnull(vert_py["species"])) & (~pd.isnull(vert_py["year"]))]
# Group data by species and year
grp_py = complete_py.groupby(["year", "species"])
# Average weight by species and year
avg_py = grp_py["weight_g"].mean().reset_index(name = "mean_wt")
# Check out first few rows
avg_py.head()
```
Remember that we use a `~` in [{{< fa brands python >}} Python]{.py} to negate a conditional masking function like `isnull`.
:::
## Core Components
There are three fundamental components to ggplots:
1. <u>Data</u> [variable(s)]{.py}/[object(s)]{.r} used in the graph
2. <u>Aesthetics</u> (i.e., which column [labels]{.py}/[names]{.r} are assigned to graph components)
3. <u>Geometries</u> (i.e., defining the type of plot)
### Data & Aesthetics
We can create an empty graph with correctly labeled axes but without any data by defining the data and aesthetics but neglecting to include any geometry. Make a graph where year is on the X-axis (horizontal) and mean weight is on the Y-axis (vertical).
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Column names _cannot be_ in quotes in the `aes` function.
```{r r-gg1}
# Create graph
ggplot(data = avg_r, mapping = aes(x = year, y = mean_wt))
```
## [{{< fa brands python >}} Python]{.py}
Column labels supplies to the `aes` function _need to be_ in quotes.
```{python py-gg1}
#| warning: false
# Create graph
(p9.ggplot(data = avg_py, mapping = p9.aes(x = "year", y = "mean_wt")))
```
Note that we need to wrap our Python ggplot in parentheses to avoid errors.
:::
As we alluded to above, the `ggplot` function with data and mapped aesthetics is enough to create the correct axis labels and tick marks but doesn't actually put our data on the graphing area. For that, we'll need to add a geometry.
### Geometries
All geometry functions--in either language--take the form of `geom_*` where `*` is name of the desired chart type (e.g., `geom_line` adds a line, `geom_bar` adds bars, etc.). In order to add geometries onto our plot--again, in either language--we use the `+` operator. Note that style guides suggest ending each line of a ggplot with a `+` and including each new component as their own line below. This keeps even very complicated graphs relatively human-readable.
Let's make these graphs into scatter plots by adding a point geometry.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
```{r r-gg2}
# Create graph
ggplot(data = avg_r, mapping = aes(x = year, y = mean_wt)) +
geom_point()
```
## [{{< fa brands python >}} Python]{.py}
```{python py-gg2}
#| warning: false
# Create graph
(p9.ggplot(data = avg_py, mapping = p9.aes(x = "year", y = "mean_wt")) +
p9.geom_point()
)
```
:::
Note that in either language the `geom_point` function does not need either data or aesthetics because it "inherits" them from the `ggplot` function! You _can_ specify aesthetics (or data!) for a particular geometry but it is simpler to specify it once if you're okay with all subsequent plot components using the same data/aesthetics.
Let's practice a little further by making the color of the points dependent upon species.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
```{r r-gg3}
# Create graph
ggplot(data = avg_r, mapping = aes(x = year, y = mean_wt)) +
geom_point(mapping = aes(color = species))
```
Note that we could specify the color aesthetic in the `ggplot` aesthetics!
## [{{< fa brands python >}} Python]{.py}
```{python py-gg3}
#| warning: false
# Create graph
(p9.ggplot(data = avg_py, mapping = p9.aes(x = "year", y = "mean_wt")) +
p9.geom_point(mapping = p9.aes(color = "species"))
)
```
Note that we could specify the color aesthetic in the `ggplot` aesthetics!
:::
### Iterative Revision
One of the real strengths of ggplots is that you can preserve part of your ideal graph as a [variable]{.py}/[object]{.r} and then add to it later. This saves you from needing to re-type a consistent `ggplot` function when all you really want to do is experiment with different geometries
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Create the top level of the graph and assign it to an object. Then--separately--add a line geometry.
```{r r-gg4}
# Create graph
gg_r <- ggplot(data = avg_r, mapping = aes(x = year, y = mean_wt, color = species))
# Add the line geometry
gg_r +
geom_line()
```
## [{{< fa brands python >}} Python]{.py}
Create the top level of the graph and assign it to a variable. Then--separately--add a line geometry.
```{python py-gg4}
#| warning: false
# Create graph
gg_py = p9.ggplot(data = avg_py, mapping = p9.aes(x = "year", y = "mean_wt", color = "species"))
# Add the line geometry
(gg_py +
p9.geom_line())
```
:::
## Customizing Themes
Once you have a graph that has the desired content mapped to various aesthetics and uses the geometry that you want, it's time to dive into the optional fourth component of grammar of graphics plots: themes! All plot format components from the size of the font in the axes to the gridline width are controlled by theme elements.
To emphasize the theme modification examples below, let's assign all components of the above graph into a new [variable]{.py}/[object]{.r}.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
```{r r-gg5}
# Make the line graph object
line_r <- gg_r +
geom_line()
```
## [{{< fa brands python >}} Python]{.py}
```{python py-gg5}
#| warning: false
# Make the line graph variable
line_py = gg_py + p9.geom_line()
```
:::
### Built-In Themes
To begin, [plotnine]{.py}/[ggplot2]{.r} both come with pre-built themes that change a swath of theme elements all at once. If one of these fits your visualization needs then you don't need to worry about customizing the nitty gritty yourself which can be a real time-saver.
Let's add the built-in 'black and white' theme to our existing graph using the `theme_bw` function.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
```{r r-gg6}
# Add the black and white theme
line_r +
theme_bw()
```
## [{{< fa brands python >}} Python]{.py}
```{python py-gg6}
#| warning: false
# Add the black and white theme
(line_py +
p9.theme_bw())
```
:::
### Fully Custom Themes
If we'd rather, we can use the `theme` function and manually specify particular elements to change ourselves! Each element requires a helper function that matches the category of element beind edited. For instance, text elements get changed with `element_text()` while line elements with `element_line`. When we want to _remove_ an element we can use `element_blank`. Let's increase the font size for our axis tick labels and titles.
We'll also use the `labs` function to customize our axis titles slightly.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Note that the `theme` arguments use periods (`.`) between words.
```{r r-gg7}
# Customize theme more fully
line_r +
labs(x = "Year", y = "Average Weight (g)") +
theme(panel.background = element_blank(),
axis.line = element_line(color = "black"),
axis.title = element_text(size = 18),
axis.text = element_text(size = 14))
```
## [{{< fa brands python >}} Python]{.py}
Note that the `theme` arguments use underscores (`_`) between words to be consistent with [{{< fa brands python >}} Python]{.py} syntax.
```{python py-gg7}
#| warning: false
# Customize theme more fully
(line_py +
p9.labs(x = "Year", y = "Average Weight (g)") +
p9.theme(panel_background = p9.element_blank(),
axis_line = p9.element_line(color = "black"),
axis_title = p9.element_text(size = 18),
axis_text = p9.element_text(size = 14)))
```
:::
## Continuing to Explore
This lesson was designed to showcase the similarity between [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r}, _not_ to provide an exhaustive primer on all things ggplot. There are a lot of really cool graphs you can make with these tools and hopefully this website makes you feel better prepared to translate the knowledge you have from one language into the other!
If you are new to `ggplot`, I recommend searching out "faceting" graphs in particular as this can be a particularly powerful tool when you have many groups within your data [variable]{.py}/[object]{.r}.
### Additional Resources
- [{{< fa brands python >}} Python]{.py} -- The Carpentries covers `plotnine` in their ['Python for Ecologists' lesson](https://datacarpentry.org/python-ecology-lesson/07-visualization-ggplot-python.html)
- [{{< fa brands r-project >}} R]{.r} -- NCEAS Scientific Computing team has a nice [Tidyverse Workshop](https://nceas.github.io/scicomp-workshop-tidyverse/visualize.html) with a particular emphasis on `ggplot2`
- [{{< fa brands r-project >}} R]{.r} -- The Carpentries also covers `ggplot2` in their ['R for Ecologists' lesson](https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html)
- [{{< fa brands r-project >}} R]{.r} -- [Sam Csik](https://samanthacsik.github.io/) designed a ["Data Visualization & Communication"](https://samanthacsik.github.io/EDS-240-data-viz/course-materials/week5.html) course for the UCSB Master of Environmental Data Science (MEDS) program