---
date-modified: 2024-05-27
---
# Egocentric networks
In egocentric social network analysis (ESNA, for short, in this book), instead of dealing with a single network, we have as many networks as participants in the study. Egos--the main study subjects--are analyzed from the perspective of their local social network. For a more extended view of ESNA, look at Raffaele Vacca's <a href="https://raffaelevacca.github.io/egocentric-r-book/" target="_blank">"*Egocentric network analysis with R*"</a>.
In this chapter, I show how to work with one particular type of ESNA data: information generated by the tool <a href="https://networkcanvas.com/" target="_blank">Network Canvas</a>. You can download an "artificial" ZIP file containing the outputs from a Network Canvas project [here](data-raw/networkCanvasExport-fake.zip)[^netcanvas-file]. We assume the ZIP file was extracted to the folder `data-raw/egonets`. You can go ahead and extract the ZIP by point-and-click or use the following R code to automate the process:
[^netcanvas-file]: I thank [Jacqueline M. Kent-Marvick](https://scholar.google.com/citations?user=Uht4YbkAAAAJ){target="_blank"}, who provided me with what I used as a baseline to generate the artificial Network Canvas export.
```{r echo=FALSE, warning=FALSE, message=FALSE}
# To make the file available, create the target folder if it is missing
if (!dir.exists("docs/data-raw"))
  dir.create("docs/data-raw", recursive = TRUE)
file.copy(
  "data-raw/networkCanvasExport-fake.zip",
  "docs/data-raw/networkCanvasExport-fake.zip"
)
knitr::opts_chunk$set(collapse = TRUE)
```
```{r unzip}
unzip(
  zipfile = "data-raw/networkCanvasExport-fake.zip",
  exdir   = "data-raw/egonets"
)
```
This will extract all the files in `networkCanvasExport-fake.zip` to the folder `data-raw/egonets`. Let's take a look at the first few files:
```{r part-01-07-egonets-files}
head(list.files(path = "data-raw/egonets"))
```
As you can see, for each ego in the dataset, there are four files:
- `...attributeList_Person.csv`: Attributes of the alters.
- `...edgeList_Knows.csv`: Edgelist indicating the ties between the alters.
- `...ego.csv`: Information about the egos.
- `...graphml`: And a <a href="https://en.wikipedia.org/wiki/GraphML" target="_blank">`graphml` file</a> that contains the egonets.
The next sections will illustrate, file by file, how to read the information into R, apply any required processing, and store the information for later use. We start with the `graphml` files.
## Network files (graphml)
The `graphml` files can be read directly with `igraph`'s `read_graph` function. The key is to take advantage of R's lists: instead of writing the same block of code over and over, we store the data in a list and process all its elements at once.
Just like any data-reading function, `read_graph` requires a path to the network file. **The function we will use to list the required files is `list.files()`**:
```{r netread, message=FALSE}
# We start by loading igraph
library(igraph)

# Listing all the graphml files
graph_files <- list.files(
  path       = "data-raw/egonets", # Where are these files
  pattern    = "\\.graphml$",      # A regular expression matching only graphml files
  full.names = TRUE                # And we make sure we use the full name
                                   # (path). Otherwise, we would only get names.
)

# Taking a look at the first three files we got
graph_files[1:3]

# Applying igraph's read_graph
graphs <- lapply(
  X      = graph_files, # List of files to read
  FUN    = read_graph,  # The function to apply
  format = "graphml"    # Argument passed to read_graph
)
```
If the operation succeeded, the previous code block should generate a list of `igraph` objects named `graphs`. Let's take a peek at the first two:
```{r}
graphs[[1]]
graphs[[2]]
```
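One detail worth keeping in mind when listing files: the `pattern` argument of `list.files()` is a regular expression, not a shell-style wildcard. The following is a minimal sketch using made-up file names in a temporary folder (the names `ego1.graphml`, `ego1_graphml_notes.txt`, and `ego1_ego.csv` are illustrative only, not part of the Network Canvas export). It shows an anchored regular expression and base R's `glob2rx()`, which translates a wildcard into a regular expression.

```r
# Creating a temporary folder with a few hypothetical file names
tmp <- file.path(tempdir(), "pattern-demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(
  tmp, c("ego1.graphml", "ego1_graphml_notes.txt", "ego1_ego.csv")
))

# An anchored regular expression: a literal dot followed by "graphml"
# at the very end of the file name
anchored <- list.files(tmp, pattern = "\\.graphml$")

# glob2rx() turns the wildcard "*.graphml" into the regex "^.*\\.graphml$"
from_glob <- list.files(tmp, pattern = glob2rx("*.graphml"))

anchored
from_glob
```

Both calls should list only the file with the `.graphml` extension, skipping the text file that merely contains the word "graphml".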
As always, one of the first things we do with networks is visualize them. We will use the <a href="https://cran.r-project.org/package=netplot" target="_blank">`netplot` R package</a> (by yours truly) to draw the figures:
```{r plot-nets, message=FALSE, warning=FALSE}
library(netplot)
library(gridExtra)

# Graph layout is random
set.seed(1231)

# grid.arrange allows putting multiple netplot graphs on the same page
grid.arrange(
  nplot(graphs[[1]]),
  nplot(graphs[[2]]),
  nplot(graphs[[3]]),
  nplot(graphs[[4]]),
  ncol = 2, nrow = 2
)
```
Great! Since nodes in our network have features, we can add a little color. We will use the `eat_with_2` variable, coded as `TRUE` or `FALSE`. Vertex colors can be specified using the `vertex.color` argument of the `nplot` function. In our case, we will specify colors passing a vector with length equal to the number of nodes in the graph. Furthermore, since we will be doing this multiple times, it is worthwhile writing a function:
```{r plot-nets-colored}
# A function to color by the eat_with_2 variable
color_it <- function(net) {

  # Extracting eat_with_2, a logical vertex attribute (TRUE or FALSE)
  eatswith <- V(net)$eat_with_2

  # Mapping TRUE to purple and FALSE to darkgreen
  ifelse(eatswith, "purple", "darkgreen")

}
```
This function takes a single argument, a network, and returns a vector of colors with one entry per vertex. Vertex attributes in `igraph` can be accessed with the `V(...)$...` syntax. For this example, to access the attribute `eat_with_2` in the network `net`, we type `V(net)$eat_with_2`. Finally, individuals with `eat_with_2` equal to `TRUE` will be colored `purple`; otherwise, if equal to `FALSE`, they will be colored `darkgreen`. Before plotting the networks, let's see what we get when we access the `eat_with_2` attribute in the first graph:
```{r}
V(graphs[[1]])$eat_with_2
```
A logical vector. Now let's redraw the figures:
```{r part-01-07-plot-nets-colored}
grid.arrange(
  nplot(graphs[[1]], vertex.color = color_it(graphs[[1]])),
  nplot(graphs[[2]], vertex.color = color_it(graphs[[2]])),
  nplot(graphs[[3]], vertex.color = color_it(graphs[[3]])),
  nplot(graphs[[4]], vertex.color = color_it(graphs[[4]])),
  ncol = 2, nrow = 2
)
```
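The color mapping inside `color_it()` can be tried on a plain logical vector, without any network. The sketch below is only an illustration: the helper `color_by_flag()` and its `na_color` argument are hypothetical names introduced here. It shows one way to handle alters with a missing answer, since `ifelse()` returns `NA` for them.

```r
# Hypothetical helper: maps TRUE/FALSE to two colors and NA to a third one
color_by_flag <- function(flag, na_color = "gray") {
  cols <- ifelse(flag, "purple", "darkgreen")
  cols[is.na(flag)] <- na_color # ifelse() leaves NA where flag is NA
  cols
}

# A made-up attribute vector with one missing value
flags   <- c(TRUE, FALSE, NA, TRUE)
vcolors <- color_by_flag(flags)
vcolors
```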
Since we will usually be dealing with many egonets, you may want to draw each network separately; the following code block does that. First, if needed, it creates a folder to store the figures. Then, using the `lapply` function, it uses `netplot::nplot()` to draw each network, adds a legend, and saves the figure as `.../graphml_[number].png`, where `[number]` goes from `01` to the total number of networks in `graphs`.
```{r plot-net-all, eval = FALSE}
if (!dir.exists("egonets/figs/egonets"))
  dir.create("egonets/figs/egonets", recursive = TRUE)

lapply(seq_along(graphs), function(i) {

  # Creating the device
  png(sprintf("egonets/figs/egonets/graphml_%02i.png", i))

  # Drawing the plot
  p <- nplot(
    graphs[[i]],
    vertex.color = color_it(graphs[[i]])
  )

  # Adding a legend (gpar comes from the grid package)
  p <- nplot_legend(
    p,
    labels        = c("eats with: FALSE", "eats with: TRUE"),
    pch           = 21,
    packgrob.args = list(side = "bottom"),
    gp            = grid::gpar(
      fill = c("darkgreen", "purple")
    ),
    ncol = 2
  )

  print(p)

  # Closing the device
  dev.off()

})
```
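The file names in the previous chunk come from `sprintf()` with the `%02i` format, which pads an integer with zeros up to two digits. A small sketch (the indices are arbitrary):

```r
# seq_len() returns integers, as seq_along(graphs) does
idx <- seq_len(3)

# %02i pads each integer with a leading zero up to two digits
fnames <- sprintf("egonets/figs/egonets/graphml_%02i.png", idx)
fnames
```

With 100 or more networks, `%02i` still works, but the names stop having the same width; a format such as `%03i` would keep them aligned and easy to sort.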
## Person files
Like before, we list the files ending in `Person.csv` (with the full path) and read them into R. While R has the function `read.csv`, here I use the function `fread` from the <a href="https://cran.r-project.org/package=data.table" target="_blank">`data.table`</a> R package. Alongside `dplyr`, `data.table` is one of the most popular data-wrangling tools in R. Besides syntax, the biggest difference between the two is performance; `data.table` is typically much faster, which makes it a great alternative for handling large datasets. The following code block loads the package, lists the files, and reads them into R.
```{r read-person}
# Loading data.table
library(data.table)

# Listing the files (pattern is a regular expression)
person_files <- list.files(
  path       = "data-raw/egonets",
  pattern    = "Person\\.csv$",
  full.names = TRUE
)

# Loading all into a single list
persons <- lapply(person_files, fread)

# Looking into the first element
persons[[1]]
```
A common task is adding an identifier to each dataset in `persons` so we know which ego it belongs to. Again, the `lapply` function is our friend:
```{r part-01-07-adding-ids}
persons <- lapply(seq_along(persons), function(i) {
  persons[[i]][, dataset_num := i]
})
```
In `data.table`, variables are created using the `:=` operator. The previous code chunk is equivalent to this:
```r
for (i in seq_along(persons)) {
  persons[[i]]$dataset_num <- i
}
```
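To see the `seq_along()` pattern in isolation, here is a minimal base R sketch with two made-up data frames (the object `toy` and its columns are hypothetical). It adds the position of each element in the list as a new column, once with `lapply` and once with a `for` loop, and checks that both give the same result.

```r
# Two hypothetical alter datasets
toy <- list(
  data.frame(name = c("a", "b")),
  data.frame(name = c("c", "d", "e"))
)

# Version 1: lapply over the positions, returning the modified element
with_lapply <- lapply(seq_along(toy), function(i) {
  d <- toy[[i]]
  d$dataset_num <- i
  d
})

# Version 2: a for loop modifying a copy of the list
with_loop <- toy
for (i in seq_along(with_loop)) {
  with_loop[[i]]$dataset_num <- i
}

identical(with_lapply, with_loop)
```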
If needed, we can transform the list `persons` into a `data.table` object (i.e., a single `data.frame`) using the `rbindlist` function[^rbindlist]. The next code block uses that function to combine the `data.table`s into a single dataset.
[^rbindlist]: Although the two are not identical, `rbindlist(persons)` (almost always) yields the same result as `do.call(rbind, persons)`, which calls `rbind` using the elements of `persons` as its arguments.
```{r}
# Combining the datasets
persons <- rbindlist(persons)
persons
```
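The base R route mentioned in the footnote can be sketched with plain data frames. The data below are made up; note that `rbind` expects every element of the list to have the same columns.

```r
# Hypothetical list of alter datasets, each already carrying its id
alters <- list(
  data.frame(name = c("a", "b"), age = c(31, 45), dataset_num = 1L),
  data.frame(name = "c", age = 27, dataset_num = 2L)
)

# do.call() calls rbind(alters[[1]], alters[[2]]), stacking the rows
stacked <- do.call(rbind, alters)
stacked
```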
Now that we have a single dataset, we can do some data exploration. For example, we can use the package `ggplot2` to draw a histogram of alters' ages.
```{r part-01-07-ggplot-ages}
# Loading the ggplot2 package
library(ggplot2)

# Histogram of age
ggplot(persons, aes(x = age)) +            # Starting off the plot
  geom_histogram(fill = "purple") +        # Adding a histogram
  labs(x = "Age", y = "Frequency") +       # Changing the x/y axis labels
  labs(title = "Alters' Age Distribution") # Adding a title
```
## Ego files
The ego files contain information about egos (duh!). Again, we will read them all at once using `list.files` + `lapply`:
```{r read-ego}
# Listing files ending with ego.csv (pattern is a regular expression)
ego_files <- list.files(
  path       = "data-raw/egonets",
  pattern    = "ego\\.csv$",
  full.names = TRUE
)

# Reading the files with fread
egos <- lapply(ego_files, fread)

# Combining them
egos <- rbindlist(egos)
head(egos)
```
A cool thing about `data.table` is that, within square brackets, we can manipulate the data referring to the variables directly. For example, if we wanted to calculate the difference between `sessionFinish` and `sessionStart`, using base R we would do the following:
```r
egos$total_time <- egos$sessionFinish - egos$sessionStart
```
Whereas with `data.table`, variable creation is much more straightforward (notice that instead of using `<-` or `=` to assign a variable, we use the `:=` operator):
```{r egos-time}
# How much time?
egos[, total_time := sessionFinish - sessionStart]
```
We can also visualize this using `ggplot2`:
```{r egos-time-plot}
ggplot(egos, aes(x = total_time)) +
  geom_histogram() +
  labs(x = "Time in minutes", y = "Count") +
  labs(title = "Total time spent by egos")
```
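Subtracting two date-time columns returns a `difftime` object whose units (seconds, minutes, hours, and so on) are chosen automatically, so the axis label above is only right if the result happens to be in minutes. The following hedged sketch, with made-up timestamps, shows how `difftime()` can set the units explicitly; whether the real `sessionStart` and `sessionFinish` columns are read as date-times is an assumption here.

```r
# Made-up session timestamps (UTC to avoid time-zone surprises)
start  <- as.POSIXct(c("2024-05-01 10:00:00", "2024-05-01 11:00:00"), tz = "UTC")
finish <- as.POSIXct(c("2024-05-01 10:30:00", "2024-05-01 13:00:00"), tz = "UTC")

# Plain subtraction: units are picked automatically
auto <- finish - start
units(auto)

# difftime() with explicit units, converted to a plain numeric vector
minutes <- as.numeric(difftime(finish, start, units = "mins"))
minutes
```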
## Edgelist files
As I mentioned earlier, since we are reading the `graphml` files, the edgelists may not be needed. Nevertheless, the process to import the edgelist files into R is the same one we have been applying: list the files and read them all at once using `lapply`:
```{r read-edgelist}
# Listing all files ending in Knows.csv (pattern is a regular expression)
edgelist_files <- list.files(
  path       = "data-raw/egonets",
  pattern    = "Knows\\.csv$",
  full.names = TRUE
)

# Reading all files at once
edgelists <- lapply(edgelist_files, fread)
```
To avoid confusion, we can also add ids corresponding to the file number. Once we do that, we can combine all files into a single `data.table` object using `rbindlist`:
```{r part-01-07-adding-ids-edgelist}
edgelists <- lapply(seq_along(edgelists), function(i) {
  edgelists[[i]][, dataset_num := i]
})

edgelists <- rbindlist(edgelists)
head(edgelists)
```
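Once the id column is in place, the combined table can always be separated again. A minimal base R sketch with a made-up edgelist follows; the column names `from` and `to` are assumptions, not necessarily those of the Network Canvas export.

```r
# Hypothetical combined edgelist from two egos
el <- data.frame(
  from        = c("a", "a", "x"),
  to          = c("b", "c", "y"),
  dataset_num = c(1L, 1L, 2L)
)

# split() returns one data frame per value of dataset_num
el_by_ego <- split(el, el$dataset_num)

# Number of ties in each egonet
n_ties <- sapply(el_by_ego, nrow)
n_ties
```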
## Putting all together
In this last part of the chapter, we will use the `igraph` and `ergm` packages to generate features (covariates, controls, independent variables, or whatever you call them) at the ego-network level. Once again, the `lapply` function is our friend.
### Generating statistics using igraph
The `igraph` R package has multiple high-performing routines to compute graph-level statistics. For now, we will focus on the following statistics: vertex count, edge count, number of isolates, transitivity, and modularity of the communities detected with the edge-betweenness algorithm:
```{r gen-stats}
net_stats <- lapply(graphs, function(g) {

  # Detecting communities (edge betweenness), used to compute modularity
  groups <- cluster_edge_betweenness(g)

  # Computing the stats
  data.table(
    size      = vcount(g),
    edges     = ecount(g),
    nisolates = sum(degree(g) == 0),
    transit   = transitivity(g, type = "global"),
    modular   = modularity(groups)
  )

})
```
Observe we count isolates using the `degree()` function. We can combine the statistics into a single `data.table` using the `rbindlist` function:
```{r}
net_stats <- rbindlist(net_stats)
head(net_stats)
```
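To make these statistics concrete, the following base R sketch computes three of them by hand on a tiny made-up undirected graph stored as an adjacency matrix. It is an illustration of the definitions, not of how `igraph` computes them internally.

```r
# Made-up graph with 5 nodes: a triangle (1-2-3), a pendant (3-4),
# and an isolate (5)
A <- matrix(0, nrow = 5, ncol = 5)
ties <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
A[ties] <- 1
A[ties[, 2:1]] <- 1 # Symmetrizing, since the graph is undirected

deg       <- rowSums(A)    # Degree of each node
nedges    <- sum(A) / 2    # Each tie is counted twice in A
nisolates <- sum(deg == 0) # Nodes without ties

# Global transitivity: 3 x triangles / connected triples
ntriangles <- sum(diag(A %*% A %*% A)) / 6
ntriples   <- sum(choose(deg, 2))
transit    <- 3 * ntriangles / ntriples

c(edges = nedges, nisolates = nisolates, transit = transit)
```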
### Generating statistics based on ergm
The `ergm` R package has a much larger set of graph-level statistics we can add to our models.[^ergm-stats] The key to generating statistics based on the `ergm` package is the `summary_formula` function. Before we start using that function, we first need to convert the `igraph` networks to `network` objects, which are the native object class for the `ergm` package. We use the <a href="https://cran.r-project.org/package=intergraph" target="_blank">`intergraph`</a> R package for that, and in particular, the `asNetwork` function:
[^ergm-stats]: There's an obvious reason: ERGMs are all about graph-level statistics!
```{r part-01-07-igraph-to-network}
# Loading the required packages
library(intergraph)
library(ergm)
# Converting all "igraph" objects in graphs to "network" objects
graphs_network <- lapply(graphs, asNetwork)
```
With the network objects ready, we can proceed to compute graph-level statistics using the `summary_formula` function. Here we will only look into the number of triangles, gender homophily, and healthy-diet homophily:
```{r part-01-07-gen-stats-ergm}
net_stats_ergm <- lapply(graphs_network, function(n) {

  # Computing the statistics
  s <- summary_formula(
    n ~ triangles +
      nodematch("gender_1") +
      nodematch("healthy_diet")
  )

  # Saving them as a data.table object
  data.table(
    triangles       = s[1],
    gender_homoph   = s[2],
    healthyd_homoph = s[3]
  )

})
```
Once again, we use `rbindlist` to combine all the network statistics into a single `data.table` object:
```{r part-01-07-combine-ergm-stats}
net_stats_ergm <- rbindlist(net_stats_ergm)
head(net_stats_ergm)
```
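As a rough illustration of what a `nodematch` statistic counts, namely the number of ties whose two endpoints share the same value of an attribute, here is a base R sketch on made-up data. The edgelist and the `gender` vector are hypothetical, and the code does not use `ergm`.

```r
# Hypothetical attribute for four alters
gender <- c(a = "F", b = "F", c = "M", d = "M")

# Hypothetical undirected ties between them
ties <- data.frame(
  from = c("a", "a", "c", "b"),
  to   = c("b", "c", "d", "d"),
  stringsAsFactors = FALSE
)

# A tie matches when both endpoints have the same attribute value
matches <- gender[ties$from] == gender[ties$to]
n_match <- sum(matches)
n_match
```

Here only the ties a-b and c-d connect alters with the same value, so the count is two out of four ties.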
## Saving the data
We end the chapter by saving all our work into four datasets:
- Network statistics (as a csv file)
- Igraph objects (as an rds file, which we can read back using `readRDS`)
- Network objects (idem)
- Person files (alters' information, as a csv file)
CSV files can be saved either using `write.csv` or, as we do here, `fwrite` from the `data.table` package:
```{r saving}
# Checking directory exists
if (!dir.exists("data"))
  dir.create("data")

# Network attributes
master <- cbind(egos, net_stats, net_stats_ergm)
fwrite(master, file = "data/network_stats.csv")

# Networks
saveRDS(graphs, file = "data/networks_igraph.rds")
saveRDS(graphs_network, file = "data/networks_network.rds")

# Attributes
fwrite(persons, file = "data/persons.csv")
```
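Objects written with `saveRDS()` are read back with `readRDS()`, which returns the object so it can be assigned to any name. A minimal sketch using a temporary file and a made-up list instead of the real networks:

```r
# A made-up object standing in for the list of networks
toy_graphs <- list(first = 1:3, second = letters[1:2])

# Writing to a temporary file and reading it back
tmp_file <- tempfile(fileext = ".rds")
saveRDS(toy_graphs, file = tmp_file)
restored <- readRDS(tmp_file)

# The restored object is a copy of the original
identical(toy_graphs, restored)
```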
```{r, include=FALSE}
knitr::opts_chunk$set(collapse = FALSE)
```