---
date-modified: 2024-05-27
---

# Egocentric networks

In egocentric social network analysis (ESNA, as we will call it in this book), instead of dealing with a single network, we have as many networks as participants in the study. Egos--the main study subjects--are analyzed from the perspective of their local social networks. For a more extended treatment of ESNA, see Raffaele Vacca's <a href="https://raffaelevacca.github.io/egocentric-r-book/" target="_blank">"*Egocentric network analysis with R*"</a>.

In this chapter, I show how to work with one particular type of ESNA data: information generated by the tool <a href="https://networkcanvas.com/" target="_blank">Network Canvas</a>. You can download an "artificial" ZIP file containing the outputs from a Network Canvas project [here](data-raw/networkCanvasExport-fake.zip)[^netcanvas-file]. We assume the ZIP file was extracted to the folder `data-raw/egonets`. You can go ahead and extract the ZIP by point-and-click or use the following R code to automate the process:

[^netcanvas-file]: I thank [Jacqueline M. Kent-Marvick](https://scholar.google.com/citations?user=Uht4YbkAAAAJ){target="_blank"}, who provided me with what I used as a baseline to generate the artificial Network Canvas export.

```{r echo=FALSE, warning=FALSE, message=FALSE}
# To make the file available
if (!dir.exists("docs/data-raw"))
  dir.create("docs/data-raw", recursive=TRUE)
file.copy(
  "data-raw/networkCanvasExport-fake.zip",
  "docs/data-raw/networkCanvasExport-fake.zip"
  )

knitr::opts_chunk$set(collapse = TRUE)
```

```{r unzip}
unzip(
  zipfile = "data-raw/networkCanvasExport-fake.zip",
  exdir   = "data-raw/egonets"
  )
```

This will extract all the files in `networkCanvasExport-fake.zip` to the folder `data-raw/egonets`. Let's take a look at the first few files:

```{r part-01-07-egonets-files}
head(list.files(path = "data-raw/egonets"))
```

As you can see, for each ego in the dataset, there are four files:

- `...attributeList_Person.csv`: Attributes of the alters.

- `...edgeList_Knows.csv`: Edgelist indicating the ties between the alters.

- `...ego.csv`: Information about the egos.

- `...graphml`: A <a href="https://en.wikipedia.org/wiki/GraphML" target="_blank">`graphml` file</a> that contains the ego's network (egonet).

The next sections will illustrate, file by file, how to read the information into R, apply any required processing, and store the information for later use. We start with the `graphml` files.

## Network files (graphml)

The `graphml` files can be read directly with `igraph`'s `read_graph` function. The key is to take advantage of R's lists: instead of writing the same block of code over and over, we store the networks in a list and process them all at once.

Just like any data-reading function, `read_graph` requires a path to the network file. **The function we will use to list the required files is `list.files()`**:

```{r netread, message=FALSE}
# We start by loading igraph
library(igraph)

# Listing all the graphml files
graph_files <- list.files(
  path       = "data-raw/egonets", # Where are these files
  pattern    = "\\.graphml$",      # Regular expression: names ending in .graphml
  full.names = TRUE                # And we make sure we use the full name
                                   # (path.) Otherwise, we would only get names.
  )

# Taking a look at the first three files we got
graph_files[1:3]

# Applying igraph's read_graph
graphs <- lapply(
  X      = graph_files,       # List of files to read
  FUN    = read_graph,        # The function to apply
  format = "graphml"          # Argument passed to read_graph
  )
```

If the operation succeeded, the previous code block should generate a list of `igraph` objects named `graphs`. Let's take a peek at the first two:

```{r}
graphs[[1]]
graphs[[2]]
```
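
One detail worth keeping in mind about the file listing: the `pattern` argument of `list.files()` is interpreted as a regular expression, not as a shell-style wildcard. The following sketch, which uses made-up file names (they are not the names in the ZIP file), compares an anchored regular expression with a wildcard translated by `glob2rx()`:

```r
# Made-up file names, for illustration only
files <- c("ego01.graphml", "ego01_ego.csv", "graphml_notes.txt")

# An anchored regular expression: names that end in ".graphml"
is_graphml <- grepl("\\.graphml$", files)

# glob2rx() translates a shell-style wildcard into a regular expression
is_graphml_glob <- grepl(glob2rx("*.graphml"), files)

# Both approaches keep only the first file
files[is_graphml]
```

Either form can be passed to `pattern`; the anchored version simply makes the intent (match the file extension at the end of the name) explicit.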

As always, one of the first things we do with networks is visualize them. We will use the <a href="https://cran.r-project.org/package=netplot" target="_blank">`netplot` R package</a> (by yours truly) to draw the figures:

```{r plot-nets, message=FALSE, warning=FALSE}
library(netplot)
library(gridExtra)

# Graph layout is random
set.seed(1231)

# The grid.arrange allows putting multiple netplot graphs into the same page
grid.arrange(
  nplot(graphs[[1]]),
  nplot(graphs[[2]]),
  nplot(graphs[[3]]),
  nplot(graphs[[4]]),
  ncol = 2, nrow = 2
)
```

Great! Since nodes in our network have features, we can add a little color. We will use the `eat_with_2` variable, coded as `TRUE` or `FALSE`. Vertex colors can be specified using the `vertex.color` argument of the `nplot` function. In our case, we will specify the colors by passing a vector with length equal to the number of nodes in the graph. Furthermore, since we will be doing this multiple times, it is worthwhile to write a function:

```{r plot-nets-colored}
# A function to color by the eat with variable
color_it <- function(net) {

  # Extracting the logical vertex attribute eat_with_2
  eatswith <- V(net)$eat_with_2

  # Picking the color: purple if TRUE, darkgreen if FALSE
  ifelse(eatswith, "purple", "darkgreen")

}
```

This function takes a single argument, a network, and returns a vector of colors with one entry per vertex. Vertex attributes in `igraph` can be accessed using the `V(...)$...` syntax. For example, to access the attribute `eat_with_2` in the network `net`, we type `V(net)$eat_with_2`. Individuals with `eat_with_2` equal to `TRUE` will be colored `purple`; those with `FALSE` will be colored `darkgreen`. Before plotting the networks, let's see what we get when we access the `eat_with_2` attribute in the first graph:

```{r}
V(graphs[[1]])$eat_with_2
```

A logical vector. Now let's redraw the figures:

```{r part-01-07-plot-nets-colored}
grid.arrange(
  nplot(graphs[[1]], vertex.color = color_it(graphs[[1]])),
  nplot(graphs[[2]], vertex.color = color_it(graphs[[2]])),
  nplot(graphs[[3]], vertex.color = color_it(graphs[[3]])),
  nplot(graphs[[4]], vertex.color = color_it(graphs[[4]])),
  ncol = 2, nrow = 2
)
```
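
The `color_it()` function relies on `ifelse()`, which returns `NA` wherever the condition is missing. If some alters lacked a value for `eat_with_2` (an assumption; the artificial data may have none), a fallback color could be assigned as in this minimal sketch. The vector and the color `gray` are illustrative choices:

```r
# Made-up attribute values, including a missing answer
eatswith <- c(TRUE, FALSE, NA)

# ifelse() propagates the NA instead of picking a color
cols <- ifelse(eatswith, "purple", "darkgreen")

# Assigning a fallback color to the missing entries
cols[is.na(cols)] <- "gray"
cols
```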

Since most of the time we will be dealing with many egonets, you may want to draw each network independently. The following code block does that. First, if needed, it creates a folder to store the figures. Then, using the `lapply` function, it uses `netplot::nplot()` to draw each network, adds a legend, and saves the figure as `.../graphml_[number].png`, where `[number]` goes from `01` to the total number of networks in `graphs`.

```{r plot-net-all, eval = FALSE}
if (!dir.exists("egonets/figs/egonets"))
  dir.create("egonets/figs/egonets", recursive = TRUE)

lapply(seq_along(graphs), function(i) {
  
  # Creating the device 
  png(sprintf("egonets/figs/egonets/graphml_%02i.png", i))  
  
  # Drawing the plot
  p <- nplot(
    graphs[[i]],
    vertex.color = color_it(graphs[[i]])
    )
  
  # Adding a legend
  p <- nplot_legend(
    p,
    labels = c("eats with: FALSE", "eats with: TRUE"),
    pch    = 21,
    packgrob.args = list(side = "bottom"),
    gp            = gpar(
      fill = c("darkgreen", "purple")
    ),
    ncol = 2
  )
  
  print(p)
  
  # Closing the device
  dev.off()
})
```
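
The file names in the block above are built with `sprintf()`. As a small illustration, the format `%02i` pads integers with a leading zero up to two digits, so the files sort in the expected order:

```r
# Building zero-padded file names for networks 1, 2, and 12
fnames <- sprintf("graphml_%02i.png", c(1L, 2L, 12L))
fnames
```

With more than 99 networks, a wider format such as `%03i` would be needed to keep the names aligned.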


## Person files

Like before, we list the files ending in `Person.csv` (with the full path) and read them into R. While R has the function `read.csv`, here I use the function `fread` from the <a href="https://cran.r-project.org/package=data.table" target="_blank">`data.table`</a> R package. Alongside `dplyr`, `data.table` is one of the most popular data-wrangling tools in R. Besides syntax, the biggest difference between the two is performance: `data.table` is typically among the fastest data management packages in R, which makes it a great alternative for handling large datasets. The following code block loads the package, lists the files, and reads them into R.

```{r read-person}
# Loading data.table
library(data.table)

# Listing the files
person_files <- list.files(
  path       = "data-raw/egonets",
  pattern    = "Person\\.csv$",
  full.names = TRUE
  )

# Loading all into a single list
persons <- lapply(person_files, fread)

# Looking into the first element
persons[[1]]
```

A common task is adding an identifier to each dataset in `persons` so we know to which ego each one belongs. Again, the `lapply` function is our friend:


```{r part-01-07-adding-ids}
persons <- lapply(seq_along(persons), function(i) {
  persons[[i]][, dataset_num := i]
})
```

In `data.table`, variables are created using the `:=` operator. The previous code chunk is equivalent to this:

```r
for (i in seq_along(persons)) {
  persons[[i]]$dataset_num <- i
}
```

If needed, we can transform the list `persons` into a `data.table` object (i.e., a single `data.frame`) using the `rbindlist` function[^rbindlist]. The next code block uses that function to combine the `data.table`s into a single dataset.

[^rbindlist]: Although not identical, `rbindlist(persons)` (almost always) yields the same result as `do.call(rbind, persons)`, so we could have used the latter instead.

```{r}
# Combining the datasets
persons <- rbindlist(persons)
persons
```
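
As an alternative to adding the identifier by hand before combining, `rbindlist` has an `idcol` argument that records which list element each row came from. A minimal sketch with made-up alter tables (the columns `name` and `age` are only illustrative):

```r
library(data.table)

# Two toy alter tables standing in for the per-ego files (made-up data)
toy <- list(
  data.table(name = c("a", "b"), age = c(30, 41)),
  data.table(name = "c", age = 25)
)

# idcol adds a column with the position of each table in the list
combined <- rbindlist(toy, idcol = "dataset_num")
combined
```

Because the list is unnamed, the identifier is the position in the list, which matches the numbering used above.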

Now that we have a single dataset, we can do some data exploration. For example, we can use the package `ggplot2` to draw a histogram of alters' ages.

```{r part-01-07-ggplot-ages}
# Loading the ggplot2 package
library(ggplot2)

# Histogram of age
ggplot(persons, aes(x = age)) +             # Starting off the plot
  geom_histogram(fill = "purple") +         # Adding a histogram
  labs(x = "Age", y = "Frequency") +        # Changing the x/y axis labels
  labs(title = "Alters' Age Distribution")  # Adding a title
```


## Ego files

The ego files contain information about egos (duh!). Again, we will read them all at once using `list.files` + `lapply`:

```{r read-ego}
# Listing files ending with *ego.csv
ego_files <- list.files(
  path       = "data-raw/egonets",
  pattern    = "ego\\.csv$",
  full.names = TRUE
  )

# Reading the files with fread
egos <- lapply(ego_files, fread)

# Combining them
egos <- rbindlist(egos)
head(egos)
```

A cool thing about `data.table` is that, within square brackets, we can manipulate the data referring to the variables directly. For example, if we wanted to calculate the difference between `sessionFinish` and `sessionStart`, using base R we would do the following:

```r
egos$total_time <- egos$sessionFinish - egos$sessionStart
```

Whereas with `data.table`, variable creation is much more straightforward (notice that instead of using `<-` or `=` to assign a variable, we use the `:=` operator):

```{r egos-time}
# How much time?
egos[, total_time := sessionFinish - sessionStart]
```

We can also visualize this using `ggplot2`:

```{r egos-time-plot}
ggplot(egos, aes(x = total_time)) +
  geom_histogram() +
  labs(x = "Time in minutes", y = "Count") +
  labs(title = "Total time spent by egos")
```
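
When two date-time values are subtracted, R picks the units of the result automatically, so the values may not be in minutes. A possible way to make the units explicit is `difftime()`. The sketch below uses made-up timestamps rather than the `sessionStart` and `sessionFinish` columns:

```r
# Made-up start and finish times for two sessions
start  <- as.POSIXct(c("2024-01-01 10:00:00", "2024-01-01 11:00:00"), tz = "UTC")
finish <- as.POSIXct(c("2024-01-01 10:12:30", "2024-01-01 11:45:00"), tz = "UTC")

# difftime() with explicit units, converted to a plain numeric vector
elapsed <- as.numeric(difftime(finish, start, units = "mins"))
elapsed
```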

## Edgelist files

Since we are already reading the `graphml` files, using the edgelists may not be needed. Nevertheless, the process to import the edgelist files into R is the same one we have been applying: list the files and read them all at once using `lapply`:

```{r read-edgelist}
# Listing all files ending in Knows.csv
edgelist_files <- list.files(
  path = "data-raw/egonets",
  pattern = "Knows\\.csv$",
  full.names = TRUE
  )

# Reading all files at once
edgelists <- lapply(edgelist_files, fread)
```

To avoid confusion, we can also add ids corresponding to the file number. Once we do that, we can combine all files into a single `data.table` object using `rbindlist`:

```{r part-01-07-adding-ids-edgelist}
edgelists <- lapply(seq_along(edgelists), function(i) {
  edgelists[[i]][, dataset_num := i]
})

edgelists <- rbindlist(edgelists)

head(edgelists)
```
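
Once the edgelists are stacked, the identifier makes per-ego summaries straightforward. As a hedged sketch, the following counts the ties in each egonet using `data.table`'s `.N`. The data are made up, and the column names `from` and `to` are assumptions, not necessarily those in the Network Canvas export:

```r
library(data.table)

# Made-up combined edgelist; `from` and `to` are hypothetical column names
toy_edges <- data.table(
  from        = c(1, 1, 2, 1),
  to          = c(2, 3, 3, 2),
  dataset_num = c(1, 1, 1, 2)
)

# .N counts the rows (ties) within each group defined by `by`
ties_per_ego <- toy_edges[, .(n_ties = .N), by = dataset_num]
ties_per_ego
```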


## Putting all together

In this last part of the chapter, we will use the `igraph` and `ergm` packages to generate features (covariates, controls, independent variables, or whatever you call them) at the ego-network level. Once again, the `lapply` function is our friend.

### Generating statistics using igraph

The `igraph` R package has multiple high-performing routines to compute graph-level statistics. For now, we will focus on the following statistics: vertex count, edge count, number of isolates, transitivity, and modularity of the communities detected with the edge-betweenness algorithm:

```{r gen-stats}
net_stats <- lapply(graphs, function(g) {
  
  # Detecting communities (needed to compute modularity)
  groups <- cluster_edge_betweenness(g)
  
  # Computing the stats
  data.table(
    size      = vcount(g),
    edges     = ecount(g),
    nisolates = sum(degree(g) == 0),
    transit   = transitivity(g, type = "global"),
    modular   = modularity(groups)
  )
})
```

Observe that we count isolates as the vertices whose `degree()` equals zero. We can combine the statistics into a single `data.table` using the `rbindlist` function:

```{r}
net_stats <- rbindlist(net_stats)

head(net_stats)
```
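
To build some intuition about these statistics, it may help to compute them on a network whose answers we know in advance. The toy graph below (not part of the dataset) is a closed triangle plus one isolated vertex:

```r
library(igraph)

# A toy network: a closed triangle (a, b, c) plus one isolate (d)
g <- graph_from_literal(a - b, b - c, c - a, d)

# Same statistics as above, except modularity
toy_stats <- c(
  size      = vcount(g),
  edges     = ecount(g),
  nisolates = sum(degree(g) == 0),
  transit   = transitivity(g, type = "global")
)
toy_stats
```

The graph has four vertices, three edges, one isolate, and, since every connected triple is closed, a global transitivity of one.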

### Generating statistics based on ergm

The `ergm` R package has a much larger set of graph-level statistics we can add to our models.[^ergm-stats] The key to generating statistics based on the `ergm` package is the `summary_formula` function. Before we start using that function, we first need to convert the `igraph` networks to `network` objects, which are the native object class for the `ergm` package. We use the <a href="https://cran.r-project.org/package=intergraph" target="_blank">`intergraph`</a> R package for that, and in particular, the `asNetwork` function:

[^ergm-stats]: There's an obvious reason: ERGMs are all about graph-level statistics!

```{r part-01-07-igraph-to-network}
# Loading the required packages
library(intergraph)
library(ergm)

# Converting all "igraph" objects in graphs to "network" objects
graphs_network <- lapply(graphs, asNetwork)
```

With the network objects ready, we can proceed to compute graph-level statistics using the `summary_formula` function. Here we will only look at the number of triangles, gender homophily, and healthy-diet homophily:

```{r part-01-07-gen-stats-ergm}
net_stats_ergm <- lapply(graphs_network, function(n) {
  
  # Computing the statistics
  s <- summary_formula(
    n ~ triangles +
      nodematch("gender_1") +
      nodematch("healthy_diet")
    )
  
  # Saving them as a data.table object
  data.table(
    triangles       = s[1],
    gender_homoph   = s[2],
    healthyd_homoph = s[3]
  )
})
```

Once again, we use `rbindlist` to combine all the network statistics into a single `data.table` object:

```{r part-01-07-combine-ergm-stats}
net_stats_ergm <- rbindlist(net_stats_ergm)
head(net_stats_ergm)
```
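
The homophily terms are counts: `nodematch` reports the number of ties connecting two vertices that share the same value of an attribute. A small sketch on a hand-made network may make this concrete. The network and the attribute `group` are made up for illustration:

```r
library(network)
library(ergm)

# Toy undirected network: a triangle (1, 2, 3) plus the tie 3-4
el  <- matrix(c(1, 2,  2, 3,  1, 3,  3, 4), ncol = 2, byrow = TRUE)
net <- network(el, directed = FALSE, matrix.type = "edgelist")

# A made-up vertex attribute
net %v% "group" <- c("a", "a", "b", "b")

# One triangle; two ties link vertices in the same group (1-2 and 3-4)
toy_s <- summary_formula(net ~ triangles + nodematch("group"))
toy_s
```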

## Saving the data

We end the chapter by saving all our work into four datasets:

- Network statistics (as a csv file)

- Igraph objects (as an rds file, which we can read back using `readRDS`)

- Network objects (idem)

- Person files (alters' information, as a csv file).

CSV files can be saved either using `write.csv` or, as we do here, `fwrite` from the `data.table` package:

```{r saving}
# Checking directory exists
if (!dir.exists("data"))
  dir.create("data")

# Network attributes
master <- cbind(egos, net_stats, net_stats_ergm)
fwrite(master, file = "data/network_stats.csv")

# Networks
saveRDS(graphs, file = "data/networks_igraph.rds")
saveRDS(graphs_network, file = "data/networks_network.rds")

# Attributes
fwrite(persons, file = "data/persons.csv")
```
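
Objects stored with `saveRDS` can be read back with `readRDS`. The following sketch shows the round trip using a temporary file and a stand-in object rather than the actual networks:

```r
# A stand-in for the list of networks (made-up object)
obj <- list(a = 1:3, b = "ego")

# Saving to a temporary file and reading it back
path <- tempfile(fileext = ".rds")
saveRDS(obj, file = path)
restored <- readRDS(path)

# The restored object is identical to the original
identical(obj, restored)
```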

```{r, include=FALSE}
knitr::opts_chunk$set(collapse = FALSE)
```