The dplyr package in R is a powerful tool for organising, grouping and filtering data with an intuitive syntax. We will show what it can do using StatsBomb's free women's football data. To get the data and get set up for this tutorial, follow my previous guide.
First of all, we will load the events database we saved as an RDS file in the initial StatsBomb guide.
events <- readRDS("SB_events_DB.RDS")
head(events[1:5]) # just to test it loaded well
## id index period timestamp minute
## 1 1055bdac-0320-4ca4-b0a1-38624245501a 1 1 00:00:00.000 0
## 2 78e63f61-3bf1-4a9b-9c4f-27b43e5ed71e 2 1 00:00:00.000 0
## 3 7fe92118-5965-4033-9b59-29b3947a3d8a 3 1 00:00:00.000 0
## 4 c007670e-f679-4f80-b901-b704130fee05 4 1 00:00:00.000 0
## 5 7925a1d3-fc1c-458a-9300-37fa25a2b137 5 1 00:00:00.100 0
## 6 dddef0cb-75ef-4e0e-8f45-e8b7dd9e2c7d 6 1 00:00:00.100 0
Let's say we want to make a Top 10 overview of Expected Goals (xG) from all of the events in our database. We can use dplyr and piping (%>%) to create and display this in just a few lines of code. Let's first load the packages.
require(dplyr) # load the dplyr package
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(formattable) # load the formattable package - make sure it's installed first
## Loading required package: formattable
Piping is a very powerful concept: the pipe takes the result of the expression on its left and passes it as the first argument to the function on its right, so a chain of steps reads from left to right. In the example below, we pass our dataframe 'events' into the pipe, group the data by the player.name variable, sum shot.statsbomb_xg per player, arrange the result in descending order of xG, and finally keep just the top 10.
We then display the outcome using the formattable package as it has nice and minimal formatting.
Overview <- events %>% ## take the events dataframe
  group_by(player.name) %>% ## group by player.name
  summarise(xG = sum(shot.statsbomb_xg, na.rm = TRUE)) %>% ## create an xG total by summing shot.statsbomb_xg
  arrange(desc(xG)) %>% ## sort the results by xG in descending order
  top_n(10) ## select the top 10
## Selecting by xG
formattable(Overview) ## display using formattable package
player.name | xG |
---|---|
Jessica McDonald | 3.1030445 |
Crystal Alyssia Dunn | 1.8279918 |
Jodie Taylor | 1.4601885 |
Lynn Williams | 1.3110152 |
Ana-Maria Crnogorcevic | 1.1316334 |
Samantha Mewis | 0.9054807 |
Débora Cristiane de Oliveira | 0.8887439 |
Ashley Hatch | 0.8830589 |
Francesca Kirby | 0.7904245 |
Alanna Kennedy | 0.7675679 |
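To make the pipe a little more concrete, here is a minimal sketch on a made-up toy data frame (`toy_events` and its values are hypothetical, not StatsBomb data). It shows the same chain written with pipes and as nested function calls, and why `na.rm = TRUE` matters when non-shot events carry `NA` in the xG column.

```r
library(dplyr)

# Hypothetical toy data standing in for the real events table;
# non-shot events carry NA in shot.statsbomb_xg.
toy_events <- data.frame(
  player.name = c("A", "A", "B", "B", "B"),
  shot.statsbomb_xg = c(0.10, NA, 0.30, 0.05, NA),
  stringsAsFactors = FALSE
)

# Piped version: each step receives the result of the previous one.
piped <- toy_events %>%
  group_by(player.name) %>%
  summarise(xG = sum(shot.statsbomb_xg, na.rm = TRUE)) %>%
  arrange(desc(xG))

# Equivalent nested version: the same calls, read from the inside out.
nested <- arrange(
  summarise(
    group_by(toy_events, player.name),
    xG = sum(shot.statsbomb_xg, na.rm = TRUE)
  ),
  desc(xG)
)

# Both approaches should give the same table.
all.equal(as.data.frame(piped), as.data.frame(nested))

# Without na.rm = TRUE, a single NA turns the whole sum into NA.
sum(toy_events$shot.statsbomb_xg)               # NA
sum(toy_events$shot.statsbomb_xg, na.rm = TRUE) # total of the non-missing values
```

The nested form works, but it has to be read from the innermost call outwards, which is the main reason the piped form is usually easier to follow.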
To tidy up the table, we can round the xG totals to two decimal places and give the columns friendlier names.
Overview <- events %>% ## take the events dataframe
  group_by(player.name) %>% ## group by player.name
  summarise(xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2)) %>% ## create an xG total, rounded to 2 decimal places
  arrange(desc(xG)) %>% ## sort the results by xG in descending order
  top_n(10) ## select the top 10
## Selecting by xG
colnames(Overview) <- c("Player", "xG Total")
formattable(Overview) ## display using formattable package
Player | xG Total |
---|---|
Jessica McDonald | 3.10 |
Crystal Alyssia Dunn | 1.83 |
Jodie Taylor | 1.46 |
Lynn Williams | 1.31 |
Ana-Maria Crnogorcevic | 1.13 |
Samantha Mewis | 0.91 |
Débora Cristiane de Oliveira | 0.89 |
Ashley Hatch | 0.88 |
Francesca Kirby | 0.79 |
Alanna Kennedy | 0.77 |
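A note on `top_n()`: the "Selecting by xG" message appears because no column was named, so dplyr falls back to the last column. Since dplyr 1.0.0, `top_n()` has been superseded by `slice_max()`, which takes the ordering column explicitly. The sketch below uses a hypothetical `toy_totals` table and assumes dplyr 1.0.0 or later is installed.

```r
library(dplyr)

# Hypothetical per-player totals, already summarised.
toy_totals <- data.frame(
  player.name = c("A", "B", "C", "D"),
  xG = c(0.4, 1.2, 0.4, 0.9),
  stringsAsFactors = FALSE
)

# Top 2 by xG; the ordering column is named explicitly,
# so there is no "Selecting by" message.
top2 <- toy_totals %>%
  slice_max(xG, n = 2)

# with_ties = FALSE returns exactly n rows even when values tie
# (players A and C both have 0.4 here).
top3 <- toy_totals %>%
  slice_max(xG, n = 3, with_ties = FALSE)
```

With the default `with_ties = TRUE`, asking for three rows here would return four, because the third and fourth values are equal. `top_n()` behaves the same way with ties, which is worth keeping in mind when a "Top 10" table comes back with more than ten rows.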
We can use dplyr to easily create summaries of team-level xG totals by adjusting which variable we group by: team.name instead of player.name.
TeamOverview <- events %>%
  group_by(team.name) %>%
  summarise(xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2)) %>%
  arrange(desc(xG))
colnames(TeamOverview) <- c("Team", "xG Total")
formattable(TeamOverview)
Team | xG Total |
---|---|
North Carolina Courage | 9.92 |
Portland Thorns FC | 4.34 |
Orlando Pride SC | 3.32 |
Seattle Reign FC | 3.08 |
Chelsea LFC | 2.58 |
Washington Spirit | 1.72 |
Utah Royals FC | 1.60 |
Chicago Red Stars | 1.16 |
Houston Dash | 0.90 |
Manchester City WFC | 0.90 |
Sky Blue FC | 0.68 |
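The same building blocks also combine with `filter()`. As a rough sketch, again on made-up data (`toy_events`, "Team X" and "Team Y" are hypothetical), this keeps one team's events and then totals xG per player within that team.

```r
library(dplyr)

# Hypothetical toy events with a team column added.
toy_events <- data.frame(
  team.name = c("Team X", "Team X", "Team Y", "Team X", "Team Y"),
  player.name = c("A", "B", "C", "A", "C"),
  shot.statsbomb_xg = c(0.20, 0.10, 0.50, NA, 0.25),
  stringsAsFactors = FALSE
)

team_x_players <- toy_events %>%
  filter(team.name == "Team X") %>% ## keep only one team's events
  group_by(player.name) %>% ## group by player within that team
  summarise(xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2)) %>% ## xG total per player
  arrange(desc(xG)) ## highest xG first
```

On the real events table, the value inside `filter()` would be swapped for one of the team names shown in the table above.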