xG and dplyr - group_by(), arrange() and top_n()

The dplyr package in R is a powerful tool for organising, grouping and filtering data with intuitive, readable code. We will show what it can do using StatsBomb's free women's football data. To get the data and get set up for this tutorial, follow my previous guide.

First of all, we will load the events database that we saved as an RDS file in the initial StatsBomb guide.

events <- readRDS("SB_events_DB.RDS")
head(events[1:5]) # just to test it loaded well 
##                                     id index period    timestamp minute
## 1 1055bdac-0320-4ca4-b0a1-38624245501a     1      1 00:00:00.000      0
## 2 78e63f61-3bf1-4a9b-9c4f-27b43e5ed71e     2      1 00:00:00.000      0
## 3 7fe92118-5965-4033-9b59-29b3947a3d8a     3      1 00:00:00.000      0
## 4 c007670e-f679-4f80-b901-b704130fee05     4      1 00:00:00.000      0
## 5 7925a1d3-fc1c-458a-9300-37fa25a2b137     5      1 00:00:00.100      0
## 6 dddef0cb-75ef-4e0e-8f45-e8b7dd9e2c7d     6      1 00:00:00.100      0

xG Tables - Player Level with Top 10 Tables

Let's say we want to make a Top 10 overview of Expected Goals (xG) from all of the events in our database. We can use dplyr and piping (%>%) to create and display this in just a few lines of code. Let's first load the packages.

require(dplyr) # load the dplyr package
## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(formattable) # load the formattable package - install it first if it is not already installed
## Loading required package: formattable

Piping is a very powerful concept: it lets us pass a temporary data frame from left to right, with each step applying the calculation defined to the right of the pipe. In the example below, we pass our data frame 'events' into the pipe and group the data by the player.name variable. We pipe the result on to sum all of the shot.statsbomb_xg values per player, then arrange the data frame in descending order of the xG variable, and finally take just the top 10.

We then display the outcome using the formattable package as it has nice and minimal formatting.
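Before running the pipeline on the events data, the same idea can be checked on a tiny made-up data frame. This is only a sketch: toy_events and its values are hypothetical, and it assumes (as the na.rm=TRUE in the real code suggests) that non-shot events carry NA in the xG column. It shows that the piped form gives the same result as the nested form, and why na.rm = TRUE matters.

```r
library(dplyr)

# Hypothetical toy data: two players, with NA standing in for non-shot events
toy_events <- data.frame(
  player.name = c("A", "A", "B", "B", "B"),
  shot.statsbomb_xg = c(0.10, NA, 0.25, 0.05, NA)
)

# Nested form: has to be read from the inside out
nested <- summarise(
  group_by(toy_events, player.name),
  xG = sum(shot.statsbomb_xg, na.rm = TRUE)
)

# Piped form: reads left to right, one step per line
piped <- toy_events %>%
  group_by(player.name) %>%
  summarise(xG = sum(shot.statsbomb_xg, na.rm = TRUE))

# Both forms should hold the same per-player totals
all.equal(as.data.frame(nested), as.data.frame(piped))

# Without na.rm = TRUE, a single NA turns a player's whole total into NA
no_na_rm <- toy_events %>%
  group_by(player.name) %>%
  summarise(xG = sum(shot.statsbomb_xg))
```

The piped version is the one used throughout the rest of this guide, mainly because each step can be read and commented on its own line.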

Overview <- events %>% ## take the events dataframe
  group_by(player.name) %>% ## group by player.name
  summarise(xG = sum(shot.statsbomb_xg, na.rm=TRUE)) %>% ## create an xG total by summing shot.statsbomb_xg, ignoring NAs
  arrange(desc(xG)) %>% ## sort the results by xG in descending order
  top_n(10) ## select the top 10 (top_n() uses the last column, xG, when none is named)
## Selecting by xG
formattable(Overview) ## display using formattable package

| player.name | xG |
| --- | --- |
| Jessica McDonald | 3.1030445 |
| Crystal Alyssia Dunn | 1.8279918 |
| Jodie Taylor | 1.4601885 |
| Lynn Williams | 1.3110152 |
| Ana-Maria Crnogorcevic | 1.1316334 |
| Samantha Mewis | 0.9054807 |
| Débora Cristiane de Oliveira | 0.8887439 |
| Ashley Hatch | 0.8830589 |
| Francesca Kirby | 0.7904245 |
| Alanna Kennedy | 0.7675679 |

There are a few things I want to tidy up: the xG totals could be rounded to 2 decimal places, and the column headings could be changed from 'player.name' to 'Player' and from 'xG' to 'xG Total'.

require(formattable) # load the formattable package - install it first if it is not already installed

Overview <- events %>% ## take the events dataframe
  group_by(player.name) %>% ## group by player.name
  summarise(xG = round(sum(shot.statsbomb_xg, na.rm=TRUE),2)) %>% ## create an xG total by summing shot.statsbomb_xg, rounded to 2 decimal places
  arrange(desc(xG)) %>% ## sort the results by xG in descending order
  top_n(10) ## select the top 10 (top_n() uses the last column, xG, when none is named)
## Selecting by xG
colnames(Overview) <- c("Player", "xG Total")

formattable(Overview) ## display using formattable package

| Player | xG Total |
| --- | --- |
| Jessica McDonald | 3.10 |
| Crystal Alyssia Dunn | 1.83 |
| Jodie Taylor | 1.46 |
| Lynn Williams | 1.31 |
| Ana-Maria Crnogorcevic | 1.13 |
| Samantha Mewis | 0.91 |
| Débora Cristiane de Oliveira | 0.89 |
| Ashley Hatch | 0.88 |
| Francesca Kirby | 0.79 |
| Alanna Kennedy | 0.77 |

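
A note on top_n(): in dplyr 1.0.0 and later it is documented as superseded by slice_max(), and when ties occur at the cut-off it keeps every tied row, so a "top 10" can return more than 10 rows. The sketch below compares the two on made-up totals (toy_totals and its values are hypothetical, and slice_max() assumes dplyr 1.0.0 or newer is installed).

```r
library(dplyr)

# Hypothetical xG totals with a tie for second place
toy_totals <- data.frame(
  Player = c("A", "B", "C", "D"),
  xG = c(0.40, 1.20, 0.90, 0.90)
)

# top_n() with the ranking column named explicitly;
# tied rows at the cut-off are all kept, in their original order
top_two_old <- toy_totals %>%
  top_n(2, xG)

# slice_max() returns rows ordered by xG, largest first;
# with_ties = FALSE caps the result at exactly n rows
top_two_new <- toy_totals %>%
  slice_max(xG, n = 2, with_ties = FALSE)

nrow(top_two_old) # includes both tied players
nrow(top_two_new) # exactly 2 rows
```

Whether ties should be kept is a presentation choice. For a strict Top 10 table, with_ties = FALSE is one option, at the cost of dropping a player who is level with tenth place.
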
xG Tables - Team Level Comparisons

We can use dplyr to easily create summaries of team-level xG totals by changing the variable we group by from player.name to team.name.

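TeamOverview <- events %>% ## take the events dataframe
  group_by(team.name) %>% ## group by team.name instead of player.name
  summarise(xG = round(sum(shot.statsbomb_xg, na.rm=TRUE),2)) %>% ## xG total per team, rounded to 2 decimal places
  arrange(desc(xG)) ## sort the teams by xG in descending order
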
colnames(TeamOverview) <- c("Team", "xG Total")

formattable(TeamOverview)

| Team | xG Total |
| --- | --- |
| North Carolina Courage | 9.92 |
| Portland Thorns FC | 4.34 |
| Orlando Pride SC | 3.32 |
| Seattle Reign FC | 3.08 |
| Chelsea LFC | 2.58 |
| Washington Spirit | 1.72 |
| Utah Royals FC | 1.60 |
| Chicago Red Stars | 1.16 |
| Houston Dash | 0.90 |
| Manchester City WFC | 0.90 |
| Sky Blue FC | 0.68 |
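
As an optional variation, the colnames() step can be folded into the pipe with dplyr's rename(), so the whole table is built in one chain. This is a sketch on made-up data (toy_team_events, TeamOverviewToy and the values are hypothetical), not a change to the results above.

```r
library(dplyr)

# Hypothetical events: two teams, with NA standing in for a non-shot event
toy_team_events <- data.frame(
  team.name = c("Team A", "Team A", "Team B", "Team B"),
  shot.statsbomb_xg = c(0.30, 0.123, NA, 0.456)
)

TeamOverviewToy <- toy_team_events %>%
  group_by(team.name) %>%                                           # one group per team
  summarise(xG = round(sum(shot.statsbomb_xg, na.rm = TRUE), 2)) %>% # rounded xG total
  arrange(desc(xG)) %>%                                             # highest total first
  rename(Team = team.name, `xG Total` = xG)                         # new_name = old_name

TeamOverviewToy
```

Backticks are needed around xG Total because the new column name contains a space.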