This repository contains my efforts to predict how many goals a Premier League team would score in a given match.
First, I had to create a dataset that could be used to predict the number of goals scored in a game. To accomplish this, I web scraped Premier League match data from FBref. The scraping code is contained in FBref_scraper.R and FBref_scraper_basic.R. FBref_scraper_basic.R contains code to scrape one season of Premier League match data (2019-2020), and FBref_scraper.R contains code to scrape more than one season. Pl_team_match_data.csv is the dataset that I created from FBref; it contains Premier League match data from the 2017-2018, 2018-2019, and 2019-2020 campaigns, with 103 variables per team per match. Information about each variable can be found within FBref_scraper.R and/or on FBref.
The PLXG_modeling.R script contains my attempts to build a model best suited to predicting the number of goals scored in a game by a Premier League team, or expected goals. I tried Poisson regression, random forest, ranger, and xgboost models. Initial models were built using all 102 possible explanatory variables, and I found that the best model was an xgbTree model built using the caret library. I was able to improve the xgbTree model's performance by using only the 10 most "important" variables instead of all 102. The performance metric I used to evaluate my models' predictions was Root Mean Square Error (RMSE). All models were built using an 80/20 train/test split of the data.
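As a rough illustration of the evaluation setup described above (an 80/20 train/test split scored with RMSE), here is a minimal base-R sketch. It is not the code from PLXG_modeling.R: the data is simulated, the column names (shots_on_target, possession, goals) are hypothetical stand-ins, and it fits only a simple Poisson regression rather than the xgbTree model.

```r
# Simulated stand-in for the match data; the column names are hypothetical.
set.seed(42)
n <- 500
matches <- data.frame(
  shots_on_target = rpois(n, lambda = 4),
  possession      = runif(n, min = 30, max = 70)
)
# Simulate goal counts whose mean depends on the two predictors.
matches$goals <- rpois(
  n,
  lambda = exp(-1 + 0.2 * matches$shots_on_target + 0.005 * matches$possession)
)

# 80/20 train/test split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- matches[train_idx, ]
test  <- matches[-train_idx, ]

# Poisson regression fitted on the training rows only.
fit <- glm(goals ~ shots_on_target + possession,
           family = poisson(), data = train)

# Root Mean Square Error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# type = "response" returns expected goal counts instead of log counts.
pred <- predict(fit, newdata = test, type = "response")
model_rmse <- rmse(test$goals, pred)

# Naive baseline: always predict the mean goals seen in training.
baseline_rmse <- rmse(test$goals, mean(train$goals))
```

Comparing model_rmse against baseline_rmse gives a quick sanity check on whether the predictors add anything beyond predicting the average number of goals.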
Finally, the PLXG App folder contains the optimal model, PLXGModel.RData, a dataset with only the 10 most "important" variables of match data for each Premier League team, Full_PL_10.csv, and the code for my Shiny application, app.R. The app lets a user walk through my process of getting data and building an expected goals model, and then make predictions with the model themselves. Users can select a team and enter their own values for the variables used to build the model to see how expected goals for a match would change. Users can also select a specific match to see the predicted goal output for a Premier League team against a given opponent on a given date from the 2017-2018, 2018-2019, and 2019-2020 campaigns. Additionally, users can look at visualizations of the variables for each Premier League team to get a better idea of which values are realistic for each team.
Here is the link to the PLXG app. It is hosted on https://www.shinyapps.io.