Merge branch 'master' into new-stats-approach
mrcaseb authored Oct 15, 2024
2 parents b367fb4 + eb133e6 commit 7942ae7
Showing 14 changed files with 82 additions and 119 deletions.
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: nflfastR
Title: Functions to Efficiently Access NFL Play by Play Data
Version: 4.6.1.9011
Version: 4.6.1.9018
Authors@R:
c(person(given = "Sebastian",
family = "Carl",
@@ -44,7 +44,7 @@ Depends:
Imports:
cli (>= 3.0.0),
curl,
data.table (>= 1.14.0),
data.table (>= 1.15.0),
dplyr (>= 1.0.0),
fastrmodels (>= 1.0.1),
furrr,
@@ -55,7 +55,7 @@ Imports:
nflreadr (>= 1.2.0),
progressr (>= 0.6.0),
rlang (>= 0.4.7),
stringr (>= 1.3.0),
stringr (>= 1.4.0),
tibble (>= 3.0),
tidyr (>= 1.0.0),
tidyselect (>= 1.1.0),
1 change: 1 addition & 0 deletions NAMESPACE
@@ -29,6 +29,7 @@ import(fastrmodels)
importFrom(cli,rule)
importFrom(curl,curl_fetch_memory)
importFrom(data.table,"%between%")
importFrom(data.table,"%chin%")
importFrom(data.table,setDT)
importFrom(furrr,future_map)
importFrom(furrr,future_map_chr)
4 changes: 4 additions & 0 deletions NEWS.md
@@ -14,6 +14,10 @@
- Added identification of scrambles from 1999 through 2004 with thanks to Aaron Schatz (#468)
- Added new function `calculate_stats()` that combines the output of all `calculate_player_stats*()` functions with a more robust and faster approach. The `calculate_player_stats*()` function will be deprecated.
- Updated the dataframe `stat_ids` with some IDs that were previously missing.
- nflfastR tried to fix bugs in the underlying pbp data of JAX home games prior to the 2016 season. An update of the raw pbp data resolved those bugs so nflfastR needs to remove the hard coded adjustments. This means that nflfastR <= v4.6.1 will return incorrect pbp data for all Jacksonville home games prior to the 2016 season! (#478)
- Fixed a problem where `clean_pbp()` returned `pass = 1` in actual rush plays in very rare cases. (#479)
- Removed extra lines for injury timeouts that were breaking `fixed_drive` (#482)
- The variable `penalty_type` now correctly lists the penalty "Kickoff Short of Landing Zone" introduced in the 2024 season. (#486)
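
The injury-update entry (#482) in the list above can be illustrated with a minimal base-R sketch. It assumes a small, made-up data frame named `plays` in place of real scraped rows, and it uses `startsWith()` as a stand-in for the `stringr::str_starts()` call that the commit adds; it is not the package's own code.

```r
# Hypothetical stand-in for a few scraped play rows (not real game data).
plays <- data.frame(
  play_description = c(
    "(15:00) J.Doe pass short left to A.Smith for 7 yards.",
    "** Injury Update: A.Smith has returned to the game.",
    "Timeout #1 by KC at 02:11."
  ),
  timeout_team = c(NA, NA, "KC"),
  stringsAsFactors = FALSE
)

# Flag rows that carry no timeout team and whose lower-cased description
# starts with the injury-update marker; startsWith() matches the prefix literally.
is_injury_update <- is.na(plays$timeout_team) &
  startsWith(tolower(plays$play_description), "** injury update:")

# Keep everything else.
plays_clean <- plays[!is_injury_update, , drop = FALSE]
```

Only the injury-update row is dropped here; the regular play and the timeout row stay in `plays_clean`.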

# nflfastR 4.6.1

12 changes: 6 additions & 6 deletions R/helper_add_nflscrapr_mutations.R
@@ -92,6 +92,12 @@ add_nflscrapr_mutations <- function(pbp) {
stringr::str_remove("\\([0-9]{2}+ Yards\\)") %>%
stringr::str_squish(), NA_character_
),
# The new "dynamic Kickoff" in the 2024 season introduces a new penalty type
penalty_type = dplyr::if_else(
.data$penalty == 1 & stringr::str_detect(tolower(.data$play_description), "kickoff short of landing zone"),
"Kickoff Short of Landing Zone",
.data$penalty_type
),
# Make plays marked with down == 0 as NA:
down = dplyr::if_else(
.data$down == 0,
@@ -397,12 +403,6 @@ add_nflscrapr_mutations <- function(pbp) {
.data$replay_or_challenge == 1 & .data$timeout == 1 & is.na(.data$timeout_team), .data$tmp_timeout, .data$timeout_team

),
timeout_team = dplyr::if_else(
.data$season <= 2015 & (.data$home_team %in% c("JAC", "JAX") | .data$away_team %in% c("JAC", "JAX")) & .data$timeout_team == "JAX",
"JAC",
.data$timeout_team
),


home_timeouts_remaining = dplyr::if_else(
.data$quarter %in% c(1, 2, 3, 4),
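
As a rough illustration of the new `penalty_type` rule in this file, the sketch below reproduces the condition with base R only. The helper name `flag_short_kickoff` and the sample descriptions are made up for this example; the commit itself uses `dplyr::if_else()` and `stringr::str_detect()` inside a `mutate()` call.

```r
# Hypothetical helper; mirrors the condition used in the diff with base R.
flag_short_kickoff <- function(penalty, play_description, penalty_type) {
  # A penalty play whose lower-cased description mentions the new penalty.
  hit <- penalty == 1 &
    grepl("kickoff short of landing zone", tolower(play_description), fixed = TRUE)
  # Overwrite the type on matches, keep the existing value otherwise.
  ifelse(hit, "Kickoff Short of Landing Zone", penalty_type)
}

# Made-up sample rows, not real play-by-play data.
penalty_types <- flag_short_kickoff(
  penalty = c(1, 1, 0),
  play_description = c(
    "PENALTY on XX, Kickoff Short of Landing Zone",
    "PENALTY on XX, Offensive Holding, 10 yards",
    "J.Doe kicks 65 yards"
  ),
  penalty_type = c(NA, "Offensive Holding", NA)
)
```

Lower-casing the description before matching is what makes the rule robust to capitalization differences in the raw text.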
33 changes: 30 additions & 3 deletions R/helper_additional_functions.R
@@ -104,16 +104,19 @@ clean_pbp <- function(pbp, ...) {
),
# if there's a pass, sack, or scramble, it's a pass play...
pass = dplyr::if_else(stringr::str_detect(.data$desc, "( pass )|(sacked)|(scramble)") | .data$qb_scramble == 1, 1, 0),
# ...unless it says "backwards pass" and there's a rusher
# ...unless it says "backward(s) pass" or "lateral pass" and there's a rusher
pass = dplyr::if_else(
stringr::str_detect(.data$desc, "(backward pass)|(Backward pass)") & !is.na(.data$rusher),
stringr::str_detect(stringr::str_to_lower(.data$desc), "(backward pass)|(backwards pass)|(lateral pass)") & !is.na(.data$rusher),
0, .data$pass
),
# and make sure there's no pass on a kickoff (sometimes there's forward pass on kickoff but that's not a pass play)
pass = dplyr::case_when(
.data$kickoff_attempt == 1 ~ 0,
TRUE ~ .data$pass
),
# in very rare cases, the pass logic can fail. We do a hard coded overwrite here because it's not worth the time
# to overthink the logic to catch weird play descriptions.
pass = fix_weird_pass_plays(.data$pass, .data$game_id, .data$play_id),
#if there's a rusher and it wasn't a QB kneel or pass play, it's a run play
rush = dplyr::if_else(!is.na(.data$rusher) & .data$qb_kneel == 0 & .data$pass == 0, 1, 0),
#fix some common QBs with inconsistent names
@@ -281,7 +284,7 @@ clean_pbp <- function(pbp, ...) {
big_parser <- "(?<=)[A-Z][A-z]*+(\\.|\\s)+[A-Z][A-z]*+\\'*\\-*[A-Z]*+[a-z]*+(\\s((Jr.)|(Sr.)|I{2,3})|(IV))?"
# maybe some spaces and letters, and then a rush direction unless they fumbled
rush_finder <- "(?=\\s*[a-z]*+\\s*((FUMBLES) | (left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)))"
# maybe some spaces and leters, and then pass / sack / scramble
# maybe some spaces and letters, and then pass / sack / scramble
pass_finder <- "(?=\\s*[a-z]*+\\s*(( pass)|(sack)|(scramble)))"
# to or for, maybe a jersey number and a dash
receiver_finder <- "(?<=((to)|(for))\\s[:digit:]{0,2}\\-{0,1})"
@@ -313,6 +316,7 @@ team_name_fn <- function(var) {
"JAC" = "JAX",
"STL" = "LA",
"SL" = "LA",
"LAR" = "LA",
"ARZ" = "ARI",
"BLT" = "BAL",
"CLV" = "CLE",
@@ -401,3 +405,26 @@ add_qb_epa <- function(pbp, ...) {
return(pbp)
}

# Function that fixes false "pass" positives in some hard coded plays where
# the parser logic reached its limit
fix_weird_pass_plays <- function(pass, game_id, play_id){
combined_id <- paste(game_id, play_id, sep = "_")
false_positives <- c(
"1999_01_ARI_PHI_1611",
"1999_01_SF_JAX_1788",
"1999_01_SF_JAX_2081",
"1999_11_ATL_TB_1740",
"2001_09_MIN_PHI_1307",
"2001_14_NE_BUF_452",
"2002_16_PIT_TB_527",
"2003_02_HOU_NO_3924",
"2003_15_PIT_NYJ_873",
"2004_05_BUF_NYJ_2555",
"2005_07_SD_PHI_321",
"2011_02_STL_NYG_1369",
"2016_05_NE_CLE_912",
"2016_06_CAR_NO_2690",
"2020_10_BAL_NE_2013"
)
data.table::fifelse(combined_id %chin% false_positives, 0, pass, pass)
}
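
To see how the hard-coded overwrite behaves, here is a small base-R sketch. It is an assumption-laden stand-in, not the package function: `%in%` and `ifelse()` replace `%chin%` and `data.table::fifelse()`, the name `fix_weird_pass_plays_base` is invented, and only two of the listed ids are copied over.

```r
# Base-R stand-in for the data.table version shown in the diff.
fix_weird_pass_plays_base <- function(pass, game_id, play_id) {
  # Build one lookup key per play, e.g. "1999_01_ARI_PHI_1611".
  combined_id <- paste(game_id, play_id, sep = "_")
  # Two ids copied from the list in the diff; the real list is longer.
  false_positives <- c("1999_01_ARI_PHI_1611", "2020_10_BAL_NE_2013")
  # Force pass = 0 on listed plays, leave every other value untouched.
  ifelse(combined_id %in% false_positives, 0, pass)
}

fixed <- fix_weird_pass_plays_base(
  pass = c(1, 1, 0),
  game_id = c("1999_01_ARI_PHI", "1999_01_ARI_PHI", "2020_10_BAL_NE"),
  play_id = c(1611, 1612, 2013)
)
```

Only the first play is changed (from 1 to 0); the second is not on the list and the third was already 0. Keying on `game_id` plus `play_id` keeps the override limited to exactly the listed plays.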
132 changes: 31 additions & 101 deletions R/helper_scrape_nfl.R
@@ -27,11 +27,19 @@ get_pbp_nfl <- function(id,
TRUE ~ "POST"
)

# game_info <- raw_data$data$viewer$gameDetail

game_id <- raw_data$data$viewer$gameDetail$id
home_team <- raw_data$data$viewer$gameDetail$homeTeam$abbreviation
away_team <- raw_data$data$viewer$gameDetail$visitorTeam$abbreviation
home_team <- data.table::fcase(
home_team == "JAC", "JAX",
home_team == "SD", "LAC",
default = home_team
)
away_team <- data.table::fcase(
away_team == "JAC", "JAX",
away_team == "SD", "LAC",
default = away_team
)

# if home team and away team are the same, the game is messed up and needs fixing
if (home_team == away_team) {
@@ -95,28 +103,6 @@ get_pbp_nfl <- function(id,
)
}

#fill missing posteam info for this
if (
((home_team %in% c("JAC", "JAX") | away_team %in% c("JAC", "JAX")) & season <= 2015) |
bad_game == 1
) {
plays <- plays %>%
dplyr::mutate(
possessionTeam.abbreviation = stringr::str_extract(plays$prePlayByPlay, '[A-Z]{2,3}(?=\\s)'),
possessionTeam.abbreviation = dplyr::if_else(
.data$possessionTeam.abbreviation %in% c('OUT', 'END', 'NA'),
NA_character_, .data$possessionTeam.abbreviation
),
possessionTeam.abbreviation = dplyr::if_else(
.data$possessionTeam.abbreviation == 'JAX', 'JAC', .data$possessionTeam.abbreviation
)
)

# for these old games, we're making everything JAC instead of JAX
home_team <- dplyr::if_else(home_team == "JAX", "JAC", home_team)
away_team <- dplyr::if_else(away_team == "JAX", "JAC", away_team)
}

drives <- raw_data$data$viewer$gameDetail$drives %>%
dplyr::mutate(ydsnet = .data$yards + .data$yardsPenalized) %>%
# these are already in plays
@@ -164,6 +150,13 @@ get_pbp_nfl <- function(id,
dplyr::mutate_if(is.logical, as.numeric) %>%
dplyr::mutate_if(is.integer, as.numeric) %>%
dplyr::mutate_if(is.factor, as.character) %>%
# The abbreviations SD <-> LAC and JAC <-> JAX are mixed up in the raw json data
# to make sure team names match, we normalize the names here
# We also remove new line characters esp. from desc
dplyr::mutate_if(
.predicate = is.character,
.funs = ~ team_name_fn(.x) %>% stringr::str_replace_all("[\r\n]", " ") %>% stringr::str_squish()
) %>%
janitor::clean_names() %>%
dplyr::select(-"drive_play_count", -"drive_time_of_possession", -"next_play_type") %>%
dplyr::rename(
@@ -194,79 +187,17 @@ get_pbp_nfl <- function(id,
season_type = season_type,
play_clock = as.character(.data$play_clock),
st_play_type = as.character(.data$st_play_type),
#if JAC has the ball and scored, make them the scoring team
td_team = dplyr::if_else(
.data$season <= 2015 & .data$posteam %in% c("JAC", "JAX") &
.data$drive_how_ended_description == 'Touchdown' & !is.na(.data$td_team),
'JAC', .data$td_team
),
#if JAC involved in a game and defensive team score, fill in the right team
td_team = dplyr::if_else(
#game involving the jags
.data$season <= 2015 & (.data$home_team %in% c("JAC", "JAX") | .data$away_team %in% c("JAC", "JAX")) &
#defensive TD
.data$drive_how_ended_description != 'Touchdown' & !is.na(.data$td_team),
#if home team has ball, then away team scored, otherwise home team scored
dplyr::if_else(.data$posteam == .data$home_team, .data$away_team, .data$home_team),
.data$td_team
),

# fix muffed punt td in JAC game
td_team = dplyr::if_else(id == "2011_14_TB_JAX" & .data$play_id == 1343, 'JAC', .data$td_team),
td_team = dplyr::if_else(id == "2011_14_TB_JAX" & .data$play_id == 1343 & .data$td_team != "JAX", 'JAX', .data$td_team),

# kickoff return TDs in old JAC games
td_team = dplyr::if_else(id == "2006_14_IND_JAX" & .data$play_id == 2078, 'JAC', .data$td_team),
td_team = dplyr::if_else(id == "2007_17_JAX_HOU" & .data$play_id %in% c(1907, 2042), 'HOU', .data$td_team),
td_team = dplyr::if_else(id == "2008_09_JAX_CIN" & .data$play_id == 3145, 'JAC', .data$td_team),
td_team = dplyr::if_else(id == "2009_15_IND_JAX" & .data$play_id == 1088, 'IND', .data$td_team),
td_team = dplyr::if_else(id == "2010_15_JAX_IND" & .data$play_id == 3848, 'IND', .data$td_team),
td_team = dplyr::if_else(id == "2006_14_IND_JAX" & .data$play_id == 2078 & .data$td_team != "JAX", 'JAX', .data$td_team),
td_team = dplyr::if_else(id == "2007_17_JAX_HOU" & .data$play_id %in% c(1907, 2042) & .data$td_team != "JAX", 'HOU', .data$td_team),
td_team = dplyr::if_else(id == "2008_09_JAX_CIN" & .data$play_id == 3145 & .data$td_team != "JAX", 'JAX', .data$td_team),
td_team = dplyr::if_else(id == "2009_15_IND_JAX" & .data$play_id == 1088 & .data$td_team != "JAX", 'IND', .data$td_team),
td_team = dplyr::if_else(id == "2010_15_JAX_IND" & .data$play_id == 3848 & .data$td_team != "JAX", 'IND', .data$td_team),

# fill in return team for the JAX games
return_team = dplyr::if_else(
!is.na(.data$return_team) & .data$season <= 2015 & (.data$home_team %in% c("JAC", "JAX") | .data$away_team %in% c("JAC", "JAX")),
dplyr::if_else(
# if the home team has the ball, return team is away team (this is before we flip posteam for kickoffs)
.data$posteam == .data$home_team, .data$away_team, .data$home_team
),
.data$return_team
),
fumble_recovery_1_team = dplyr::if_else(
!is.na(.data$fumble_recovery_1_team) & .data$season <= 2015 & (.data$home_team %in% c("JAC", "JAX") | .data$away_team %in% c("JAC", "JAX")),
# assign possession based on fumble_lost
dplyr::case_when(
.data$fumble_lost == 1 & .data$posteam == .data$home_team ~ .data$away_team,
.data$fumble_lost == 1 & .data$posteam == .data$away_team ~ .data$home_team,
.data$fumble_lost == 0 & .data$posteam == .data$home_team ~ .data$home_team,
.data$fumble_lost == 0 & .data$posteam == .data$away_team ~ .data$away_team
),
.data$fumble_recovery_1_team
),
timeout_team = dplyr::if_else(
# if there's a timeout in the affected seasons
!is.na(.data$timeout_team) & .data$season <= 2015 & (.data$home_team %in% c("JAC", "JAX") | .data$away_team %in% c("JAC", "JAX")),
# extract from play description
# make it JAC instead of JAX to be consistent with everything else
dplyr::if_else(
stringr::str_extract(.data$play_description, "(?<=Timeout #[1-3] by )[:upper:]+") == "JAX", "JAC", stringr::str_extract(.data$play_description, "(?<=Timeout #[1-3] by )[:upper:]+")
),
.data$timeout_team
),
# Also fix penalty team for JAC games
penalty_team = dplyr::if_else(
# if there's a penalty_team in the affected seasons
!is.na(.data$penalty_team) & .data$season <= 2015 & (.data$home_team %in% c("JAC", "JAX") | .data$away_team %in% c("JAC", "JAX")),
# extract from play description
# make it JAC instead of JAX to be consistent with everything else
dplyr::if_else(
stringr::str_extract(.data$play_description, "(?<=PENALTY on )[:upper:]{2,3}") == "JAX",
"JAC",
stringr::str_extract(.data$play_description, "(?<=PENALTY on )[:upper:]{2,3}")
),
.data$penalty_team
),
yardline_side = dplyr::if_else(
.data$season <= 2015 & .data$yardline_side == 'JAX',
'JAC', .data$yardline_side
),
time = dplyr::case_when(
id == '2012_04_NO_GB' & .data$play_id == 1085 ~ '3:34',
id == '2012_16_BUF_MIA' & .data$play_id == 2571 ~ '8:31',
@@ -277,12 +208,6 @@ get_pbp_nfl <- function(id,
# usage of base ifelse is important here for non-scoring games (i.e. early live games)
safety_team = ifelse(.data$safety == 1, .data$scoring_team_abbreviation, NA_character_),

# scoring_team_abbreviation messed up on old Jags games so just assume it's defense team
safety_team = ifelse(
.data$safety == 1 & .data$season <= 2015 & (.data$home_team %in% c("JAC", "JAX") | .data$away_team %in% c("JAC", "JAX")),
ifelse(.data$posteam == .data$home_team, .data$away_team, .data$home_team), .data$safety_team
),

# can't trust the goal_to_go variable so we overwrite it here
goal_to_go = as.integer(stringr::str_detect(tolower(.data$pre_play_by_play), "goal"))

@@ -296,6 +221,11 @@ get_pbp_nfl <- function(id,
dplyr::filter(
!(is.na(.data$timeout_team) & stringr::str_detect(tolower(.data$play_description), "timeout at|two-minute"))
) %>%
# Data in 2024 pbp introduced separate "plays" for injury updates
# These mess up some of our logic. Since they are useless, we remove them here
dplyr::filter(
!(is.na(.data$timeout_team) & stringr::str_starts(tolower(.data$play_description), "\\*\\* injury update:"))
) %>%
fix_posteams()

# fix for games where home_team == away_team and fields are messed up
@@ -377,12 +307,12 @@ fix_bad_games <- function(pbp) {
}

fix_posteams <- function(pbp){
# 2023 pbp introduces two new problems
# Data source switch in 2023 introduced new problems
# 1. Definition of posteam on kick offs changed to receiving team. That's our
# definition and we swap teams later.
# 2. Posteam doesn't change on the PAT after defensive TD
#
# We adjust both things here, but only for 2023ff to avoid backwards compatibility problems
# We adjust both things here
# We need the variable pre_play_by_play which usually looks like "KC 1-10 NYJ 40"
if ("pre_play_by_play" %in% names(pbp)){
# Let's be as explicit as possible about what we want to extract from the string
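
The abbreviation and whitespace clean-up that this file now applies to character columns can be sketched in base R as follows. Both helper names (`normalize_team`, `squish`) are hypothetical, the lookup only contains the two pairs named in the diff's `fcase()` calls, and `squish()` only approximates `stringr::str_squish()`; the package's own `team_name_fn()` covers more codes.

```r
# Hypothetical, reduced lookup based on the two pairs named in the diff.
normalize_team <- function(x) {
  lookup <- c(JAC = "JAX", SD = "LAC")
  # Only whole-string matches are replaced; NA and other values pass through.
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

# Replace carriage returns and line feeds with spaces, then collapse runs of
# whitespace and trim both ends.
squish <- function(x) {
  x <- gsub("[\r\n]", " ", x)
  trimws(gsub("\\s+", " ", x))
}

teams <- normalize_team(c("JAC", "SD", "KC", NA))
desc  <- squish("A.Smith  pass\r\nshort left ")
```

Normalizing once, right after the raw JSON is flattened, is what lets the commit delete the many per-column JAC/JAX special cases further down in the same function.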
2 changes: 1 addition & 1 deletion R/nflfastR-package.R
@@ -106,7 +106,7 @@
#' @import dplyr
#' @importFrom cli rule
#' @importFrom curl curl_fetch_memory
#' @importFrom data.table setDT %between%
#' @importFrom data.table setDT %between% %chin%
#' @import fastrmodels
#' @importFrom furrr future_map_chr future_map_dfr future_map
#' @importFrom future plan
Binary file modified R/sysdata.rda
Binary file not shown.
4 changes: 2 additions & 2 deletions README.Rmd
@@ -17,7 +17,7 @@ knitr::opts_chunk$set(

<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version-last-release/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![CRAN downloads](http://cranlogs.r-pkg.org/badges/grand-total/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/grand-total/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![Dev status](https://img.shields.io/github/r-package/v/nflverse/nflfastR/master?label=dev%20version&style=flat-square&logo=github)](https://www.nflfastr.com/)
[![R-CMD-check](https://github.com/nflverse/nflfastR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/nflverse/nflfastR/actions/workflows/R-CMD-check.yaml)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
@@ -154,7 +154,7 @@ knitr::include_graphics('man/figures/readme-cp-model-1.png')

* To Nick Shoemaker for [finding and making available JSON-formatted NFL play-by-play back to 1999](https://github.com/CroppedClamp/nfl_pbps) (`nflfastR` uses this source for 1999 and 2000 and previously also used it for 2001-2010)
* To Lau Sze Yui for developing a scraping function to access JSON-formatted NFL play-by-play beginning in 2001
* To Aaron Schatz and Football Outsiders for providing charting data to correctly mark scrambles in the 2005 season
* To Aaron Schatz and [FTN Fantasy](https://ftnfantasy.com/dvoa/nfl) for providing charting data to correctly mark scrambles in the 1999-2005 seasons
* To Lee Sharpe for curating a resource for game information
* To Timo Riske, Lau Sze Yui, Sean Clement, and Daniel Houston for many helpful discussions regarding the development of the new `nflfastR` models
* To Zach Feldman and Josh Hermsmeyer for many helpful discussions about CPOE models as well as Peter Owen for many helpful suggestions for the CP model
7 changes: 4 additions & 3 deletions README.md
@@ -8,7 +8,7 @@
[![CRAN
status](https://www.r-pkg.org/badges/version-last-release/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![CRAN
downloads](http://cranlogs.r-pkg.org/badges/grand-total/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
downloads](https://cranlogs.r-pkg.org/badges/grand-total/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![Dev
status](https://img.shields.io/github/r-package/v/nflverse/nflfastR/master?label=dev%20version&style=flat-square&logo=github)](https://www.nflfastr.com/)
[![R-CMD-check](https://github.com/nflverse/nflfastR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/nflverse/nflfastR/actions/workflows/R-CMD-check.yaml)
@@ -123,8 +123,9 @@ incorporating the pre-game spread.
used it for 2001-2010)
- To Lau Sze Yui for developing a scraping function to access
JSON-formatted NFL play-by-play beginning in 2001
- To Aaron Schatz and Football Outsiders for providing charting data to
correctly mark scrambles in the 2005 season
- To Aaron Schatz and [FTN Fantasy](https://ftnfantasy.com/dvoa/nfl) for
providing charting data to correctly mark scrambles in the 1999-2005
seasons
- To Lee Sharpe for curating a resource for game information
- To Timo Riske, Lau Sze Yui, Sean Clement, and Daniel Houston for many
helpful discussions regarding the development of the new `nflfastR`
Binary file modified data-raw/Scrambles 1999-2004 UPDATE for NFLfastR.xlsx
Binary file not shown.
Binary file modified data-raw/scramble_fix.rds
Binary file not shown.
Binary file modified tests/testthat/2019/2019_01_GB_CHI.rds
Binary file not shown.
Binary file modified tests/testthat/expected_pbp.rds
Binary file not shown.
