[BUG] Problematic fixed_drive behavior linked to erroneous changes in posteam #496

Open
2 tasks done
ahmed-cheema opened this issue Nov 26, 2024 · 4 comments

Comments

@ahmed-cheema

Is there an existing issue for this?

  • I have searched the existing issues

Have you installed the latest development version of the package(s) in question?

  • I have installed the latest development version of the package.

If this is a data issue, have you tried clearing your nflverse cache?

I have cleared my nflverse cache and the issue persists.

What version of the package do you have?

4.6.1.9020

Describe the bug

Summary

fixed_drive seems to be incremented in cases where it should not be (consecutive timeouts, other misc. instances), likely because of erroneous changes in posteam. This leads to NA values in fixed_drive_result.

I discovered these instances in running this code:

library(tidyverse)
library(nflverse)

nflreadr::.clear_cache()

pbp <- load_pbp(seasons = 1999:2024)

errors <- pbp %>% 
    filter(kickoff_attempt == 0) %>% # cases with a kickoff fumble or kickoff re-do seem to be fine
    select(game_id, fixed_drive, fixed_drive_result) %>% 
    distinct() %>%
    filter(is.na(fixed_drive_result))

There are 134 instances in which fixed_drive_result is NA, which seems like something that shouldn't really be happening. Some situations that seem to result in this:

Consecutive Timeouts

pbp %>% 
    filter(game_id == "1999_15_GB_MIN") %>% 
    filter(drive >= 25) %>% 
    select(drive, fixed_drive, fixed_drive_result, posteam, desc) %>% 
    print(n=300)

[Screenshot: output of the query above]

  • GB throws an interception at row 6.
  • Next play at row 7 has MIN as posteam, and fixed_drive is incremented to 25. That seems correct. But fixed_drive_result is NA, which is odd because the drive ends in a turnover on downs.
  • Row 8 has a timeout. posteam is empty which presumably prompts fixed_drive to be incremented?
  • Row 9 has another timeout. posteam is back to MIN and fixed_drive is incremented again, and now fixed_drive_result is correct.

Not sure how fixed_drive works but I suspect that if it's based on changes in posteam, then the empty string in row 8 messes it up.
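To illustrate that suspicion, here is a minimal sketch on toy data. The `count_drives()` helper is hypothetical: it assumes a drive counter that increments whenever posteam changes, which may not be how nflfastR actually computes fixed_drive. The toy rows only mimic the pattern described above and are not real play-by-play data.

```r
library(dplyr)
library(tidyr)

# Toy play sequence modeled on the rows described above (not real pbp data):
# an interception, then a MIN drive interrupted by two timeouts, the first of
# which has an empty posteam.
plays <- tibble(
  row     = 1:5,
  posteam = c("GB", "MIN", "", "MIN", "MIN"),
  desc    = c("interception", "run", "timeout", "timeout", "pass")
)

# Hypothetical counter (an assumption, not nflfastR's logic): start a new
# drive whenever posteam differs from the previous row.
count_drives <- function(posteam) {
  changed <- posteam != dplyr::lag(posteam, default = dplyr::first(posteam))
  cumsum(changed) + 1L
}

plays <- plays %>%
  mutate(
    naive_drive  = count_drives(posteam),
    # Treat the empty string as missing so it can be filled.
    posteam_fill = na_if(posteam, "")
  ) %>%
  # Carry the last known posteam forward over the empty timeout row.
  fill(posteam_fill, .direction = "down") %>%
  mutate(filled_drive = count_drives(posteam_fill))

print(plays)
```

With the naive counter, the empty posteam on the first timeout splits the MIN possession into three drive numbers. After the forward fill, the whole possession shares one number.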

Another example:

pbp %>% 
    filter(game_id == "1999_13_SEA_OAK") %>% 
    filter(drive >= 7) %>% 
    select(drive, fixed_drive, fixed_drive_result, posteam, desc) %>% 
    print(n=300)

[Screenshot: output of the query above]

The first of two consecutive timeouts again has posteam as an empty string, which leads to fixed_drive being incremented. Plays within the same drive have fixed_drive values of 8 and 10, and half of the drive doesn't have a fixed_drive_result value.

Miscellaneous

pbp %>% 
    filter(game_id == "2010_12_CAR_CLE") %>% 
    filter(drive >= 21) %>% 
    select(drive, fixed_drive, fixed_drive_result, posteam, desc) %>% 
    print(n=300)

[Screenshot: output of the query above]

  • In this case, we have a Cleveland drive beginning at row 6 after an interception on row 5. fixed_drive seems to correctly be 22.
  • At row 11, a review occurs during the Cleveland drive and posteam is switched to CAR, which results in fixed_drive incrementing.
  • In row 12, posteam is again CLE (so fixed_drive increments again) and Cleveland punts. fixed_drive_result is thus Punt.

Rows 6-12 should probably all have fixed_drive set to 22 and fixed_drive_result set to Punt, but the review at row 11 seems to mess things up.

Another example:

pbp %>% 
    filter(game_id == "2002_13_STL_PHI") %>% 
    filter(drive >= 14) %>% 
    select(drive, fixed_drive, fixed_drive_result, posteam, down, ydstogo, desc) %>% 
    print(n=300)

[Screenshot: output of the query above]

  • Row 1: On the first play of fixed_drive 14, posteam PHI fumbles and LA recovers.
  • At row 2, fixed_drive is incremented and posteam is set to LA - this is right.
  • Fast forward to row 8 during the same drive where "Eagles charged a timeout for attending to an injured player". posteam is set to PHI, despite it being the same drive.
  • As expected, this increments fixed_drive, it increments again on the next row when posteam is again LA, and fixed_drive_result for previous rows is NA when it should be Field goal.
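A rough way to hunt for more cases like these two might be to flag non-play rows (timeouts, reviews) whose posteam differs from both the previous and the next row. The sketch below runs on toy data. The `one_row_flip` column and the regular expression on desc are assumptions made for illustration, and real descriptions may need a broader pattern.

```r
library(dplyr)

# Toy sequence mimicking the CAR/CLE pattern described above (not real data):
# a CLE drive where a review row is assigned to CAR, followed by the CLE punt.
plays <- tibble(
  row     = 1:6,
  posteam = c("CLE", "CLE", "CAR", "CLE", "CAR", "CAR"),
  desc    = c("pass", "run", "play under review", "punt", "run", "pass")
)

flagged <- plays %>%
  mutate(
    prev_team = lag(posteam),
    next_team = lead(posteam),
    # Heuristic (an assumption): only administrative rows are suspicious.
    is_admin  = grepl("timeout|review", desc, ignore.case = TRUE),
    # posteam flips away from both neighbors on an administrative row.
    one_row_flip = is_admin & posteam != prev_team & posteam != next_team
  ) %>%
  filter(one_row_flip)

print(flagged)
```

Restricting the flag to administrative rows keeps the punt in row 4 from being flagged, even though its neighbors both show CAR once the review row is mislabeled.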

Reprex

library(tidyverse)
library(nflverse)

nflreadr::.clear_cache()

pbp <- load_pbp(seasons = 1999:2024)

errors <- pbp %>% 
    filter(kickoff_attempt == 0) %>% # cases with a kickoff fumble or kickoff re-do seem to be fine
    select(game_id, fixed_drive, fixed_drive_result) %>% 
    distinct() %>%
    filter(is.na(fixed_drive_result))

Expected Behavior

It seems like posteam should not be changing within these drives, which would keep fixed_drive consistent throughout the drive.

nflverse_sitrep

── System Info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.4.2 (2024-10-31 ucrt) • Running under: Windows 11 x64 (build 22631)
── Package Status ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   package  installed  cran        dev behind
1   nfl4th 1.0.4.9005 1.0.4 1.0.4.9005       
2 nflfastR 4.6.1.9020 4.6.1 4.6.1.9020       
3 nflplotR 1.4.0.9001 1.4.0 1.4.0.9001       
4 nflreadr   1.4.1.05 1.4.1   1.4.1.05       
5 nflseedR 1.2.0.9901 1.2.0 1.2.0.9901       
6 nflverse      1.0.3 1.0.3      1.0.3       
── Package Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• askpass     (1.2.1)    • htmlwidgets  (1.6.4)     • snakecase   (0.11.1)   
• backports   (1.5.0)    • httr         (1.4.7)     • stringi     (1.8.4)    
• base64enc   (0.1-3)    • isoband      (0.2.7)     • stringr     (1.5.1)    
• bigD        (0.3.0)    • janitor      (2.2.0)     • sys         (3.4.3)    
• bitops      (1.0-9)    • jquerylib    (0.1.4)     • tibble      (3.2.1)    
• bslib       (0.8.0)    • jsonlite     (1.8.9)     • tidyr       (1.3.1)    
• cachem      (1.1.0)    • juicyjuice   (0.1.0)     • tidyselect  (1.2.1)    
• cli         (3.6.3)    • knitr        (1.49)      • timechange  (0.3.0)    
• colorspace  (2.1-1)    • labeling     (0.4.3)     • tinytex     (0.54)     
• commonmark  (1.9.2)    • lifecycle    (1.0.4)     • utf8        (1.2.4)    
• cpp11       (0.5.0)    • listenv      (0.9.1)     • V8          (6.0.0)    
• crayon      (1.5.3)    • lubridate    (1.9.3)     • vctrs       (0.6.5)    
• curl        (6.0.1)    • magick       (2.8.5)     • viridisLite (0.4.2)    
• data.table  (1.16.2)   • magrittr     (2.0.3)     • withr       (3.0.2)    
• digest      (0.6.37)   • markdown     (1.13)      • xfun        (0.49)     
• dplyr       (1.1.4)    • memoise      (2.0.1)     • xgboost     (1.7.8.1)  
• evaluate    (1.0.1)    • mime         (0.12)      • xml2        (1.3.6)    
• fansi       (1.0.6)    • munsell      (0.5.1)     • yaml        (2.3.10)   
• farver      (2.1.2)    • openssl      (2.2.2)     • codetools   (0.2-20)   
• fastmap     (1.2.0)    • parallelly   (1.39.0)    • compiler    (4.4.2)    
• fastrmodels (1.0.2)    • pillar       (1.9.0)     • graphics    (4.4.2)    
• fontawesome (0.5.3)    • pkgconfig    (2.0.3)     • grDevices   (4.4.2)    
• fs          (1.6.5)    • progressr    (0.15.1)    • grid        (4.4.2)    
• furrr       (0.3.1)    • proto        (1.0.0)     • lattice     (0.22-6)   
• future      (1.34.0)   • purrr        (1.0.2)     • MASS        (7.3-61)   
• generics    (0.1.3)    • R6           (2.5.1)     • Matrix      (1.7-1)    
• ggpath      (1.0.2)    • rappdirs     (0.3.3)     • methods     (4.4.2)    
• ggplot2     (3.5.1)    • RColorBrewer (1.1-3)     • mgcv        (1.9-1)    
• globals     (0.16.3)   • Rcpp         (1.0.13-1)  • nlme        (3.1-166)  
• glue        (1.8.0)    • reactable    (0.4.4)     • parallel    (4.4.2)    
• gsubfn      (0.7)      • reactR       (0.6.1)     • splines     (4.4.2)    
• gt          (0.11.1)   • rlang        (1.1.4)     • stats       (4.4.2)    
• gtable      (0.3.6)    • rmarkdown    (2.29)      • tools       (4.4.2)    
• highr       (0.11)     • rstudioapi   (0.17.1)    • utils       (4.4.2)    
• hms         (1.1.3)    • sass         (0.4.9)       
• htmltools   (0.5.8.1)  • scales       (1.3.0)       
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Screenshots

No response

Additional context

No response

@mrcaseb
Member

mrcaseb commented Nov 26, 2024

Some initial thoughts on this:

  • Most problematic drives are in the 1999 and 2000 seasons. Data quality of those seasons is significantly worse compared to 2001+. I am not sure we are going to fix this.
  • 50 NA drive results in 2001+ are the very first row of the game with play description "GAME". We could try to avoid setting fixed_drive for those plays but it hurts nobody and the subsequent drives are counted correctly
  • There are all sorts of weird things in the remaining 35 plays mostly related to special teams plays. Most of them seem to be accurate.
  • The actual interesting plays are the ones where posteam randomly changes on timeouts, reviews or other abnormal "plays"

@mrcaseb
Member

mrcaseb commented Nov 26, 2024

Oh, and there is the first drive of 2019_15_CLE_ARI, which we recently learned is completely broken in our source.

@mrcaseb
Member

mrcaseb commented Nov 26, 2024

So here is code to filter down to relevant games

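library(dplyr)
library(gt)

# pbp_db is not defined in this thread; it is assumed to be a lazy database
# table of nflfastR play-by-play data.
pbp_db |> 
  # skip 1999-2000 and the "GAME" header rows discussed above
  filter(season >= 2001, is.na(fixed_drive_result), desc != "GAME") |> 
  # keep only games with more than one affected row
  filter(n() > 1, .by = game_id) |> 
  select(game_id, play_id, posteam, fixed_drive, desc) |> 
  collect() |> 
  # one row group per game
  gt(groupname_col = "game_id") |> 
  tab_options(
    row_group.background.color = "gray"
  )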

[Screenshot: gt table produced by the code above]

@mrcaseb
Member

mrcaseb commented Nov 26, 2024

The 2002 TEN OAK game is just two return TDs followed by some random play description with swapped posteam. We can ignore this.

So it comes down to 2002_13_STL_PHI and 2010_12_CAR_CLE. I think there is no easy fix other than hard coding posteam of those two plays.
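A hard-coded fix along those lines could look roughly like the sketch below, which patches posteam from a small override table. Everything here is illustrative: the `overrides` table, the `posteam_fix` column and especially the play_id values are placeholders, since the real IDs of the two plays are not given in this thread.

```r
library(dplyr)

# Hypothetical override table. The play_id values are placeholders, NOT the
# real IDs of the affected plays.
overrides <- tibble(
  game_id     = c("2002_13_STL_PHI", "2010_12_CAR_CLE"),
  play_id     = c(1001, 2002),
  posteam_fix = c("LA", "CLE")
)

# Toy stand-in for the play-by-play data (not real rows).
pbp_toy <- tibble(
  game_id = c("2002_13_STL_PHI", "2002_13_STL_PHI",
              "2010_12_CAR_CLE", "2010_12_CAR_CLE"),
  play_id = c(1000, 1001, 2002, 2003),
  posteam = c("LA", "PHI", "CAR", "CLE")
)

patched <- pbp_toy %>%
  left_join(overrides, by = c("game_id", "play_id")) %>%
  # Use the override where one exists, otherwise keep the original posteam.
  mutate(posteam = coalesce(posteam_fix, posteam)) %>%
  select(-posteam_fix)

print(patched)
```

Keeping the overrides in a table rather than inline conditions would make it easy to add further plays later, at the cost of one extra join.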
