[BUG] nflfastr::calculate_player_stats returns duplicate rows for defense and kicker #476

Closed
1 task done
isaactpetersen opened this issue Jul 31, 2024 · 4 comments · Fixed by #470

Comments

isaactpetersen commented Jul 31, 2024

Is there an existing issue for this?

  • I have searched the existing issues

If this is a data issue, have you tried clearing your nflverse cache?

I have cleared my nflverse cache and the issue persists.

What version of the package do you have?

nflreadr 1.4.1

Describe the bug

The player stats data (from the load_player_stats() function) contain duplicated player_id-season-week combinations. I cannot think of a reason why the same player would have multiple rows for a given season and week. If, as I suspect, this is not possible, then this is a data issue to fix. If I'm incorrect and the same player can plausibly have multiple rows for a given season-week combination, it would be helpful to know the circumstances in which this can arise. This matters for merging with other datasets, because I need to be sure I am merging the information onto the correct player_id-season-week combination.

Reprex

library("nflreadr")
library("dplyr")
#> Warning: package 'dplyr' was built under R version 4.3.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Load Data
offenseStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "offense")

defenseStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "defense")

kickingStats_weekly <- load_player_stats(
    seasons = TRUE,
    stat_type = "kicking")

# Rearrange variables
offenseStats_weekly <- offenseStats_weekly %>% 
  select(player_id, season, week, everything())

defenseStats_weekly <- defenseStats_weekly %>% 
  select(player_id, season, week, everything())

kickingStats_weekly <- kickingStats_weekly %>% 
  select(player_id, season, week, everything())

# Offense: No duplicate id-season-week combinations
offenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 0 × 53
#> # Groups:   player_id, season, week [0]
#> # ℹ 53 variables: player_id <chr>, season <int>, week <int>, player_name <chr>,
#> #   player_display_name <chr>, position <chr>, position_group <chr>,
#> #   headshot_url <chr>, recent_team <chr>, season_type <chr>,
#> #   opponent_team <chr>, completions <int>, attempts <int>,
#> #   passing_yards <dbl>, passing_tds <int>, interceptions <dbl>, sacks <dbl>,
#> #   sack_yards <dbl>, sack_fumbles <int>, sack_fumbles_lost <int>,
#> #   passing_air_yards <dbl>, passing_yards_after_catch <dbl>, …

# Defense
defenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 496 × 32
#> # Groups:   player_id, season, week [183]
#>    player_id season  week season_type player_name player_display_name position
#>    <chr>      <int> <int> <chr>       <chr>       <chr>               <chr>   
#>  1 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  2 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  3 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  4 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  5 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  6 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  7 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  8 0           1999     1 REG         <NA>        <NA>                <NA>    
#>  9 0           1999     2 REG         <NA>        <NA>                <NA>    
#> 10 0           1999     2 REG         <NA>        <NA>                <NA>    
#> # ℹ 486 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> #   def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> #   def_tackle_assists <int>, def_tackles_for_loss <int>,
#> #   def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> #   def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> #   def_interceptions <dbl>, def_interception_yards <dbl>, …

defenseStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1, player_id != "0") # not sure why there are player_ids of "0"; exclude them
#> # A tibble: 296 × 32
#> # Groups:   player_id, season, week [148]
#>    player_id  season  week season_type player_name player_display_name position
#>    <chr>       <int> <int> <chr>       <chr>       <chr>               <chr>   
#>  1 00-0002919   1999     4 REG         <NA>        Corey Chavous       SS      
#>  2 00-0002919   1999     4 REG         <NA>        Corey Chavous       SS      
#>  3 00-0004543   1999    12 REG         <NA>        Shane Dronett       DT      
#>  4 00-0004543   1999    12 REG         <NA>        Shane Dronett       DT      
#>  5 00-0004915   1999    16 REG         <NA>        Bobby Engram        WR      
#>  6 00-0004915   1999    16 REG         <NA>        Bobby Engram        WR      
#>  7 00-0010668   1999    20 POST        <NA>        Keenan McCardell    WR      
#>  8 00-0010668   1999    20 POST        <NA>        Keenan McCardell    WR      
#>  9 00-0011392   1999    14 REG         <NA>        Basil Mitchell      RB      
#> 10 00-0011392   1999    14 REG         <NA>        Basil Mitchell      RB      
#> # ℹ 286 more rows
#> # ℹ 25 more variables: position_group <chr>, headshot_url <chr>, team <chr>,
#> #   def_tackles <int>, def_tackles_solo <int>, def_tackles_with_assist <int>,
#> #   def_tackle_assists <int>, def_tackles_for_loss <int>,
#> #   def_tackles_for_loss_yards <dbl>, def_fumbles_forced <int>,
#> #   def_sacks <dbl>, def_sack_yards <dbl>, def_qb_hits <dbl>,
#> #   def_interceptions <dbl>, def_interception_yards <dbl>, …

# Kicking

kickingStats_weekly %>% 
  group_by(player_id, season, week) %>% 
  filter(n() > 1)
#> # A tibble: 4 × 44
#> # Groups:   player_id, season, week [2]
#>   player_id  season  week season_type team  player_name player_display_name
#>   <chr>       <int> <int> <chr>       <chr> <chr>       <chr>              
#> 1 00-0004811   2000    11 REG         DEN   <NA>        Jason Elam         
#> 2 00-0004811   2000    11 REG         LV    <NA>        Jason Elam         
#> 3 00-0012875   2002     4 REG         PIT   <NA>        Todd Peterson      
#> 4 00-0012875   2002     4 REG         PIT   <NA>        Todd Peterson      
#> # ℹ 37 more variables: position <chr>, position_group <chr>,
#> #   headshot_url <chr>, fg_made <int>, fg_att <dbl>, fg_missed <int>,
#> #   fg_blocked <int>, fg_long <dbl>, fg_pct <dbl>, fg_made_0_19 <int>,
#> #   fg_made_20_29 <int>, fg_made_30_39 <int>, fg_made_40_49 <int>,
#> #   fg_made_50_59 <int>, fg_made_60_ <int>, fg_missed_0_19 <int>,
#> #   fg_missed_20_29 <int>, fg_missed_30_39 <int>, fg_missed_40_49 <int>,
#> #   fg_missed_50_59 <int>, fg_missed_60_ <int>, fg_made_list <chr>, …

sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4    nflreadr_1.4.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.3         knitr_1.48        rlang_1.1.4      
#>  [5] xfun_0.46         generics_0.1.3    data.table_1.15.4 glue_1.7.0       
#>  [9] htmltools_0.5.8.1 fansi_1.0.6       rmarkdown_2.27    evaluate_0.24.0  
#> [13] tibble_3.2.1      fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
#> [17] memoise_2.0.1     compiler_4.3.1    fs_1.6.4          pkgconfig_2.0.3  
#> [21] rstudioapi_0.16.0 digest_0.6.36     R6_2.5.1          tidyselect_1.2.1 
#> [25] reprex_2.1.1      utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3   
#> [29] tools_4.3.1       withr_3.0.0       cachem_1.1.0

Created on 2024-07-31 with reprex v2.1.1
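
To narrow down what distinguishes the duplicated rows (for example, the two Jason Elam rows above differ in team), a small diagnostic along the following lines may help. This is only a sketch: `differing_columns()` is a hypothetical helper, not part of nflreadr or nflfastR, and the `fg_made` values in the toy data are invented for illustration.

```r
# Toy stand-in for the kicking rows flagged above; fg_made values are invented.
kicking_toy <- data.frame(
  player_id = c("00-0004811", "00-0004811", "00-0012875", "00-0012875"),
  season    = c(2000L, 2000L, 2002L, 2002L),
  week      = c(11L, 11L, 4L, 4L),
  team      = c("DEN", "LV", "PIT", "PIT"),
  fg_made   = c(2L, 0L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every key that occurs more than once, return the
# names of the non-key columns whose values differ between the duplicate rows.
differing_columns <- function(df, keys = c("player_id", "season", "week")) {
  key_id <- do.call(paste, c(df[keys], sep = "|"))
  groups <- split(df, key_id)
  groups <- groups[vapply(groups, nrow, integer(1)) > 1L]
  lapply(groups, function(g) {
    value_cols <- setdiff(names(g), keys)
    differs <- vapply(g[value_cols], function(x) length(unique(x)) > 1L, logical(1))
    value_cols[differs]
  })
}

diffs <- differing_columns(kicking_toy)
diffs
```

If `team` shows up among the differing columns for a key, the duplicates are probably a team-attribution problem rather than exact row copies.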

Expected Behavior

I expect each player (i.e., player_id) to have only one row for a given season-week combination.
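
Until the data are fixed, this expectation can be enforced as a guard before any merge. The following is a minimal base R sketch; `assert_unique_key()` is a hypothetical helper name and the toy data frames are invented, so treat it as an illustration rather than a recommended API.

```r
# Hypothetical guard: stop if any key combination occurs more than once.
assert_unique_key <- function(df, keys = c("player_id", "season", "week")) {
  # Flag every row that shares its key with at least one other row.
  dup <- duplicated(df[keys]) | duplicated(df[keys], fromLast = TRUE)
  if (any(dup)) {
    stop(
      sprintf("%d rows share a %s key with another row",
              sum(dup), paste(keys, collapse = "-")),
      call. = FALSE
    )
  }
  invisible(df)
}

# Invented toy data: one clean table and one with a duplicated key.
clean <- data.frame(player_id = c("a", "b"), season = 2000L, week = 1L)
dirty <- rbind(clean, clean[1, ])

assert_unique_key(clean)                      # passes silently
result <- tryCatch(assert_unique_key(dirty),  # captures the error message
                   error = function(e) conditionMessage(e))
```

Calling the guard on each table right before `merge()` or a dplyr join makes a key violation fail loudly instead of silently multiplying rows.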

nflverse_sitrep

> nflreadr::nflverse_sitrep()
── System Info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• R version 4.3.1 (2023-06-16 ucrt) • Running under: Windows 11 x64 (build 22631)
── Package Status ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   package installed  cran        dev behind
1   nfl4th     1.0.4 1.0.4 1.0.4.9002    dev
2 nflfastR     4.6.1 4.6.1 4.6.1.9010    dev
3 nflplotR     1.3.1 1.3.1      1.3.1       
4 nflreadr     1.4.1 1.4.1   1.4.1.00       
── Package Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• No options set for above packages
── Package Dependencies ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• askpass     (1.2.0)    • httr         (1.4.7)   • stringi     (1.8.4)       
• backports   (1.5.0)    • isoband      (0.2.7)   • stringr     (1.5.1)       
• base64enc   (0.1-3)    • janitor      (2.2.0)   • sys         (3.4.2)       
• bigD        (0.2.0)    • jquerylib    (0.1.4)   • tibble      (3.2.1)       
• bitops      (1.0-8)    • jsonlite     (1.8.8)   • tidyr       (1.3.1)       
• bslib       (0.8.0)    • juicyjuice   (0.1.0)   • tidyselect  (1.2.1)       
• cachem      (1.1.0)    • knitr        (1.48)    • timechange  (0.3.0)       
• cli         (3.6.3)    • labeling     (0.4.3)   • tinytex     (0.52)        
• colorspace  (2.1-1)    • lifecycle    (1.0.4)   • utf8        (1.2.4)       
• commonmark  (1.9.1)    • listenv      (0.9.1)   • V8          (4.4.2)       
• cpp11       (0.4.7)    • lubridate    (1.9.3)   • vctrs       (0.6.5)       
• curl        (5.2.1)    • magick       (2.8.4)   • viridisLite (0.4.2)       
• data.table  (1.15.4)   • magrittr     (2.0.3)   • withr       (3.0.0)       
• digest      (0.6.36)   • markdown     (1.13)    • xfun        (0.46)        
• dplyr       (1.1.4)    • Matrix       (1.6-5)   • xgboost     (1.7.8.1)     
• evaluate    (0.24.0)   • memoise      (2.0.1)   • xml2        (1.3.6)       
• fansi       (1.0.6)    • mime         (0.12)    • yaml        (2.3.10)      
• farver      (2.1.2)    • munsell      (0.5.1)   • codetools   (0.2-20)      
• fastmap     (1.2.0)    • openssl      (2.2.0)   • compiler    (4.3.1)       
• fastrmodels (1.0.2)    • parallelly   (1.38.0)  • graphics    (4.3.1)       
• fontawesome (0.5.2)    • pillar       (1.9.0)   • grDevices   (4.3.1)       
• fs          (1.6.4)    • pkgconfig    (2.0.3)   • grid        (4.3.1)       
• furrr       (0.3.1)    • progressr    (0.14.0)  • lattice     (0.22-6)      
• future      (1.34.0)   • purrr        (1.0.2)   • MASS        (7.3-60.0.1)  
• generics    (0.1.3)    • R6           (2.5.1)   • Matrix      (1.6-5)       
• ggpath      (1.0.1)    • rappdirs     (0.3.3)   • methods     (4.3.1)       
• ggplot2     (3.5.1)    • RColorBrewer (1.1-3)   • mgcv        (1.9-1)       
• globals     (0.16.3)   • Rcpp         (1.0.13)  • nlme        (3.1-165)     
• glue        (1.7.0)    • reactable    (0.4.4)   • parallel    (4.3.1)       
• gt          (0.11.0)   • reactR       (0.6.0)   • splines     (4.3.1)       
• gtable      (0.3.5)    • rlang        (1.1.4)   • stats       (4.3.1)       
• highr       (0.11)     • rmarkdown    (2.27)    • tools       (4.3.1)       
• hms         (1.1.3)    • sass         (0.4.9)   • utils       (4.3.1)       
• htmltools   (0.5.8.1)  • scales       (1.3.0)     
• htmlwidgets (1.6.4)    • snakecase    (0.11.1)    
── Not Installed ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• nflseedR ()
• nflverse ()

Screenshots

No response

Additional context

No response

tanho63 changed the title from "[BUG] <title>" to "[BUG] nflfastr::calculate_player_stats returns duplicate rows for defense and kicker" on Jul 31, 2024
tanho63 (Member) commented Jul 31, 2024

Relocating to nflfastR repo

tanho63 transferred this issue from nflverse/nflreadr on Jul 31, 2024
mrcaseb (Member) commented Aug 1, 2024

Looking at the problematic defense data, it seems like players get attributed to the opponent team in some cases when they get a fumble recovery or penalty.

CORRECTION: I think we assign tackles after turnovers to the wrong team.

So the main thing might be that an offensive player records a defensive stat after the offense turned over the ball.
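
If that diagnosis holds, a player could end up with one row per attributed team in the same week. As a downstream stopgap (not the fix discussed in this thread), the split rows can be collapsed by summing the stat columns per player_id-season-week. The sketch below uses invented player IDs, teams, and values, and it assumes the duplicate rows hold complementary stats rather than exact copies; summing exact copies would double count.

```r
# Invented toy data: player 00-0000001 is split across two teams in one week.
defense_toy <- data.frame(
  player_id   = c("00-0000001", "00-0000001", "00-0000002"),
  season      = 1999L,
  week        = 4L,
  team        = c("AAA", "BBB", "AAA"),
  def_tackles = c(1L, 2L, 5L),
  def_sacks   = c(0, 1, 0.5),
  stringsAsFactors = FALSE
)

# Sum the stat columns within each player_id-season-week key.
# The team column is dropped on purpose because its value is what is in doubt.
# Note: the formula interface of aggregate() omits rows with NA stats by default.
collapsed <- aggregate(
  cbind(def_tackles, def_sacks) ~ player_id + season + week,
  data = defense_toy,
  FUN  = sum
)
collapsed
```

After collapsing, each key appears once, so a merge on player_id, season, and week no longer fans out.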

mrcaseb (Member) commented Aug 1, 2024

This might be quite hard to fix, and we should probably invest the time in #470 instead.

mrcaseb (Member) commented Oct 16, 2024

We will deprecate the calculate_player_stats_*() functions in a future release. The new function calculate_stats() (#470) will fix the issue.

mrcaseb linked a pull request on Oct 16, 2024 that will close this issue

3 participants