Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add_cumu_hazard does not aggregate hazard values correctly if input data are not sorted correctly #227

Open
fabian-s opened this issue Mar 8, 2023 · 10 comments

Comments

@fabian-s
Copy link
Collaborator

fabian-s commented Mar 8, 2023

library(pammtools)
#> 
#> Attaching package: 'pammtools'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(survival)
library(tidyverse)

veteran <- survival::veteran  |> filter(time < 400)  |> #reduce data for ex
  mutate(trt   = 1L * (trt == 2),
         prior = 1L * (prior == 10))
veteran_ped <- veteran |> 
  as_ped(Surv(time, status)~., id = "id")

# equivalent models:
vet_cox_strata <- coxph(Surv(time, status) ~ strata(celltype),
                        data = veteran)
vet_pam_strata <- pamm(ped_status ~ s(tend) + celltype + s(tend, by = celltype),
                       data = veteran_ped)


pam_fit_correct <- veteran_ped |> ped_info() |> slice(rep(1:n(), each = 4)) |> 
  mutate(celltype = rep(levels(celltype), times = n()/4)) |> 
  group_by(celltype) |> arrange(tstart) |>  # !!!!!!  
  add_cumu_hazard(vet_pam_strata)
#> Warning in warn_about_new_time_points.pamm(object, newdata, time_var): Time
#> points/intervals in new data not equivalent to time points/intervals during
#> model fit. Setting intervals to values not used for original fitcan invalidate
#> the PEM assumption and yield incorrect predictions.

pam_fit_wrong <- veteran_ped |> ped_info() |> slice(rep(1:n(), each = 4)) |> 
  mutate(celltype = rep(levels(celltype), times = n()/4)) |> 
  add_cumu_hazard(vet_pam_strata) 
#> Warning in warn_about_new_time_points.pamm(object, newdata, time_var): Time
#> points/intervals in new data not equivalent to time points/intervals during
#> model fit. Setting intervals to values not used for original fitcan invalidate
#> the PEM assumption and yield incorrect predictions.

cox_fit <- vet_cox_strata |> basehaz() 

ggplot(pam_fit_correct, aes(x = tend, y = cumu_hazard)) + geom_step(aes(col = celltype)) +
  geom_line(data = cox_fit, aes(x = time, y = hazard, col = strata), linetype = 2)

ggplot(pam_fit_wrong, aes(x = tend, y = cumu_hazard)) + geom_step(aes(col = celltype)) +
  geom_line(data = cox_fit, aes(x = time, y = hazard, col = strata), linetype = 2)

Created on 2023-03-08 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 Patched (2022-11-10 r83330)
#>  os       Linux Mint 20
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_US
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-03-08
#>  pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version    date (UTC) lib source
#>  assertthat      0.2.1      2019-03-21 [1] CRAN (R 4.2.2)
#>  backports       1.4.1      2021-12-13 [1] CRAN (R 4.2.2)
#>  broom           1.0.1      2022-08-29 [1] CRAN (R 4.2.2)
#>  cellranger      1.1.0      2016-07-27 [1] CRAN (R 4.2.2)
#>  checkmate       2.1.0      2022-04-21 [1] CRAN (R 4.2.2)
#>  cli             3.6.0      2023-01-09 [1] CRAN (R 4.2.2)
#>  codetools       0.2-18     2020-11-04 [4] CRAN (R 4.0.3)
#>  colorspace      2.1-0      2023-01-23 [1] CRAN (R 4.2.2)
#>  crayon          1.5.2      2022-09-29 [1] CRAN (R 4.2.2)
#>  curl            4.3.3      2022-10-06 [1] CRAN (R 4.2.2)
#>  DBI             1.1.3      2022-06-18 [1] CRAN (R 4.2.2)
#>  dbplyr          2.2.1      2022-06-27 [1] CRAN (R 4.2.2)
#>  digest          0.6.31     2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr         * 1.0.10     2022-09-01 [1] CRAN (R 4.2.2)
#>  ellipsis        0.3.2      2021-04-29 [1] CRAN (R 4.2.2)
#>  evaluate        0.20       2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi           1.0.3      2022-03-24 [1] CRAN (R 4.2.2)
#>  farver          2.1.1      2022-07-06 [1] CRAN (R 4.2.2)
#>  fastmap         1.1.0      2021-01-25 [1] CRAN (R 4.2.2)
#>  forcats       * 0.5.2      2022-08-19 [1] CRAN (R 4.2.2)
#>  foreach         1.5.2      2022-02-02 [1] CRAN (R 4.2.2)
#>  Formula         1.2-4      2020-10-16 [1] CRAN (R 4.2.2)
#>  fs              1.5.2      2021-12-08 [1] CRAN (R 4.2.2)
#>  future          1.31.0     2023-02-01 [1] CRAN (R 4.2.2)
#>  future.apply    1.10.0     2022-11-05 [1] CRAN (R 4.2.2)
#>  gargle          1.2.1      2022-09-08 [1] CRAN (R 4.2.2)
#>  generics        0.1.3      2022-07-05 [1] CRAN (R 4.2.2)
#>  ggplot2       * 3.4.0      2022-11-04 [1] CRAN (R 4.2.2)
#>  globals         0.16.2     2022-11-21 [1] CRAN (R 4.2.2)
#>  glue            1.6.2      2022-02-24 [1] CRAN (R 4.2.2)
#>  googledrive     2.0.0      2021-07-08 [1] CRAN (R 4.2.2)
#>  googlesheets4   1.0.1      2022-08-13 [1] CRAN (R 4.2.2)
#>  gtable          0.3.1      2022-09-01 [1] CRAN (R 4.2.2)
#>  haven           2.5.1      2022-08-22 [1] CRAN (R 4.2.2)
#>  highr           0.10       2022-12-22 [1] CRAN (R 4.2.2)
#>  hms             1.1.2      2022-08-19 [1] CRAN (R 4.2.2)
#>  htmltools       0.5.3      2022-07-18 [1] CRAN (R 4.2.2)
#>  httr            1.4.4      2022-08-17 [1] CRAN (R 4.2.2)
#>  iterators       1.0.14     2022-02-05 [1] CRAN (R 4.2.2)
#>  jsonlite        1.8.4      2022-12-06 [1] CRAN (R 4.2.2)
#>  knitr           1.42       2023-01-25 [1] CRAN (R 4.2.2)
#>  labeling        0.4.2      2020-10-20 [1] CRAN (R 4.2.2)
#>  lattice         0.20-44    2021-05-02 [4] CRAN (R 4.1.0)
#>  lava            1.7.1      2023-01-06 [1] CRAN (R 4.2.2)
#>  lazyeval        0.2.2      2019-03-15 [1] CRAN (R 4.2.2)
#>  lifecycle       1.0.3      2022-10-07 [1] CRAN (R 4.2.2)
#>  listenv         0.9.0      2022-12-16 [1] CRAN (R 4.2.2)
#>  lubridate       1.9.0      2022-11-06 [1] CRAN (R 4.2.2)
#>  magrittr        2.0.3      2022-03-30 [1] CRAN (R 4.2.2)
#>  Matrix          1.5-3      2022-11-11 [1] CRAN (R 4.2.2)
#>  mgcv            1.8-41     2022-10-21 [1] CRAN (R 4.2.2)
#>  mime            0.12       2021-09-28 [1] CRAN (R 4.2.2)
#>  modelr          0.1.10     2022-11-11 [1] CRAN (R 4.2.2)
#>  munsell         0.5.0      2018-06-12 [1] CRAN (R 4.2.2)
#>  mvtnorm         1.1-3      2021-10-08 [1] CRAN (R 4.2.2)
#>  nlme            3.1-152    2021-02-04 [4] CRAN (R 4.0.3)
#>  numDeriv        2016.8-1.1 2019-06-06 [1] CRAN (R 4.2.2)
#>  pammtools     * 0.5.8      2022-01-09 [1] CRAN (R 4.2.2)
#>  parallelly      1.34.0     2023-01-13 [1] CRAN (R 4.2.2)
#>  pec             2022.05.04 2022-05-04 [1] CRAN (R 4.2.2)
#>  pillar          1.8.1      2022-08-19 [1] CRAN (R 4.2.2)
#>  pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.2.2)
#>  prodlim         2019.11.13 2019-11-17 [1] CRAN (R 4.2.2)
#>  purrr         * 1.0.1      2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache         0.16.0     2022-07-21 [1] CRAN (R 4.2.2)
#>  R.methodsS3     1.8.2      2022-06-13 [1] CRAN (R 4.2.2)
#>  R.oo            1.25.0     2022-06-12 [1] CRAN (R 4.2.2)
#>  R.utils         2.12.2     2022-11-11 [1] CRAN (R 4.2.2)
#>  R6              2.5.1      2021-08-19 [1] CRAN (R 4.2.2)
#>  Rcpp            1.0.9      2022-07-08 [1] CRAN (R 4.2.2)
#>  readr         * 2.1.3      2022-10-01 [1] CRAN (R 4.2.2)
#>  readxl          1.4.1      2022-08-17 [1] CRAN (R 4.2.2)
#>  reprex          2.0.2      2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang           1.0.6      2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown       2.18       2022-11-09 [1] CRAN (R 4.2.2)
#>  rstudioapi      0.14       2022-08-22 [1] CRAN (R 4.2.2)
#>  rvest           1.0.3      2022-08-19 [1] CRAN (R 4.2.2)
#>  scales          1.2.1      2022-08-20 [1] CRAN (R 4.2.2)
#>  sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.2.2)
#>  stringi         1.7.8      2022-07-11 [1] CRAN (R 4.2.2)
#>  stringr       * 1.5.0      2022-12-02 [1] CRAN (R 4.2.2)
#>  styler          1.8.1      2022-11-07 [1] CRAN (R 4.2.2)
#>  survival      * 3.2-11     2021-04-26 [4] CRAN (R 4.0.5)
#>  tibble        * 3.1.8      2022-07-22 [1] CRAN (R 4.2.2)
#>  tidyr         * 1.3.0      2023-01-24 [1] CRAN (R 4.2.2)
#>  tidyselect      1.2.0      2022-10-10 [1] CRAN (R 4.2.2)
#>  tidyverse     * 1.3.2      2022-07-18 [1] CRAN (R 4.2.2)
#>  timechange      0.1.1      2022-11-04 [1] CRAN (R 4.2.2)
#>  timereg         2.0.5      2023-01-17 [1] CRAN (R 4.2.2)
#>  tzdb            0.3.0      2022-03-28 [1] CRAN (R 4.2.2)
#>  utf8            1.2.2      2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs           0.5.2      2023-01-23 [1] CRAN (R 4.2.2)
#>  withr           2.5.0      2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun            0.37       2023-01-31 [1] CRAN (R 4.2.2)
#>  xml2            1.3.3      2021-11-30 [1] CRAN (R 4.2.2)
#>  yaml            2.3.7      2023-01-23 [1] CRAN (R 4.2.2)
#> 
#>  [1] /home/fabians/R/x86_64-pc-linux-gnu-library/4.2
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@adibender
Copy link
Owner

bei Beispiel 2 musst du aber noch zusätzlich nach celltype gruppieren?

@fabian-s
Copy link
Collaborator Author

fabian-s commented Mar 8, 2023

ja - das ist halt eben nicht dokumentiert dass man das machen muss und sonst grütze raus kommt

@adibender
Copy link
Owner

adibender commented Mar 8, 2023

i see yes. Thought about it, not sure how to handle. No way to know that it is grouping factor rather than e.g. time-dependent covariate..
Suggestions? Group automatically within add_cumu_hazard when same timepoint occurs more than once and send message to console?

@adibender
Copy link
Owner

@fabian-s Ideas how to do it technically? check if nrow(unique(newdata) ) != length(unique(tend)) and if so group internally by all variables that are not "PED" variables?

@fabian-s
Copy link
Collaborator Author

fabian-s commented Feb 27, 2024

"group internally by all variables that are not "PED""

and then do arrange(tend)

wouldn't that break if time-dependent covariates with any duplicated values are present, like you said yourself 2 comments above?

if that can be avoided that seems like a reasonable fix.

since this seems fragile/intransparent, i'd suggest to make prepping the input data the user's responsibility:

verify that the input data for add_cumu_hazard has a tend column that only contains strictly increasing sub-sequences of timepoints (exception/complication: need to allow repeats of "sequences of length one" if they only contain tend for the first interval), and if that's not the case throw an informative error with an explanation that a suitably grouped/arranged input is required.

EDIT:

s.th like:

sequence_breaks <- which(diff(newdata$tend) <= 0) + 1
stopifnot(all (newdata$tend[sequence_breaks] == min(newdata$tend))

BTW: code above should probably check against the actual minimal tend used in the model and not just the minimal tend in newdata -- this then also avoids another possible fuck-up for add_cumu_hazard if the input data does not satisfy the implicit assumption that the PED data start at $t_0$ for all groups for which $\Lambda(t)$ is to be generated....

@adibender
Copy link
Owner

adibender commented Feb 27, 2024

wouldn't that break if time-dependent covariates with any duplicated values are present, like you said yourself 2 comments above?

so my thinking was, if the same tend value appears multiple times, than we should group, such that each sequence of non-unique timepoints is one group. In case of TDCs with concurrent effect we'd have multiple values of the covar, but a unique sequence of time-points, so no grouping
there might, however be a problem when we have cumulative effects, it will depend on how the data set used for prediction is structured 🤔

@adibender
Copy link
Owner

yap, probably need to go back to the drawing board...
user could also specify specific time-points to calculate e.g. the survival probability, but we'd need to create the sequence of all previous hazards anyway.
Right now we rely on the user to do the right thing (given the examples in vignettes)

@fabian-s
Copy link
Collaborator Author

there might, however be a problem when we have cumulative effects, it will depend on how the data set used for prediction is structured
user could also specify specific time-points to calculate e.g. the survival probability, but we'd need to create the sequence of all previous hazards anyway.

don't have enough detailed context to know what the PED for cumul. effects etc would look like, but this all seems to reinforce my general point above that it would be cleaner to check the minimal conditions the input data HAS to satisfy instead of trying to guess what the user may have wanted to achieve and then automagically modify the input accordingly -- these conditions includes at least:

  • every subsequence of tend has to start at $t_0$
  • every subsequence of tend includes all intervals in ascending order up to the desired endpoint (so disregard the code i posted above)

@adibender
Copy link
Owner

I think prediction of S(t), H(t), CIF(t) etc. at specific time-points is not an unreasonable request. For example for evaluation measures. We do have this functionality here: https://github.com/adibender/pammtools/blob/master/R/predict.R
but not for the add_* functions.

I'll think about it. Thanks for the input 🙏🏻

@fabian-s
Copy link
Collaborator Author

I think prediction of S(t), H(t), CIF(t) etc. at specific time-points is not an unreasonable request

agree. to enable this, it seems feasible to -- internally, automatically --

  • pad the input data so that it has subsequences of all intervals from $t_0$ up to the user-specified maximal time (i.e, the times at the "sequence_break" indices in my code above)
  • keep track of which time points/rows were added
  • return only the predictions for the user-supplied rows after removing the padded ones ...?

i think this strategy might even have worked with the "pam_fit_wrong" input data in the top example...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants