Very long running time for survival on a static regime #9
Hi Oleg,

Sorry it's taking a long time to run. I'm not sure how much I can help right now. It seems like even if we speed up ltmle (without SuperLearner), once you start calling SuperLearner, I would guess that SuperLearner is going to take most of the time. You're going to call SuperLearner at least 5 times per time point (3 C nodes, 1 A node, 1 or more LY nodes), so I would think that 85+ calls to SuperLearner with n = 50k and 95 columns is going to be much slower than all the rest of the ltmle code.

Nonetheless, if you want to try to speed up ltmle, you can just take out CleanData (assuming your data already conforms; it does if you're not seeing the "Note: for internal purposes, all nodes after a censoring event..." message). CleanData takes about 16 minutes on my (old) computer. It looks to me like ConvertCensoringNodesToBinary takes less than one second; I don't know why the profiler is saying it's taking a lot of time. XMatch gets called a lot, so I'm not sure there's an easy fix there.

Here's the code I used to try to replicate the speed issues:

```r
set.seed(1)
Anodes <- grep("^A", names(data))
data <- ltmle:::ConvertCensoringNodes(data, Cnodes, has.deterministic.functions = F)
```

Josh
Hi Josh,

Thanks for a thoughtful reply. ConvertCensoringNodesToBinary is definitely a big bottleneck on my dataset, so it's clearly something specific to the data I am working with. I will try to simulate this scenario to see if I can replicate the profiler results from the actual data.

A somewhat unrelated note: memoise is applied to a glm object, which stores tons of unnecessary information (including the entire dataset). The only thing that is needed, if I understand it correctly, is the result of predict.glm for a given design matrix, which is just a vector. Wouldn't it be possible to wrap glm and predict into one function that returns the prediction vector and memoise that instead?

Also, a question: how easy is it to parallelize SuperLearner? Same thing for the ltmle package; in MSM estimation, for example, how easy do you think it would be to parallelize estimation for each survival time point? I have access to a server with a lot of cores, so parallelizing could give a big boost in performance.

Thanks,
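The fit-and-predict wrapper suggested above could look something like the sketch below. This is not ltmle's actual implementation, just a minimal illustration of the idea using the memoise package; `FitAndPredict` is a hypothetical helper, and note that memoise still hashes the arguments (including the data frame) to build its cache key, so this mainly saves memory rather than hashing time.

```r
library(memoise)

# Hypothetical helper: fit a glm and return only the prediction vector,
# so the memoise cache never stores the full glm object (which would
# otherwise retain a copy of the entire dataset).
FitAndPredict <- function(formula, data, newdata) {
  fit <- glm(formula, data = data, family = binomial())
  as.vector(predict(fit, newdata = newdata, type = "response"))
}

# Memoised version: repeated calls with identical arguments return the
# cached numeric vector without refitting the model.
MemoFitAndPredict <- memoise(FitAndPredict)
```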
Hi Oleg,

Memoise doesn't apply to ltmle, only to ltmleMSM. But if you're using ltmleMSM, I agree that the memoise section is not well written; it's just a temporary hack. I'm planning on removing memoise entirely in a future release, since it shouldn't be needed once I rewrite a few other functions to reuse g.

SuperLearner has some parallelized versions, mcSuperLearner and snowSuperLearner; see ?SuperLearner. I haven't used them, but it looks like you could make a minor change to ltmle:::Estimate to have them called.

Parallelizing ltmleMSM would take a little work, but it is doable. You'd want to parallelize the final.Ynodes loop in MainCalcs (if using the pooled MSM) or NonpooledMSM (if not). But I would guess that if you get all of the available cores working on SuperLearner, that's going to be 90% of the speed benefit.

I haven't used the ltmle package on datasets as large as yours, so I'm glad you're trying it out and identifying things to improve.

Thanks,
Josh
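A rough sketch of the two parallelization routes mentioned above (exact ltmle internals may differ): mcSuperLearner distributes the candidate learners with parallel::mclapply on forked workers, and an outer loop over survival time points can be parallelized the same way. Here `Y`, `X`, `final.Ynodes`, and `RunOneTimePoint` are placeholders, and `RunOneTimePoint` is a hypothetical per-time-point wrapper.

```r
library(parallel)
library(SuperLearner)

# Parallel SuperLearner across cores (forking; Unix-like systems only).
options(mc.cores = detectCores())
fit <- mcSuperLearner(Y = Y, X = X, family = binomial(),
                      SL.library = c("SL.glm", "SL.mean"))

# Sketch of parallelizing over final Y nodes, one estimation per survival
# time point; RunOneTimePoint is a hypothetical wrapper around ltmleMSM.
results <- mclapply(final.Ynodes, RunOneTimePoint, mc.cores = detectCores())
```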
ltmle() takes too much time to run for end of follow-up time-point survival on a static regime (no SuperLearner). Full data run takes about 80min and 20GB of RAM. It appears most of the time is spent in ConvertCensoringNodesToBinary(), CleanData() and XMatch() functions and the running time is approximately linear in N (5K subsample takes 8min). My goal is to eventually run MSMs on survival at each of 17 time points with SuperLearning, which would take too long with current run times. Coding censoring variables as factors (per documentation) or as binaries has no effect on performance.
Any advice on what could be causing this slowdown and how to fix it? Unfortunately, I can't share the data, but I would be happy to run any tests or give more details. See below for a detailed description of the problem and some ltmle.R profiling results.
Thanks,
Oleg
Data:
50K observations (N), 17 time points (t), 60 baseline covariates (W), 35 time-dep covariates (L_t), time-dep treatment (A_t), 3 types of censoring (Ct_1,Ct_2,Ct_3), survival outcome (Y_t)
Modeling:
Models Q_t and g_t are specified for each time point (one Q_t model for each LY block), both depend only on baseline and previous time-point covariates
Running ltmle:
Running the ltmle() function with a static regime, abar=(1,...,1) and no stratification, setstrat=FALSE.
Profiling ltmle.R by line on a subsample of 5K observations:
(8 min run time)
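Line-level profiling like the above can be gathered with base R's Rprof (a generic sketch; the output filename and the ltmle() arguments are placeholders, and line numbers are only recorded for code sourced with keep.source = TRUE):

```r
# Enable line-level profiling before the slow call.
Rprof("ltmle_profile.out", line.profiling = TRUE)
result <- ltmle(data, Anodes = Anodes, Cnodes = Cnodes,
                Lnodes = Lnodes, Ynodes = Ynodes, abar = abar)
Rprof(NULL)  # stop profiling

# Summarize time spent per source line.
summaryRprof("ltmle_profile.out", lines = "show")
```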