aggregate_cells takes too long #110

Open
MaximilianNuber opened this issue Jun 27, 2024 · 2 comments
Comments

@MaximilianNuber

Dear Dr. Mangiola,

Thank you for the very nice package. I am working with large-scale single-cell RNA-seq data and want to use tidySingleCellExperiment.
I discovered that aggregate_cells takes much longer than aggregateAcrossCells.

As I am usually working on a server, I recreated the problem with a 225k cell dataset on my laptop:
https://cellxgene.cziscience.com/e/dea717d4-7bc0-4e46-950f-fd7e1cc8df7d.cxg/

require(tidySingleCellExperiment)
require(tidySummarizedExperiment)
#setwd("/Users/maximiliannuber/Documents/CSAMA_2024")
sce <- readr::read_rds("Seurat_kidney.rds")
sce <- as.SingleCellExperiment(sce)

aggregateAcrossCells runs fast:

system.time(pbulk <- aggregateAcrossCells(sce, ids = colData(sce)[, c("donor_id", "cell_type")]))
 user  system elapsed 
 11.690   2.481  16.056 

This code ran for a very long time, and I interrupted it after about 10 minutes.

system.time(pbulk <- aggregateAcrossCells(sce, ids = colData(sce)[, c("donor_id", "cell_type")]))

I looked at this with Michael Love, and we found this may be an issue with the combination of donor and cell type.
This code took just a few seconds:

system.time(
    pbulk <- sce %>%
        aggregate_cells(cell_type, assays="counts")
)
 user  system elapsed 
 10.164   2.333  13.953 

Thank you for any help!

output of sessionInfo:

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.2.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidySummarizedExperiment_1.14.0 ttservice_0.4.1                
 [3] tidyr_1.3.1                     tidySingleCellExperiment_1.14.0
 [5] muscData_1.18.0                 ExperimentHub_2.12.0           
 [7] AnnotationHub_3.12.0            BiocFileCache_2.12.0           
 [9] dbplyr_2.5.0                    rpx_2.12.0                     
[11] edgeR_4.2.0                     stringr_1.5.1                  
[13] pheatmap_1.0.12                 celldex_1.14.0                 
[15] SingleR_2.6.0                   igraph_2.0.3                   
[17] GGally_2.2.1                    NewWave_1.14.0                 
[19] scry_1.16.0                     scDblFinder_1.18.0             
[21] scran_1.32.0                    scater_1.32.0                  
[23] ggplot2_3.5.1                   EnsDb.Hsapiens.v86_2.99.0      
[25] ensembldb_2.28.0                AnnotationFilter_1.28.0        
[27] GenomicFeatures_1.56.0          AnnotationDbi_1.66.0           
[29] scuttle_1.14.0                  DropletUtils_1.24.0            
[31] SingleCellExperiment_1.26.0     SummarizedExperiment_1.34.0    
[33] GenomicRanges_1.56.0            GenomeInfoDb_1.40.0            
[35] IRanges_2.38.0                  S4Vectors_0.42.0               
[37] MatrixGenerics_1.16.0           matrixStats_1.3.0              
[39] DropletTestFiles_1.14.0         dplyr_1.1.4                    
[41] limma_3.60.3                    RcppSpdlog_0.0.17              
[43] Seurat_5.0.3                    cellxgene.census_1.14.1        
[45] SeuratObject_5.0.1              sp_2.1-4                       
[47] GEOquery_2.72.0                 Biobase_2.64.0                 
[49] BiocGenerics_0.50.0            

loaded via a namespace (and not attached):
  [1] R.methodsS3_1.8.2         vroom_1.6.5               RcppCCTZ_0.2.12          
  [4] spdl_0.0.5                goftest_1.2-3             Biostrings_2.72.1        
  [7] HDF5Array_1.32.0          vctrs_0.6.5               spatstat.random_3.2-3    
 [10] digest_0.6.35             png_0.1-8                 aws.signature_0.6.0      
 [13] gypsum_1.0.1              tiledb_0.27.0             ggrepel_0.9.5            
 [16] deldir_2.0-4              parallelly_1.37.1         MASS_7.3-60.2            
 [19] reshape2_1.4.4            httpuv_1.6.15             withr_3.0.0              
 [22] xfun_0.43                 aws.s3_0.3.21             ellipsis_0.3.2           
 [25] survival_3.5-8            memoise_2.0.1             ggbeeswarm_0.7.2         
 [28] zoo_1.8-12                pbapply_1.7-2             R.oo_1.26.0              
 [31] KEGGREST_1.44.1           promises_1.3.0            httr_1.4.7               
 [34] restfulr_0.0.15           globals_0.16.3            fitdistrplus_1.1-11      
 [37] rhdf5filters_1.16.0       ps_1.7.6                  rhdf5_2.48.0             
 [40] rstudioapi_0.16.0         nanotime_0.3.7            UCSC.utils_1.0.0         
 [43] miniUI_0.1.1.1            generics_0.1.3            processx_3.8.4           
 [46] base64enc_0.1-3           curl_5.2.1                zlibbioc_1.50.0          
 [49] ScaledMatrix_1.12.0       polyclip_1.10-6           glmpca_0.2.0             
 [52] GenomeInfoDbData_1.2.12   SparseArray_1.4.3         desc_1.4.3               
 [55] xtable_1.8-4              evaluate_0.23             S4Arrays_1.4.0           
 [58] hms_1.1.3                 irlba_2.3.5.1             colorspace_2.1-0         
 [61] filelock_1.0.3            ROCR_1.0-11               reticulate_1.36.1        
 [64] spatstat.data_3.0-4       magrittr_2.0.3            lmtest_0.9-40            
 [67] readr_2.1.5               nanoarrow_0.4.0.1         later_1.3.2              
 [70] viridis_0.6.5             lattice_0.22-6            spatstat.geom_3.2-9      
 [73] future.apply_1.11.2       scattermore_1.2           XML_3.99-0.16.1          
 [76] triebeard_0.4.1           cowplot_1.1.3             RcppAnnoy_0.0.22         
 [79] pillar_1.9.0              nlme_3.1-164              sna_2.7-2                
 [82] compiler_4.4.0            beachmat_2.20.0           RSpectra_0.16-1          
 [85] stringi_1.8.3             tensor_1.5                GenomicAlignments_1.40.0 
 [88] plyr_1.8.9                crayon_1.5.2              abind_1.4-5              
 [91] BiocIO_1.14.0             locfit_1.5-9.9            bit_4.0.5                
 [94] codetools_0.2-20          BiocSingular_1.20.0       alabaster.ranges_1.4.1   
 [97] plotly_4.10.4             mime_0.12                 intergraph_2.0-4         
[100] splines_4.4.0             Rcpp_1.0.12               fastDummies_1.7.3        
[103] sparseMatrixStats_1.16.0  knitr_1.46                blob_1.2.4               
[106] utf8_1.2.4                BiocVersion_3.19.1        fs_1.6.4                 
[109] listenv_0.9.1             DelayedMatrixStats_1.26.0 pkgbuild_1.4.4           
[112] tibble_3.2.1              Matrix_1.7-0              callr_3.7.6              
[115] statmod_1.5.0             tzdb_0.4.0                network_1.18.2           
[118] pkgconfig_2.0.3           tools_4.4.0               cachem_1.0.8             
[121] RSQLite_2.3.7             viridisLite_0.4.2         DBI_1.2.2                
[124] fastmap_1.1.1             rmarkdown_2.26            scales_1.3.0             
[127] grid_4.4.0                ica_1.0-3                 Rsamtools_2.20.0         
[130] coda_0.19-4.1             patchwork_1.2.0           ggstats_0.6.0            
[133] BiocManager_1.30.23       dotCall64_1.1-1           alabaster.schemas_1.4.0  
[136] RANN_2.6.1                farver_2.1.1              yaml_2.3.8               
[139] rtracklayer_1.64.0        cli_3.6.2                 purrr_1.0.2              
[142] leiden_0.4.3.1            lifecycle_1.0.4           uwot_0.2.2               
[145] arrow_16.1.0              bluster_1.14.0            BiocParallel_1.38.0      
[148] gtable_0.3.5              rjson_0.2.21              ggridges_0.5.6           
[151] progressr_0.14.0          parallel_4.4.0            jsonlite_1.8.8           
[154] RcppHNSW_0.6.0            bitops_1.0-7              bit64_4.0.5              
[157] assertthat_0.2.1          xgboost_1.7.7.1           Rtsne_0.17               
[160] alabaster.matrix_1.4.1    spatstat.utils_3.0-4      BiocNeighbors_1.22.0     
[163] urltools_1.7.3            alabaster.se_1.4.1        metapod_1.12.0           
[166] dqrng_0.3.2               R.utils_2.12.3            alabaster.base_1.4.1     
[169] lazyeval_0.2.2            shiny_1.8.1.1             htmltools_0.5.8.1        
[172] sctransform_0.4.1         rappdirs_0.3.3            glue_1.7.0               
[175] spam_2.10-0               httr2_1.0.1               XVector_0.44.0           
[178] RCurl_1.98-1.14           gridExtra_2.3             tiledbsoma_1.11.1        
[181] R6_2.5.1                  DESeq2_1.44.0             labeling_0.4.3           
[184] SharedObject_1.18.0       cluster_2.1.6             pkgload_1.3.4            
[187] Rhdf5lib_1.26.0           statnet.common_4.9.0      DelayedArray_0.30.1      
[190] tidyselect_1.2.1          vipor_0.4.7               ProtGenerics_1.36.0      
[193] xml2_1.3.6                future_1.33.2             rsvd_1.0.5               
[196] munsell_0.5.1             KernSmooth_2.23-22        data.table_1.15.4        
[199] htmlwidgets_1.6.4         RColorBrewer_1.1-3        rlang_1.1.3              
[202] spatstat.sparse_3.0-3     spatstat.explore_3.2-7    remotes_2.5.0            
[205] fansi_1.0.6               beeswarm_0.4.0    

Thanks!

@MaximilianNuber
Author

My apologies. I copied the wrong chunk for the actual example.
The following chunk takes longer than 10 minutes:

system.time(pbulk <- sce %>% 
        aggregate_cells(c(donor_id, cell_type), assays="counts"))
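
One way to probe the hypothesis that the slowdown comes from grouping on two columns is to collapse donor and cell type into a single grouping column first. The sketch below is untested against the kidney dataset and uses a small simulated object. The toy data, the column name donor_cell_type and the "___" separator are assumptions made for illustration, and only scuttle::aggregateAcrossCells is exercised here.

```r
library(SingleCellExperiment)
library(scuttle)

# Hypothetical toy data standing in for the real 225k-cell object
set.seed(1)
counts <- matrix(rpois(2000, lambda = 5), nrow = 20, ncol = 100)
rownames(counts) <- paste0("gene", seq_len(20))
colnames(counts) <- paste0("cell", seq_len(100))

sce <- SingleCellExperiment(
    assays = list(counts = counts),
    colData = DataFrame(
        donor_id = sample(c("d1", "d2", "d3"), 100, replace = TRUE),
        cell_type = sample(c("T", "B"), 100, replace = TRUE)
    )
)

# Collapse the two grouping variables into one column (assumed name)
sce$donor_cell_type <- paste(sce$donor_id, sce$cell_type, sep = "___")

# Reference: group on both columns, as in the fast call from the report
pbulk_df <- aggregateAcrossCells(
    sce, ids = colData(sce)[, c("donor_id", "cell_type")]
)

# Same grouping expressed through the single combined column
pbulk_combined <- aggregateAcrossCells(sce, ids = sce$donor_cell_type)
```

On the real object, the combined column could then be passed as a single grouping variable, for example aggregate_cells(donor_cell_type, assays="counts"), and timed against the two-column call. Whether this actually avoids the slowdown is not confirmed in this thread.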

@stemangiola
Owner

Hello @MaximilianNuber, yes, we plan to identify and solve these efficiency issues. We will soon have dedicated people working on it, but feel free to propose a PR; that would help a lot.
