Running Banksy on large Xenium Dataset #39

Open
Alwash-317 opened this issue Sep 18, 2024 · 2 comments

@Alwash-317

Hi,

I’m working with an integrated Xenium dataset consisting of 12 samples, totaling approximately 5.4 million cells. After pre-processing the individual Xenium samples, I merged them into a single Seurat object for downstream analysis. However, I’m encountering issues when trying to run BANKSY due to the large size of the dataset. The R script is as follows:

```r
file_paths   <- c("path_1", "path_2", ..., "path_12")
sample_names <- c("sample_1", "sample_2", ..., "sample_12")

seu_list <- list()

# Read each sample and store its spatial centroids as cell metadata
for (i in seq_along(file_paths)) {
  seu <- readRDS(file_paths[i])
  coords <- seu[[paste0("fov_", sample_names[i])]]$centroids@coords
  seu$sdimx <- coords[, 1]
  seu$sdimy <- coords[, 2]
  seu_list[[i]] <- seu
}

# Merge all samples into a single Seurat object
merged_seu <- Reduce(merge, seu_list)
merged_seu <- JoinLayers(merged_seu)
DefaultAssay(merged_seu) <- "Xenium"

# Run BANKSY on the merged object, computing neighbours per sample (group = 'Sample_ID')
merged_seu <- RunBanksy(
  merged_seu,
  lambda = 0.8,
  assay = 'Xenium',
  slot = 'data',
  features = 'all',
  group = 'Sample_ID',
  dimx = 'sdimx',
  dimy = 'sdimy',
  split.scale = TRUE,
  k_geom = 15
)
```

It crashes at the RunBanksy step with the following error:

```
Error in [.data.table(knn_df, , abs(gcm[, to, drop = FALSE] %*% (weight *  : 
  negative length vectors are not allowed
Calls: RunBanksy ... mapply -> <Anonymous> -> <Anonymous> -> [ -> [.data.table
In addition: Warning message:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 19.3 GiB
Execution halted.
```

I attempted to allocate more memory for the script (up to 800 GB) and monitored memory usage, which did not exceed this limit at the time of the crash. I also used the future package with `options(future.globals.maxSize = 256 * 1024^3)`, but the issue persists.

Given the size of the dataset, are there any computationally less intensive approaches or optimizations you would recommend for running BANKSY on such large datasets? Any suggestions to handle memory usage more efficiently or alternative strategies would be greatly appreciated.

Thank you for your help!

@vipulsinghal02
Collaborator

Hi Alwash, have you tried using highly variable genes (e.g. the 2,000 HVGs Seurat selects by default)? Another optimization is to first reduce to 2,000 HVGs and then further reduce to 100 PCs; this 100-PC by 5.4-million-cell matrix becomes your new feature-by-cell ("gene"-by-cell) matrix, and you run the rest of the pipeline on it as usual.
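
A minimal sketch of that reduction, assuming standard Seurat calls (the assay name `pca_assay` is illustrative, and the RunBanksy arguments simply mirror the script above; adjust to your setup):

```r
# Sketch only: reduce the merged object to 2,000 HVGs and 100 PCs, then run
# BANKSY on the PC-by-cell matrix instead of the full gene-by-cell matrix.
merged_seu <- NormalizeData(merged_seu)
merged_seu <- FindVariableFeatures(merged_seu, nfeatures = 2000)
merged_seu <- ScaleData(merged_seu)
merged_seu <- RunPCA(merged_seu, npcs = 100)

# Wrap the 100 PC embeddings as a new "feature"-by-cell assay
# (the assay name "pca_assay" is illustrative)
pc_mat <- t(Embeddings(merged_seu, reduction = "pca"))
merged_seu[["pca_assay"]] <- CreateAssayObject(data = pc_mat)

merged_seu <- RunBanksy(
  merged_seu,
  lambda = 0.8,
  assay = "pca_assay",
  slot = "data",
  features = "all",
  group = "Sample_ID",
  dimx = "sdimx",
  dimy = "sdimy",
  k_geom = 15
)
```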

Reducing to HVGs and then PCs like this should greatly shrink the matrix BANKSY has to work with and allow the processing to complete. Another idea is to use the BPCells package for on-disk matrices (which Seurat supports; see their pages/vignettes).
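
A rough sketch of the BPCells route, following the general pattern in the BPCells/Seurat documentation (the directory path and the `Xenium_disk` assay name are illustrative; check the BPCells vignette for the exact conversion calls):

```r
# Sketch only: back the counts with an on-disk BPCells matrix so the merged
# object does not hold the full count matrix in memory.
library(BPCells)

counts_mem <- LayerData(merged_seu, assay = "Xenium", layer = "counts")

# Write the counts to a BPCells directory on disk, then reopen them lazily
write_matrix_dir(mat = counts_mem, dir = "xenium_counts_bpcells")
counts_disk <- open_matrix_dir(dir = "xenium_counts_bpcells")

# Store the on-disk matrix in a new assay and continue the pipeline from it
merged_seu[["Xenium_disk"]] <- CreateAssay5Object(counts = counts_disk)
DefaultAssay(merged_seu) <- "Xenium_disk"
merged_seu <- NormalizeData(merged_seu)
```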

Best,
Vipul

vipulsinghal02 reopened this Oct 2, 2024
@vipulsinghal02
Collaborator

Also:

  1. Construct the BANKSY matrix separately for each sample, merge them, and run PCA on the merged matrix (see the sketch after this list).
  2. See this: Potential inconsistency between R and Python Versions Banksy_py#12 (comment)
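
For point 1, a rough sketch of the per-sample approach (this reuses `seu_list` from the original script, assumes each sample is already normalized, and assumes `BANKSY` is the assay name RunBanksy creates by default; the PCA settings are placeholders):

```r
# Sketch only: run BANKSY on each sample separately so the neighbour
# computation never sees all 5.4M cells at once, then merge and run PCA
# on the combined BANKSY matrix.
banksy_list <- lapply(seu_list, function(seu) {
  RunBanksy(
    seu,
    lambda = 0.8,
    assay = "Xenium",
    slot = "data",
    features = "all",
    dimx = "sdimx",
    dimy = "sdimy",
    k_geom = 15
  )
})

merged_banksy <- Reduce(merge, banksy_list)
merged_banksy <- JoinLayers(merged_banksy)   # join per-sample layers if split (Seurat v5)
DefaultAssay(merged_banksy) <- "BANKSY"      # assumed default assay name from RunBanksy

merged_banksy <- ScaleData(merged_banksy, features = rownames(merged_banksy))
merged_banksy <- RunPCA(merged_banksy, features = rownames(merged_banksy), npcs = 30)
```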

Let me know how it goes!
Best,
Vipul
