Demultiplexing multiomic sequencing data #95

Open
DHelix opened this issue Apr 1, 2024 · 2 comments

DHelix commented Apr 1, 2024

Hi @huangyh09,

First of all, huge thanks for developing Vireo!
I've been testing it using a synthetic pool (3 donors), and I've noticed a high number of unassigned cells, particularly from one donor, based on scRNA-seq data alone. I found a potential solution by combining scRNA and scATAC data to increase the coverage, described in this #39 (comment):
"... you can use bcftools concat if you have *cells.vcf.gz (by using --genotype in cellsnp-lite). Alternatively, you may try combining the sparse matrices directly."

So I tried:

  1. Ran cellsnp-lite on scRNA and scATAC data separately, with --genotype
  2. Sorted and indexed the two cellSNP.cells.vcf.gz files, generated in Step 1:
# scRNA
bcftools sort \
-m 2G \
-o ./scRNA/cellSNP.cells.vcf.sort.gz \
-O z9 \
-T TMP_DIR \
--write-index \
./scRNA/cellSNP.cells.vcf.gz

# scATAC
bcftools sort \
-m 2G \
-o ./scATAC/cellSNP.cells.vcf.sort.gz \
-O z9 \
-T TMP_DIR \
--write-index \
./scATAC/cellSNP.cells.vcf.gz
  3. Concatenated the two cellSNP.cells.vcf.sort.gz files:
bcftools concat \
--allow-overlaps \
-o ./scRNA_scATAC/cellSNP.cells.vcf.gz \
-O z9 \
--threads 32 \
./scRNA/cellSNP.cells.vcf.sort.gz \
./scATAC/cellSNP.cells.vcf.sort.gz
  4. Ran Vireo on the concatenated cellSNP.cells.vcf.gz file:
vireo \
-c ./scRNA_scATAC/cellSNP.cells.vcf.gz \
-N 3 \
-o ./scRNA_scATAC/sd1 \
--randSeed=1 \
-p 16

When I ran Vireo separately on the scRNA and scATAC data (providing the cellsnp-lite output folders, rather than the cellSNP.cells.vcf.gz files), it worked well and usually finished in < 20 mins. However, when I demultiplexed using the combined cellSNP.cells.vcf.gz file, it ran for several hours and finally got the following error:

[vireo] Loading cell VCF file ...
[vireo] Demultiplex 18491 cells to 3 donors with 908898 variants.
Traceback (most recent call last):
  File "/projects/Installs/python_virtualenv/vireo/bin/vireo", line 8, in <module>
    sys.exit(main())
  File "/projects/Installs/python_virtualenv/vireo/lib/python3.7/site-packages/vireoSNP/vireo.py", line 209, in main
    nproc=options.nproc)
  File "/projects/Installs/python_virtualenv/vireo/lib/python3.7/site-packages/vireoSNP/utils/vireo_wrap.py", line 76, in vireo_wrap
    pool = multiprocessing.Pool(processes = nproc)
  File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/multiprocessing/context.py", line 117, in Pool
    from .pool import Pool
  File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/multiprocessing/pool.py", line 17, in <module>
    import queue
  File "/linux-x86_64-centos7/python-3.7.2/lib/python3.7/queue.py", line 16, in <module>
    from _queue import Empty
ImportError: /linux-x86_64-centos7/python-3.7.2/lib/python3.7/lib-dynload/_queue.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object: Cannot allocate memory

I'm hoping you could give me some suggestions:

  1. Did I do it correctly?
  2. Could you please provide more details on "Alternatively, you may try combining the sparse matrices directly"?
  3. What's the best approach to combine scRNA and scATAC for demultiplexing?
  4. Do you think combining scRNA and scATAC data can also improve doublet detection?

Thanks a lot for your time!

DHelix commented Apr 3, 2024

Hi,
It seems that the cellSNP.cells.vcf.gz file generated by concatenating the scRNA and scATAC cellSNP.cells.vcf.gz files with bcftools concat is too large (740M).
I wonder if it's possible to generate the cellSNP.tag.AD.mtx, cellSNP.tag.DP.mtx, cellSNP.base.vcf.gz, and cellSNP.samples.tsv files from the cellSNP.cells.vcf.gz file?
Thanks!
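
A minimal sketch of one way to rebuild sparse AD/DP counts from a cell-level VCF, using only the Python standard library. It assumes each record's FORMAT column contains `AD` and `DP` and that each holds a single integer per cell; check this against your own file. `vcf_to_triplets` and `write_mtx` are hypothetical helper names, not part of cellsnp-lite or vireoSNP, and whether Vireo treats the rebuilt files exactly like native cellsnp-lite output is untested here.

```python
import io

def vcf_to_triplets(lines):
    """Parse cell-level VCF text lines into sparse AD/DP triplets.

    Returns (variants, barcodes, ad, dp); ad and dp map
    (variant_index, cell_index) -> count and hold non-zero entries only.
    """
    variants, barcodes, ad, dp = [], [], {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                        # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            barcodes = fields[9:]           # sample columns are cell barcodes
            continue
        keys = fields[8].split(":")         # per-record FORMAT, e.g. GT:AD:DP
        i_ad, i_dp = keys.index("AD"), keys.index("DP")
        row = len(variants)
        variants.append((fields[0], int(fields[1]), fields[3], fields[4]))
        for col, cell in enumerate(fields[9:]):
            values = cell.split(":")
            if len(values) <= max(i_ad, i_dp):
                continue                    # missing genotype such as "."
            for idx, store in ((i_ad, ad), (i_dp, dp)):
                if values[idx] not in (".", "0", ""):
                    store[(row, col)] = int(values[idx])
    return variants, barcodes, ad, dp

def write_mtx(handle, triplets, n_rows, n_cols):
    """Write triplets in Matrix Market coordinate format (1-based indices)."""
    handle.write("%%MatrixMarket matrix coordinate integer general\n")
    handle.write(f"{n_rows} {n_cols} {len(triplets)}\n")
    for (row, col), value in sorted(triplets.items()):
        handle.write(f"{row + 1} {col + 1} {value}\n")

# Tiny made-up VCF for illustration; for a real file, pass the handle from
# gzip.open("cellSNP.cells.vcf.gz", "rt") instead (needs `import gzip`).
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tcellA\tcellB\n"
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT:AD:DP\t0/1:2:5\t.:.:.\n"
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT:AD:DP\t.:.:.\t1/1:3:3\n"
)
variants, barcodes, ad, dp = vcf_to_triplets(example)
ad_mtx = io.StringIO()
write_mtx(ad_mtx, ad, len(variants), len(barcodes))
print(ad_mtx.getvalue())
```

In this sketch `barcodes` plays the role of cellSNP.samples.tsv (one barcode per line) and `variants` the site list of cellSNP.base.vcf.gz. The dictionaries store only non-zero counts, but for hundreds of thousands of variants they can still grow large, so a streaming writer may be preferable on real data.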

huangyh09 (Collaborator) commented

Hi, it looks like after concatenating, you got 908898 SNPs, which is quite a lot.
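
To put that SNP count in perspective, the sketch below does the arithmetic for a single dense variants-by-cells array at the dimensions reported in the Vireo log from the first comment (908898 variants, 18491 cells). Whether Vireo actually materialises dense arrays of this shape is an assumption made only for illustration, and `dense_bytes` is a hypothetical helper, not part of vireoSNP.

```python
def dense_bytes(n_variants, n_cells, bytes_per_value=8):
    """Bytes needed for one dense variants-by-cells array (float64 by default)."""
    return n_variants * n_cells * bytes_per_value

# Dimensions taken from the "[vireo] Demultiplex ..." log line.
n_variants, n_cells = 908898, 18491

one_matrix = dense_bytes(n_variants, n_cells)
print(f"one dense float64 matrix: {one_matrix / 2**30:.1f} GiB")
# AD and DP together would need twice that if both were densified.
print(f"AD + DP: {2 * one_matrix / 2**30:.1f} GiB")
```

A single such array is on the order of 125 GiB, so any dense intermediate at this size could plausibly explain the `Cannot allocate memory` error, particularly with 16 worker processes requested.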

If your scATAC data has better coverage, you may consider demultiplexing with scATAC alone. The donor genotypes inferred there can also be used as input for demultiplexing the scRNA data if needed.

In either case, I have never tested these approaches; the suggestions are based only on experience in other settings, so your results may differ.

Yuanhua
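
On the "combining the sparse matrices directly" suggestion quoted in the first comment, here is a rough sketch of what such a merge could look like. It assumes the same physical cell carries the same barcode string in both modalities and that summing allele counts across scRNA and scATAC is acceptable for the same cell; neither assumption is confirmed in this thread. `merge_counts` is a hypothetical helper, and the counts use the same `(variant_index, cell_index) -> count` layout one would get from reading a coordinate-format `.mtx` file.

```python
def merge_counts(variants_a, barcodes_a, counts_a,
                 variants_b, barcodes_b, counts_b):
    """Merge two sparse count tables onto the union of variants and barcodes.

    Each counts_* maps (variant_index, cell_index) -> count in its own
    index space. Counts for the same variant in the same cell are summed.
    """
    # Tuples sort lexicographically, so "chr10" comes before "chr2".
    variants = sorted(set(variants_a) | set(variants_b))
    barcodes = sorted(set(barcodes_a) | set(barcodes_b))
    row_of = {v: i for i, v in enumerate(variants)}
    col_of = {b: j for j, b in enumerate(barcodes)}
    merged = {}
    for var_list, bc_list, counts in (
        (variants_a, barcodes_a, counts_a),
        (variants_b, barcodes_b, counts_b),
    ):
        for (row, col), value in counts.items():
            key = (row_of[var_list[row]], col_of[bc_list[col]])
            merged[key] = merged.get(key, 0) + value
    return variants, barcodes, merged

# Made-up toy data: one variant (chr1:200) is shared between modalities.
rna_variants = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
atac_variants = [("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
cells = ["AAAC-1", "AAAG-1"]
rna_ad = {(0, 0): 2, (1, 1): 1}
atac_ad = {(0, 1): 4, (1, 0): 3}

all_variants, all_cells, merged_ad = merge_counts(
    rna_variants, cells, rna_ad, atac_variants, cells, atac_ad
)
print(all_variants)
print(merged_ad)
```

The same merge would need to be applied to both the AD and DP tables so that they stay on identical row and column orderings. With SciPy available, `scipy.io.mmread` and the `scipy.sparse` matrix types would be the more usual tools for real data.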
