Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

10x Multiome RNA vs ATAC data from same cells confidently disagrees on donor identities #70

Open
koefoeden opened this issue Aug 26, 2022 · 11 comments

Comments

@koefoeden
Copy link

koefoeden commented Aug 26, 2022

Hi - thanks for a very cool tool!

So I'm trying to demultiplex a 10x Multiome run, and tried to demultiplex the data using either the ATAC or the RNA bam file. The issue is that they give completely different donor identities, while both being quite confident that they got it right. Because we sequenced it with HTO's as well, we know however that it is the ATAC which is right. Can you help me uncover what is going on here??

Please see the figures below, where I only focused on cells that were confidently called, using ATAC, HTO or RNA data.

image

image

Best, Thomas

@koefoeden
Copy link
Author

I tried merely rerunning the RNA-demultiplexing with the same result, and another time using --minMAF = 0 instead of 0.1 in cellsnp-lite - giving very different but still non-concordant results. My main concern is how it seems confident in these seemingly wrongly labeled cells?

Using the --minMAF 0.1 I get a median of 1k vars for the barcodes using the ATAC data, and 1.7k vars using the RNA data.

@huangyh09
Copy link
Collaborator

Thanks Thomas for sharing the experience. Agreed that the results on the RNA module are (probably) wrong. One potential reason is the SNPs are used, as you tried. Using --minMAF = 0 is probably not ideal and --minMAF 0.1 (or even a bit higher, e.g., 0.15) would be a better threshold to identify SNPs with different genotypes between donors. Generally, ATAC may cover the whole gene body while RNA only covers the 3' or 5' of the transcript so may lose a proportion of informative SNPs.

Nonetheless, the RNA module usually works well. Another potential issue could local optima, so you may consider trying a large number of initializations, e.g., -M 200.

Yuanhua

@ollieeknight
Copy link

@koefoeden, did you ever figure out what's going on here, and optimise it to work for multiome RNA? Thanks

@koefoeden
Copy link
Author

@ollieeknight Unfortunately I haven't investigated it further -but will just start using it in a "larger scale" in the coming week, so will likely get some new insight. Let's keep in touch! Are you also using the Multiome kit?

@ollieeknight
Copy link

Yep, I also found out my issue. I was running cellbender on each modality separately, to basically find the venn diagram of high quality of each modality and where they intersect - this was messing up the order of barcodes and once I set the order everything was much better, and works perfectly for both RNA and ATAC. hope this helps you?

@koefoeden
Copy link
Author

@ollieeknight Hmm - we haven't used cellbender or any other kind of pre-processing - so should not be the case. Also relevant to you @huangyh09, we just tried to demultiplex 8 additional 10x Genomics Multiome runs, and are again experiencing that the ATAC and GEX data confidently disagree on donor identities, using now default --minMAF values. We will likely try with a higher number of initializations - but shouldn't any disagreement due to local optima be reflected in the reported probabilities?

@ollieeknight
Copy link

damn, sorry that you couldn't use the same fix. could you post the full command you're using for cellsnp-lite and vireo, for both GEx and ATAC?

@huangyh09
Copy link
Collaborator

@ollieeknight Hmm - we haven't used cellbender or any other kind of pre-processing - so should not be the case. Also relevant to you @huangyh09, we just tried to demultiplex 8 additional 10x Genomics Multiome runs, and are again experiencing that the ATAC and GEX data confidently disagree on donor identities, using now default --minMAF values. We will likely try with a higher number of initializations - but shouldn't any disagreement due to local optima be reflected in the reported probabilities?

Hi @koefoeden can you also provide the coverage information, e.g., distribution of n_SNPs per cells, in both scRNA and scATAC. Do you see that the discordant cells have lower coverage in one of the modularity?

Thanks both for keeping discussing and troubleshooting this.

Yuanhua

@ghuls
Copy link
Contributor

ghuls commented Aug 21, 2023

It might help to limit your VCF file for RNA to gene regions only (exons+introns) to avoid mutations out of gene regions having a negative impact on patient identification (as you shouldn't have reads there).

# Create BED file which contains the whole gene regions for all genes in CellRangerARC index.
zcat /genomes/10xgenomics/CellRangerARC/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/genes/genes.gtf.gz \
  | awk -F '\t' '$3 == "gene"' \
  | bedtools merge \
  > refdata-cellranger-arc-GRCh38-2020-A-2.0.0-gene-regions.bed

# Both BED files are the same, so we only need one to generate a 1000 Genomes file resticted to gene regions.
bcftools view \
    -R refdata-cellranger-arc-GRCh38-2020-A-2.0.0-gene-regions.bed \
    -O z -o 1000g_hg38_high_cov_snps_af_0.1_to_0.9.gene_regions_only.hg38.vcf.gz \
    1000g_hg38_high_cov_snps_af_0.1_to_0.9.hg38.vcf.gz

@Pancreas-Pratik
Copy link

@koefoeden were you able to figure this out?

@ollieeknight is there a tutorial for how you used cellblender together with vireo in your comment above (#70 (comment))? If there isn't could you at the very least just link the tutorials you "pieced" together, please?

I have 10X multiome data where the scATAC-seq component has the correct indices documented for the respective sample names, however this is, unfortunately, not the case for the scRNA-seq component. I am hoping to use the scATAC-seq component as a reference in some way to help identify what the correct sample names should be for the scRNA-seq component.

Any insights or clues to help me get up to speed would be appreciated.
Thank you very much in advance.

@ollieeknight
Copy link

@Pancreas-Pratik

For all multiome data, I now run cellranger multi for the RNA component, and cellranger-atac for the ATAC component separately. Here is what I do for cellbender/cellsnp-lite/vireo

For scRNA-seq outputs from cellranger multi

cellbender remove-background \
        --cuda \
        --input ${feature_matrix_path} \
        --output ${project_outs}/${library}/cellbender/output.h5

cellsnp-lite \
        -s ${project_outs}/${library}/outs/per_sample_outs/${library}/count/sample_alignments.bam \
        -b ${project_outs}/${library}/cellbender/output_cell_barcodes.csv \
        -O ${project_outs}/${library}/vireo \
        -R /opt/SNP/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz \
        --minMAF 0.1 \
        --minCOUNT 20 \
        --gzip \
        -p $(nproc)

vireo \
        -c ${project_outs}/${library}/vireo \
        -o ${project_outs}/${library}/vireo \
        -N $n_donors \
        -p $(nproc)

When I then analyse this in R, you need to change the barcode names, which I do with

libraries <- readRDS('llist_of_RNA_seurat_objects.rds')

barcode_mapping <- data.frame(
  gene_expression_barcodes = readLines('~/group/work/bin/cellranger-arc-2.0.2/lib/python/cellranger/barcodes/737K-arc-v1.txt.gz'),
  atac_barcodes = readLines('~/group/work/bin/cellranger-arc-2.0.2/lib/python/atac/barcodes/737K-arc-v1.txt.gz')
)

# We need to correct barcodes, as different ones are used for ATAC and GEX in DOGMA/MULTIOME
for (i in 1:length(libraries)) {
  libraries[[i]] <- RenameCells(libraries[[i]], new.names = (colnames(libraries[[i]]) %>% gsub(pattern = '-1', replacement = '')))
  
  new_names <- barcode_mapping$atac_barcodes[
    match(
      colnames(libraries[[i]]),
      barcode_mapping$gene_expression_barcodes
    )
  ]
  libraries[[i]] <- RenameCells(libraries[[i]], new.names = paste0(new_names, '_', i))
}

rm(barcode_mapping, new_names)

Hope this is helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants