Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

no mitochondrial genes are included in loom file (bug fixed, velocyto team please check this) #318

Closed
hyjforesight opened this issue Nov 12, 2021 · 19 comments

Comments

@hyjforesight
Copy link

hyjforesight commented Nov 12, 2021

Hello velocyto,
I transfer 10X bam file into loom file by velocyto.py and use Scanpy to do cell clustering. However, when doing the data processing by Scanpy, I found there are no mito genes included in loom file, see below.
image

There is a similar question raised here (velocyto-team/velocyto.R#96). I understand that "mito genes do not have introns, so no velocyto-type analysis can be done on these genes.----carlosf79". But the data processing by Scanpy is the pre-operation before RNA velocity analysis.

In my understanding, transferring bam file to loom file by velocyto.py is just re-doing the alignment to the reference genome by including the information of spliced and unspliced mRNAs.

If no mito genes were included in loom file, does it mean that the method to make loom file by velocyto is wrong, becasue the strategy of velocyto.py seems like "we check whether the genes have spliced and unspliced mRNAs, if not, removes the genes", rather than "we align the genes, and then check whether the genes have spliced or unspliced mRNAs, if not, mark this gene's spliced and unspliced mRNAs=0". If the first strategy is the truth, it means that velocyto.py removes thousands of genes because only 15-25% genes' unspliced RNAs can be detected in scRNA-seq, which means this strategy is a little biased.
So velocyto generated loom file cannot be used for Scanpy??

Thanks!
Best,
YJ

@hyjforesight
Copy link
Author

Hello Velocyto,
When running velocyto.py, it shows:
"WARNING - The .bam file refers to a chromosome 'MT+' not present in the annotation (.gtf) file"
"WARNING - The .bam file refers to a chromosome 'MT-' not present in the annotation (.gtf) file"

BAM file is from 10X generated by Cell Ranger v3.1. Reference genome is refdata-gex-mm10-2020-A.tar.gz from 10X. Repeat annotation file is mm10_rmsk.gtf from UCSC.

Is it because cell ranger v3.1 is too old and doesn't have the same chromosome name as refdata-gex-mm10-2020-A.tar.gz and mm10_rmsk.gtf? But when I check the chromosome names in BAM, I found it is ChrM, the same as refdata-gex-mm10-2020-A.tar.gz and mm10_rmsk.gtf? Then why the hint shows the .bam file refers to a chromosome 'MT-'?

Here is the code to check chromosome names in bam.
$ samtools view /home/hyjforesight/mybam.bam | cut -f3 | perl -ne 'print unless $seen{$_}++' -
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY

@hyjforesight
Copy link
Author

hyjforesight commented Nov 18, 2021

Hello velocyto,
I solved this bug.

  1. it's not caused by samtools. Please check the response from samtools.
    The .bam file refers to a chromosome 'MT+' not present in the annotation (.gtf) file. samtools/samtools#1545
  2. it's probably caused by velocyto.py using chrMT to search chrM when doing alignment. chrM is the default name of mito chromosome in bam generated by 10X cell ranger by aligning fastq to the reference genome.
  3. Solution: regenerate bam by cell ranger count pipeline, change all naming of chrM to chrMT in the reference genome folder, including genome.fa, genome.fa.fai, gene.gtf, chrName.txt, chrNamelength.txt files. I did all these renaming by VS Code. By these means, the bam file will include chrMT name. See below.
    捕获
  4. Use this new bam file for velocyto.py. Also change chrM to chrMT in rmsk.gtf. The new loom will include all infomation as 10X 3 file outputs did.
    image
    image
  5. Enjoy the new loom file for downstream analysis. scVelo on the same UMAP of Scanpy.

Because most people cannot do cell ranger, is it possible to fix this bug in the next release of velocyto.py?
Thanks!
Best,
YJ

@hyjforesight hyjforesight changed the title no mitochondrial genes are included in loom file no mitochondrial genes are included in loom file (bug fixed, velocyto team please check this) Nov 18, 2021
@denvercal1234GitHub
Copy link

denvercal1234GitHub commented Nov 20, 2021

@hyjforesight --- Thank you so much for looking into this! Hopefully @gioelelm will have some time to recommend a way to do this more seamlessly. 🙌

My cellsorted_possorted and possorted bam and gtf files indeed have chrM instead of chrMT

samtools view cellsorted_possorted_genome_bam.bam | cut -f3 | perl -ne 'print unless $seen{$_}++' -

(1) Did you just "find and replace", use some script, or bioSyntax extension in VS Code to change the naming of chrM to chrMT in the reference genome folder, including genome.fa, genome.fa.fai, genes.gtf, chrName.txt, chrNamelength.txt files? No change to genes.pickle nor other files in the /refdata-gex-GRCh38-2020-A/star directory?

(2) How did you change chrM to chrMT in rmsk.gtf? I guess it will be similar to how it is done in (1).

(3) Subsequently, I understand we will use the name-changed files to redo the cellranger count as "we should start from the fastq file to generate new bam file" with these new gtfs (velocyto-team/velocyto.R#96). To do this, did you simply use the --transcriptome=/opt/refdata-gex-GRCh38-2020-A \ with this new folder, and nothing needing to be done with the new name-changed masked gtf in cellranger, just needing to input this new masked gtf to run10x?

Ref: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count

Sorry for basic questions. Thank you so much for pointing me here in my issue #320 (I still have issue of it not finding the index file for the cellsorted bam yet it still generated the loom file --- #321 🥵).

@denvercal1234GitHub
Copy link

Updated:

For (1) --- Use VS Code to Find and Replace. And, yes, there is nothing "chrM" in any other files, e.g., genes.pickle etc.

For (2) --- Do similarly as (1)

@hyjforesight
Copy link
Author

hyjforesight commented Nov 21, 2021

Hi @denvercal1234GitHub,
for (3), cellranger 6.1.2 outputs have no outs/analys folder, which is required for run10x command in velocyto.py. so i use this command velocyto run -b filtered_barcodes.tsv -o output_path -m repeat_msk_srt.gtf possorted_genome_bam.bam mm10_annotation.gtf with the new bam file, and chrM replaced gtf.
BTW, the new bam file is important. Start from fastq. Run cell ranger count pipeline by yourself (need 90GB memory and 200GB free disk). Get new bam. Then velocyto.py.
Thanks!
Best,
YJ

@denvercal1234GitHub
Copy link

Thank you very much, @hyjforesight! Did you do some additional filtering of the barcodes.tsv? Because usually cellranger will only name it barcodes.tsv, and not filtered_barcodes.

Given the memory limitation, I needed to samtools sort the possorted_.. bam file first before running velocyto run. Do you by chance know if I can still use the index file (bam.bai) that came originally with the possorted_ bam file? Or I will need to index the manually sorted bam?

@hyjforesight
Copy link
Author

hyjforesight commented Nov 21, 2021

@denvercal1234GitHub
filtered matrices files are at \SC_RNA_COUNTER_CS\SC_MULTI_CORE\MULTI_GEM_WELL_PROCESSOR\COUNT_GEM_WELL_PROCESSOR_BASIC_SC_RNA_COUNTER\FILTER_BARCODES\fork0\files\filtered_matrices_mex
check the velocyto.py tutorial. no need to sort Cell Ranger generated bam file. it's already position sorted. just change the name to possorted_yourbamname.bam
The index file is not needed for velocyto.py

@denvercal1234GitHub
Copy link

Humm... I think the bam file from cellranger is actually only sorted by position, and velocyto then sorts it by cell (CB). That is why it renames it to cellsorted_...bam. That is why if memory is an issue, then it is recommended to cellsorted first then run run10x.

@hyjforesight
Copy link
Author

hyjforesight commented Nov 22, 2021

hello @denvercal1234GitHub
Yor are right. cellranger did position sorting. the CB sorting will be done by velocyto. we use $ velotcyto run command, which requires the filtered barcodes.tsv, not $ velocyto run10x command. the CB sorting will be done internally according to this barcodes.tsv.
Check the velocyto run --help
-b, --bcfile FILE Valid barcodes file, to filter the bam. If --bcfile is not specified all the cell barcodes will be included.
Cell barcodes should be specified in the bcfile as the CB tag for each read

To make it sure, let me try CB sorting individually and let you know whether the results are consistent.

BTW, 16GB memory is enough for velocyto.py

@hyjforesight
Copy link
Author

hyjforesight commented Nov 22, 2021

Updata: individual cell sorting generates errors. so no need to do this step. cell sorting will be done internally by $ velocyto run
image

@hyjforesight
Copy link
Author

hyjforesight commented Dec 2, 2021

hope velocyto team can fix this BUG in the next release, just change searching chrMT to search chrM.

@AY-LIANG
Copy link

Hi,
I rename all chM to chMT in the reference genome folder,and run cellranger count to get a new bam file.
After velocyto,I use scanpy to analyse the loom file,but the gene expression does not match the 10x results well.
There is still some chromosome not found in gtf(refdata-gex-GRCh38-2020-A,download from 10x Genomics ) during velocyto

2022-03-22 11:18:44,717 - WARNING - The .bam file refers to a chromosome 'KI270442.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,717 - WARNING - The .bam file refers to a chromosome 'KI270442.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,725 - DEBUG - Marking up chromosome KI270729.1
2022-03-22 11:18:44,725 - WARNING - The .bam file refers to a chromosome 'KI270729.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,725 - WARNING - The .bam file refers to a chromosome 'KI270729.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,726 - DEBUG - Marking up chromosome GL000225.1
2022-03-22 11:18:44,726 - WARNING - The .bam file refers to a chromosome 'GL000225.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,726 - WARNING - The .bam file refers to a chromosome 'GL000225.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,729 - DEBUG - Marking up chromosome KI270743.1
2022-03-22 11:18:44,729 - WARNING - The .bam file refers to a chromosome 'KI270743.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,729 - WARNING - The .bam file refers to a chromosome 'KI270743.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,731 - DEBUG - Marking up chromosome GL000009.2
2022-03-22 11:18:44,731 - WARNING - The .bam file refers to a chromosome 'GL000009.2+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,733 - DEBUG - Marking up chromosome KI270722.1
2022-03-22 11:18:44,734 - WARNING - The .bam file refers to a chromosome 'KI270722.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,734 - WARNING - The .bam file refers to a chromosome 'KI270722.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,734 - DEBUG - Marking up chromosome GL000194.1
2022-03-22 11:18:44,734 - WARNING - The .bam file refers to a chromosome 'GL000194.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,834 - DEBUG - Marking up chromosome KI270742.1
2022-03-22 11:18:44,834 - WARNING - The .bam file refers to a chromosome 'KI270742.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,834 - WARNING - The .bam file refers to a chromosome 'KI270742.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,837 - DEBUG - Marking up chromosome GL000205.2
2022-03-22 11:18:44,837 - WARNING - The .bam file refers to a chromosome 'GL000205.2+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,877 - DEBUG - Marking up chromosome GL000195.1
2022-03-22 11:18:44,901 - DEBUG - Marking up chromosome KI270736.1
2022-03-22 11:18:44,901 - WARNING - The .bam file refers to a chromosome 'KI270736.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,901 - WARNING - The .bam file refers to a chromosome 'KI270736.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,902 - DEBUG - Marking up chromosome KI270733.1
2022-03-22 11:18:44,902 - WARNING - The .bam file refers to a chromosome 'KI270733.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,902 - WARNING - The .bam file refers to a chromosome 'KI270733.1-' not present in the annotation (.gtf) file

Do you have any suggestion?

@denvercal1234GitHub
Copy link

denvercal1234GitHub commented Mar 22, 2022

@AY-LIANG I am not sure, but between which that the gene expression did not match? What do you meant by 10X results? You meant between run and run10X?

@AY-LIANG
Copy link

@denvercal1234GitHub
10x result is the output of CellRanger,which is, these three files
barcodes.tsv.gz features.tsv.gz matrix.mtx.gz

I use scanpy to call the highest expression genes,and there are some differences between the loom file and CellRanger result.

It seems that some gene information is missing during velocyto.I guess these warnings might be the reason.

@hyjforesight
Copy link
Author

hi @AY-LIANG
we can omit these GL KL chromosomes. it's useless.
velocyto removes some counts, check this theislab/scvelo#813
I don't use loom file for Scanpy. I use 10X matrics. when I need unspliced and spliced information. I just copy them from loom to the matrix.

ldata = scv.read(filename.loom, cache=True)
adata = scv.utils.merge(adata, ldata)

@AY-LIANG
Copy link

@hyjforesight
Thanks!It's very helpful

@0106WeiWeiDeng
Copy link

Acoording to this comment, and cellranger has been updated for serveral years, incongruences between .gtf and .bam are now removed, thus this issue may be solved by deleting two line of code of counter.py:
if chrom == "M":
chrom = "MT"
https://github.com/velocyto-team/velocyto.py/blob/0963dd2d/velocyto/counter.py#L282
https://github.com/velocyto-team/velocyto.py/blob/0963dd2d/velocyto/counter.py#L283

@hernandezvargash
Copy link

thank you @0106WeiWeiDeng I just tested your suggestion (deleting the two lines of code in counter.py) and the velocyto run seems to be fine, at least I have mitochondrial genes again.

@shuailinli
Copy link

Hi, I rename all chM to chMT in the reference genome folder,and run cellranger count to get a new bam file. After velocyto,I use scanpy to analyse the loom file,but the gene expression does not match the 10x results well. There is still some chromosome not found in gtf(refdata-gex-GRCh38-2020-A,download from 10x Genomics ) during velocyto

2022-03-22 11:18:44,717 - WARNING - The .bam file refers to a chromosome 'KI270442.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,717 - WARNING - The .bam file refers to a chromosome 'KI270442.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,725 - DEBUG - Marking up chromosome KI270729.1
2022-03-22 11:18:44,725 - WARNING - The .bam file refers to a chromosome 'KI270729.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,725 - WARNING - The .bam file refers to a chromosome 'KI270729.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,726 - DEBUG - Marking up chromosome GL000225.1
2022-03-22 11:18:44,726 - WARNING - The .bam file refers to a chromosome 'GL000225.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,726 - WARNING - The .bam file refers to a chromosome 'GL000225.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,729 - DEBUG - Marking up chromosome KI270743.1
2022-03-22 11:18:44,729 - WARNING - The .bam file refers to a chromosome 'KI270743.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,729 - WARNING - The .bam file refers to a chromosome 'KI270743.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,731 - DEBUG - Marking up chromosome GL000009.2
2022-03-22 11:18:44,731 - WARNING - The .bam file refers to a chromosome 'GL000009.2+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,733 - DEBUG - Marking up chromosome KI270722.1
2022-03-22 11:18:44,734 - WARNING - The .bam file refers to a chromosome 'KI270722.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,734 - WARNING - The .bam file refers to a chromosome 'KI270722.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,734 - DEBUG - Marking up chromosome GL000194.1
2022-03-22 11:18:44,734 - WARNING - The .bam file refers to a chromosome 'GL000194.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,834 - DEBUG - Marking up chromosome KI270742.1
2022-03-22 11:18:44,834 - WARNING - The .bam file refers to a chromosome 'KI270742.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,834 - WARNING - The .bam file refers to a chromosome 'KI270742.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,837 - DEBUG - Marking up chromosome GL000205.2
2022-03-22 11:18:44,837 - WARNING - The .bam file refers to a chromosome 'GL000205.2+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,877 - DEBUG - Marking up chromosome GL000195.1
2022-03-22 11:18:44,901 - DEBUG - Marking up chromosome KI270736.1
2022-03-22 11:18:44,901 - WARNING - The .bam file refers to a chromosome 'KI270736.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,901 - WARNING - The .bam file refers to a chromosome 'KI270736.1-' not present in the annotation (.gtf) file
2022-03-22 11:18:44,902 - DEBUG - Marking up chromosome KI270733.1
2022-03-22 11:18:44,902 - WARNING - The .bam file refers to a chromosome 'KI270733.1+' not present in the annotation (.gtf) file
2022-03-22 11:18:44,902 - WARNING - The .bam file refers to a chromosome 'KI270733.1-' not present in the annotation (.gtf) file

Do you have any suggestion?

Hi @AY-LIANG, I know this post is quite old, but I just want to add that this warning happens because the code for masking will check if every read in bam file and see if the chromosome it is mapped is in the gtf file or not. In the case of reads mapped to regions like GL000194.1, it is in fasta file, but not gtf file, so it will generate the warning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants