Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nuclear RNA #11

Open
lpantano opened this issue Jul 19, 2019 · 20 comments
Open

nuclear RNA #11

lpantano opened this issue Jul 19, 2019 · 20 comments

Comments

@lpantano
Copy link

Hi,

I am trying the tool for the quantification of nuclear RNA, after my recent tweets with Lior Patcher.

Every commands in the Velocity tutorial seems to work, but I want only the quantification by gene taking into account reads in introns and exons, so I am really don't want the outputs in the velocity tutorial.

I imagine that something like this is the solution after doing correct and sort:

bustools count -o counts -g t2g.txt -e matrix.ec -t transcripts.txt --genecounts output.sort.bus

but I want to make sure that t2g.txt is the correct. I noticed the annotation for introns is TRANSCRIPT.N-I meanwhile the exons is TRANSCRIPT.N. If the t2g.txt file is lie this:

ENST00000456328.1       ENSG00000223972 DDX11L1
ENST00000450305.2       ENSG00000223972 DDX11L1
ENST00000488147.3       ENSG00000227232 WASH7P

It would recognized introns and exons, or I need to add the intron annotation to the file?

Thanks!

@Munfred
Copy link

Munfred commented Jul 19, 2019

The cDNA_introns.t2g.txt file provided in the tutorial does have both intons and exons in the same txt file. You can take a look at the beginning and end of the file by doing:

$ wget https://github.com/BUStools/getting_started/releases/download/velocity_tutorial/cDNA_introns.t2g.txt.gz

$ gunzip cDNA_introns.t2g.txt.gz

$ head cDNA_introns.t2g.txt
ENST00000456328.1       ENSG00000223972 DDX11L1
ENST00000450305.2       ENSG00000223972 DDX11L1
ENST00000488147.3       ENSG00000227232 WASH7P
ENST00000619216.4       ENSG00000278267 MIR6859-1
ENST00000473358.5       ENSG00000243485 MIR1302-2HG
ENST00000469289.6       ENSG00000243485 MIR1302-2HG
ENST00000607096.7       ENSG00000284332 MIR1302-2
ENST00000417324.8       ENSG00000237613 FAM138A
ENST00000461467.9       ENSG00000237613 FAM138A
ENST00000606857.10      ENSG00000268020 OR4G4P

$ tail cDNA_introns.t2g.txt
ENST00000463325.1056070-I       ENSG00000184319 RPL23AP82
ENST00000463325.1056071-I       ENSG00000184319 RPL23AP82
ENST00000494075.1056072-I       ENSG00000184319 RPL23AP82
ENST00000494075.1056073-I       ENSG00000184319 RPL23AP82
ENST00000494075.1056074-I       ENSG00000184319 RPL23AP82
ENST00000423888.1056075-I       ENSG00000184319 RPL23AP82
ENST00000423888.1056076-I       ENSG00000184319 RPL23AP82
ENST00000423888.1056077-I       ENSG00000184319 RPL23AP82
ENST00000480246.1056078-I       ENSG00000184319 RPL23AP82
ENST00000480246.1056079-I       ENSG00000184319 RPL23AP82

@lpantano
Copy link
Author

Thanks, so this command should be ok?:

bustools count -o genes/ --genecounts -g cDNA_introns.t2g.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

@Munfred
Copy link

Munfred commented Jul 23, 2019

Yes.

Just so you know, there is small issue right now in that bustools will not create the subfolder if you give it a multi folder output (e.g. if you want the output on ./sample2/genecounts) so you need to make sure the folder exist in that case. If you'd like to do that, this is how I'd make sure the folder gets created prior to running the command:

mkdir -p ./sample2/genecounts

bustools count \
--output  ./sample2/genecounts/genes \
--genecounts \
--genemap /references/homo_sapiens-ensembl-96/transcripts_to_genes.txt \
--ecmap ./sample2/matrix.ec \
--txnames  ./sample2/transcripts.txt \
./sample2/output.corrected.sorted.bus

@lpantano
Copy link
Author

Thank you!

Sadly I am having a segmentation fault (with the conda and source code compiled by myself) only for that command, this is the debug output, I don't know if that helps.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x000000000043f9e0 in std::vector<int, std::allocator<int> >::begin (this=0x2aabd9af73f8) at /usr/include/c++/4.8.2/bits/stl_vector.h:548
548           { return const_iterator(this->_M_impl._M_start); }
(gdb) bt
#0  0x000000000043f9e0 in std::vector<int, std::allocator<int> >::begin (this=0x2aabd9af73f8) at /usr/include/c++/4.8.2/bits/stl_vector.h:548
#1  0x000000000045cb53 in intersect_genes_of_ecs (ecs=std::vector of length 1, capacity 100 = {...}, ec2genes=std::vector of length 12317869, capacity 16777216 = {...},
    glist=std::vector of length 0, capacity 100) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/Common.cpp:174
#2  0x0000000000450ddb in __lambda1::operator() (__closure=0x7fffffffbdb0, v=std::vector of length 6345915, capacity 6400000 = {...})
    at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_count.cpp:177
#3  0x0000000000451a87 in bustools_count (opt=...) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_count.cpp:241
#4  0x000000000043dd35 in main (argc=12, argv=0x7fffffffd868) at /home/lpantano/om2-pilm/conda3/envs/bustools/share/bustools/src/bustools_main.cpp:1072

The output folder has an empty MTX file.

the other count commands work, so is not this step specifically.

I will try the test data to make sure is not my sample, but I really want to have this working with my data, let me know if I can give you any more information.

If it is the cpp library version, can you tell me the one you have that is working so I can try to compile versus that particular version?

thanks!

@Munfred
Copy link

Munfred commented Jul 23, 2019 via email

@lpantano
Copy link
Author

Thank you.

files exist and they are not empty:

4.7G    matrix.ec
1.5G    output.correct.sort.bus
30M     transcripts.txt
58M  cDNA_introns.t2g.txt

output folder exists. this is the command I used:

bustools count -o genes/genes --genecounts -g cDNA_introns.t2g.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

it takes a while to give the error, like 20 min or so. There is an empty file at genes/genes.mtx.

I tried the with the example data: bus_output_06 and the same error appeared. I used only a pair of files when aligning to speed up computation:

06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz 06/10X_17_029_MissingLibrary_1_HL
73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

I tried in a different cluster machines, and the same seg.fault appeared:

Program received signal SIGSEGV, Segmentation fault.
0x000055555557682b in intersect_genes_of_ecs(std::vector<int, std::allocator<int> > const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<int, std::allocator<int> >&) ()

I know it is a difficult error to catch up, I am happy to help in any way possible.

cheers

@Munfred
Copy link

Munfred commented Jul 25, 2019 via email

@lpantano
Copy link
Author

Thank you! Here are the files:

wget https://lpantano-debug.s3.amazonaws.com/bustools/bustools
wget https://lpantano-debug.s3.amazonaws.com/bustools/cDNA_introns.t2g.txt
wget https://lpantano-debug.s3.amazonaws.com/bustools/matrix.ec
wget https://lpantano-debug.s3.amazonaws.com/bustools/output.correct.sort.bus
wget https://lpantano-debug.s3.amazonaws.com/bustools/transcripts.txt

I installed bustools with conda.

thanks so much

@Munfred
Copy link

Munfred commented Jul 26, 2019

How did you generate this bus file? I converted the 2.6GB bus file you provided into text to inspect it and it expanded to 205GB.

Each entry that should correspond to a UMI has a length of 2573 charachters, and it repeats every 32 characters. I have attached the first 5 lines of the file for you to understand.

If you can provide the detailed steps you used to generate this file we can look further into the issue.

nuclear_rna_first_5_lines.txt

@lpantano
Copy link
Author

lpantano commented Jul 26, 2019 via email

@lpantano
Copy link
Author

lpantano commented Jul 26, 2019 via email

@Munfred
Copy link

Munfred commented Jul 26, 2019 via email

@lpantano
Copy link
Author

yes, the bus file from my sample looks the same than that. Used the same command I posted before to produce the bus file, with kallisto.

@Munfred
Copy link

Munfred commented Jul 26, 2019 via email

@lpantano
Copy link
Author

Sure.

These are the commands to create the bus file from the bam files:

bamtofastq --reads-per-fastq=500000000 10X_17_029.bam ./06

kallisto bus -i cDNA_introns.idx -o bus_output_06/ -x 10xv2 -t 4 06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R1_001.fastq.gz 06/10X_17_029_MissingLibrary_1_HL73JBCXY/bamtofastq_S1_L002_R2_001.fastq.gz

bustools correct -w ../../10xv2_whitelist.txt -p output.bus | bustools sort -o output.correct.sort.bus -t 4 -

The other sample I have is not generated with bamtofastq, they were generated with cellranger mkfasta and the bus file is the same.

R1:

@D00456:228:HL73JBCXY:2:1111:15821:53600 1:N:0:0
CTAGCCCTAACCCTAACCCTAACCCT
+
.GAGA<GGGGAGGGGGGGGAAA.A<G
@D00456:228:HL73JBCXY:2:1213:6403:92501 1:N:0:0
AACCCTAACCCTAACCCTAACCCTAA
+
GGGGAGGGGGAGGAGGGGGG..AGAG
@D00456:228:HL73JBCXY:2:2106:9266:5974 1:N:0:0
ACGTCAAAGCAGCCTCCACAATGTGG

R2:

@D00456:228:HL73JBCXY:2:1111:15821:53600 3:N:0:0
GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGCTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGGATAGGGTTCGGTCTAGGGATAGG
+
AGGAAA..GG.<<.G<A.G<G..<G<G<GAGG..<AGAG<..<<A..<<GG.<<GGA<..<<<..<...<..<<.<<AA...<...<.<...<<....
@D00456:228:HL73JBCXY:2:1213:6403:92501 3:N:0:0
GGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGAGTAGGGATGGGGAGCGGGGAGGGGAGAGG
+
GA<G.AAA<GGGAG<AAGGGI<..<GGGGGAAAG..<.<.<.<AGAG.GGG.<.GAG..G<<<...<<....<<......<.<.<.......<..<..
@D00456:228:HL73JBCXY:2:2106:9266:5974 3:N:0:0
GCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAAGGCTGACGGCAAGTTAACGAAAAGAAAAATGGTGAATGATACCCGGTGC

In a last note, the velocity tutorial works if I use bustools capture and bustools count to get the two type of matrices. This is only happening when I want just to do bustools count with the matrix.ec file firectly since I want to quantify all reads into genes. (https://www.kallistobus.tools/velocity_tutorial.html)

thanks!

@Munfred
Copy link

Munfred commented Jul 26, 2019 via email

@lpantano
Copy link
Author

lpantano commented Jul 26, 2019 via email

@Munfred
Copy link

Munfred commented Jul 26, 2019

Ok we're very puzzled as to how the bus file with the 2573 characters in the UMI have been made, and ran out of ideas.

Can you subsample 100k reads from R1 and R2 so that we can try to reproduce the problematic bus file? seqtk is an easy way to subsample them: https://github.com/lh3/seqtk

@lpantano
Copy link
Author

lpantano commented Jul 27, 2019 via email

@lpantano
Copy link
Author

lpantano commented Aug 1, 2019

Hi,

I uploaded the BUS file before sorting and correction that it seems right. Do you still want the FASTQ files?

https://lpantano-debug.s3.amazonaws.com/bustools/output.bus

As I posted last week, the weird bus appears after the correct command.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants