sRNAfrag: FAQ

Ken Nakatsu edited this page Jul 11, 2023 · 13 revisions

FAQ

1. "I am getting a chromosome not found error when trying to add sequences or while running the pipeline"

To speed up BedTools sequence extraction, we assume that the number of extracted sequences equals the number of entries in the GTF file. That assumption fails when the GTF file contains chromosomes that are missing from the reference FASTA, which is what triggers this error. We accepted this fragility because the current method is over 1000X faster (the first method was safe but took hours; the current method takes minutes). To fix:

a) Extract the chromosome names from the FASTA file. The list is printed to the console.

python alias_work.py fasta_chr_extract <fasta_file>

b) Filter the GTF file so that it keeps only chromosomes present in the reference genome you are using.

python alias_work.py gtf_chr_select <input_gtf> <output_name> --chr_list=[]
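
The logic behind steps a) and b) can be sketched as follows. This is a minimal illustration, not the actual code in alias_work.py: the function names `fasta_chromosomes` and `filter_gtf` are hypothetical, and the sketch assumes that the chromosome name is the first whitespace-delimited token of each FASTA header and the first tab-separated column of each GTF line.

```python
def fasta_chromosomes(fasta_lines):
    """Collect chromosome names from FASTA header lines (those starting with '>')."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the name is the first token after '>'.
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def filter_gtf(gtf_lines, keep):
    """Yield GTF lines whose first column is in `keep`; comment lines pass through."""
    keep = set(keep)
    for line in gtf_lines:
        if line.startswith("#") or line.split("\t", 1)[0] in keep:
            yield line


# Toy in-memory data (made up for illustration only).
fasta = [">chr1 example header", "ACGT", ">chr2", "GGCC"]
gtf = [
    'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "a";',
    'chrUn\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "b";',
]
chromosomes = fasta_chromosomes(fasta)
kept = list(filter_gtf(gtf, chromosomes))
```

Here the `chrUn` entry is dropped because no FASTA header names it, which is the situation that would otherwise leave a GTF entry without a sequence.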

2. "Why did you call system commands from python, as opposed to using bash shell scripts?"

Much of the bash code used in this pipeline consists of one-liners, so we did not believe separate shell scripts were necessary. However, we are aware of some weaknesses that come with this choice.

3. "Can your pipeline deal with annotation files that have duplicated sequences?"

Yes. One of the preprocessing steps deduplicates sequences and outputs a new GTF file containing only the first occurrence of each. This not only speeds up generation of the lookup table, but also ensures that the multi-mapping correction works correctly: because the lookup table is not contaminated with duplicated sequences, they are not double counted. This is important to consider since we are aligning against a transcriptome. In addition, SAMtools will not work if there are duplicated sequences.
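
The "keep only the first occurrence" idea can be sketched like this. It is an illustration under stated assumptions rather than the pipeline's own implementation: the function name `deduplicate_by_sequence` is hypothetical, records are represented as simple (ID, sequence) pairs, and sequences are compared as exact strings.

```python
def deduplicate_by_sequence(records):
    """Keep only the first (record_id, sequence) pair seen for each sequence."""
    seen = set()
    unique = []
    for record_id, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((record_id, sequence))
    return unique


# Toy records (made up): the second and third entries share a sequence.
records = [
    ("snoRNA_A", "ACGTACGT"),
    ("snoRNA_B", "GGGTTTCC"),
    ("snoRNA_B_copy", "GGGTTTCC"),
]
unique = deduplicate_by_sequence(records)
```

Only `snoRNA_A` and `snoRNA_B` remain, so a fragment derived from the shared sequence can match a single lookup entry instead of two.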

4. "Can I use any of the scripts from the command line?"

Yes. We use the Python Fire package (https://google.github.io/python-fire/guide/) to offer command-line support for all Python functions. For more information, please check out the other parts of our wiki.

5. "What defines an alias?"

In this context, an alias is a sequence that has two IDs associated with it.

6. "Can this pipeline be used for something other than rRNAs, snRNAs, or snoRNAs?"

Absolutely! Recent papers have begun to investigate fragmentation of introns and even lncRNAs. The pipeline will certainly run slower on these, and it has not been tested on them, but I believe it can work. If it fails on these RNAs, please submit an issue with the exact error and I will investigate and adjust the pipeline. The primary error I have solved so far was that FeatureCounts cannot take in so many fragments at once. To work around this, I divide the n fragments into n/8000 subgroups and annotate each subgroup individually.
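
The batching described above can be sketched as follows. This assumes that "n/8000 subgroups" means consecutive groups of at most 8000 fragments each; the helper name `chunk` is hypothetical and the snippet is not taken from the pipeline.

```python
def chunk(items, size=8000):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy fragment IDs (made up): 20001 fragments split into groups of 8000.
fragments = [f"frag_{i}" for i in range(20001)]
groups = list(chunk(fragments))
group_sizes = [len(group) for group in groups]
```

Each group would then be annotated on its own and the results combined afterwards, so no single annotation call sees more than 8000 fragments.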

7. "When I view the merged count file in Excel, why are annotations on new lines?"

Unfortunately, this is an Excel issue that occurs when entries are too long. Viewing the file in Numbers (the Mac program) or importing it into pandas behaves as expected. I recommend importing the file into Python and working with it there for downstream analyses. R should also work.
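
To see which entries are likely to be affected, a small check like the one below can help. It is a sketch under assumptions: it supposes the cause is Excel's limit of 32,767 characters per cell, that the merged count file is comma-delimited, and the function name `long_fields` is hypothetical. It uses only the standard library `csv` module.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # Excel's maximum number of characters per cell.


def long_fields(handle, delimiter=","):
    """Return (row_number, column_number, length) for fields over Excel's limit."""
    csv.field_size_limit(10**9)  # Raise the csv module's default per-field cap.
    hits = []
    reader = csv.reader(handle, delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        for column_number, field in enumerate(row, start=1):
            if len(field) > EXCEL_CELL_LIMIT:
                hits.append((row_number, column_number, len(field)))
    return hits


# Toy file contents (made up): one annotation is 40000 characters long.
sample = io.StringIO("id,annotation\nfrag_1,short\nfrag_2," + "x" * 40000 + "\n")
hits = long_fields(sample)
```

For a real file, pass an open file handle (opened with `newline=""`) instead of the `io.StringIO` sample, and adjust `delimiter` if the file is tab-separated.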