Analysis scripts and code for our research article: Losilla, M., Luecke, D.M. & Gallant, J.R. The transcriptional correlates of divergent electric organ discharges in Paramormyrops electric fish. BMC Evol Biol 20, 6 (2020). https://doi.org/10.1186/s12862-019-1572-3
This repository contains files with the code we used in our analysis.
The table below serves as a guide to understand the flow of the code. It details the order in which the code was executed, along with a description and comments of each step. Notes are shown in bold text.
Note that a Singularity file is provided in the folder trinity_singularity for running on high-performance computing systems. It allows any user capable of running Singularity images to recreate the exact computing environment used for these analyses, though it is not required.
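As a rough illustration of the pattern used by the paired scripts listed below (one script calls the container, the other runs inside it), the sketch composes a singularity exec call for an inner script. It assumes an image has already been built from the provided Singularity file; the image name trinity.sif is hypothetical, and the actual wrapper scripts may pass additional options.

```shell
#!/bin/sh
# Hypothetical image path: the repository only states that a Singularity
# file is provided in trinity_singularity/, not what the built image is called.
IMAGE="trinity_singularity/trinity.sif"

# Compose the command an outer script would issue to run an inner script
# inside the container.
container_cmd() {
    printf 'singularity exec %s bash %s\n' "$IMAGE" "$1"
}

# Dry run: print the command for inspection instead of executing it.
container_cmd sh_04a_bash.sh

# On a system with Singularity installed, the command itself would be:
#   singularity exec "$IMAGE" bash sh_04a_bash.sh
```

Printing the command first makes it easy to check the image path and script name before submitting a job to a cluster.
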
script/command file | description | comments | additional_outputs (These are provided in the folder named additional_files) |
---|---|---|---|
sh_01_FastQCraw.sh | assess quality of raw reads | ||
sh_02_trim_rename_unzip.sh | trim, rename and unzip reads | ||
sh_03_FastQCtrimmed.sh | assess quality of trimmed reads | ||
The NCBI transcripts file we used as the reference for the align and count steps came from NCBI Paramormyrops kingsleyae Annotation Release 100, based on genome assembly PKINGS_0.1. We downloaded the file rna.fna.gz from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/872/115/GCF_002872115.1_PKINGS_0.1 and removed the sole rRNA transcript present: XR_002837744.1 | |||
cmd_generate_gene_to_trans_file.txt | generate a gene-to-transcript list from the NCBI transcripts file | this list is required by the align and count steps | gene-trans-map.txt |
sh_04a_RSEMindex.sh | index the NCBI transcripts file | calls the singularity container | |
sh_04a_bash.sh | index the NCBI transcripts file | executes commands within the singularity container | |
sh_04b_RSEMperIndiv.sh | align reads to the NCBI transcripts file and count reads per gene | calls the singularity container | |
sh_04b_bash.sh | align reads to the NCBI transcripts file and count reads per gene | executes commands within the singularity container | |
sh_04c_matrices.sh | build gene expression matrices | calls the singularity container | |
sh_04c_bash.sh | build gene expression matrices | executes commands within the singularity container | |
At this point, the gene expression matrices (RSEM.gene.counts.matrix and RSEM.gene.TMM.counts.matrix) use gene names and symbols from the NCBI transcriptome. However, EntrezGeneIDs are preferred for downstream analyses, so we converted the gene names and symbols to Pkings EntrezGeneIDs with the R code below. The converted files were saved under the original file names; the original files were first renamed to: <original name>_ORIG_gene_symbols | |||
translate_gene_IDs.Rmd | | This code runs on the renamed files | Dic.PkingEntrezGeneID-to-name_symbol_type.txt |
sh_05a_DE_analyses.sh | | calls the singularity container | |
sh_05a_bash_DE_genes.sh | | uses the samples.txt file | |
Clustering_of_DEG_mean.Rmd | | generates black & white and colored plots for Set B genes (these plots served informational purposes) | |
generate_suppl_files_DEG_comparisons_and_groups.Rmd | generate the supplemental files with the details of the
|
||
sh_06_blastp.sh | blast P. kingsleyae proteins to D. rerio proteins | output is split into 7 files, we merged all to one file afterwards | |
Annotation_wrangling.Rmd | For each ontology, generate two 'dictionaries':
|
Files from 2) were not used in later scripts; they served as references |
|
enrichment_on_Pkings _all_10_DGE_comparisons.Rmd |
|
||
enrichment_on_Pkings_clusters.Rmd |
|
||
set_C.Rmd |
|
The outputs are:
|
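
The reference-preparation step noted in the table (removing the sole rRNA transcript, XR_002837744.1, from the NCBI transcripts file) can be sketched with awk as follows. This is a minimal illustration, not necessarily the command we ran: the toy FASTA, its XM_ accessions, and the output file name are all made up for the example.

```shell
#!/bin/sh
# Toy FASTA standing in for the decompressed rna.fna. The XM_ accessions are
# invented for illustration; only XR_002837744.1 comes from the README.
cat > demo_rna.fna <<'EOF'
>XM_000000001.1 PREDICTED: toy transcript A
ACGTACGT
>XR_002837744.1 PREDICTED: toy rRNA record
GGGGCCCC
TTTT
>XM_000000002.1 PREDICTED: toy transcript B
TTGACA
EOF

# Drop the record whose accession (first header token, minus ">") equals id.
awk -v id="XR_002837744.1" '
    /^>/ { keep = (substr($1, 2) != id) }   # decide once per record, at the header line
    keep                                     # print header and sequence lines of kept records
' demo_rna.fna > demo_rna.no_rRNA.fna

# Count the remaining records (prints 2 for the toy input).
grep -c '^>' demo_rna.no_rRNA.fna
```

For the real file, the input would first be decompressed, for example with gunzip -c rna.fna.gz > rna.fna, and the same awk filter applied to rna.fna.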