Skip to content

Analysis scripts and code for Paramormyrops RNA-seq project

Notifications You must be signed in to change notification settings

msuefishlab/paramormyrops_rnaseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

paramormyrops_rnaseq

Analysis scripts and code for our research article: Losilla, M., Luecke, D.M. & Gallant, J.R. The transcriptional correlates of divergent electric organ discharges in Paramormyrops electric fish. BMC Evol Biol 20, 6 (2020). https://doi.org/10.1186/s12862-019-1572-3

This repository contains files with the code we used in our analysis.

The table below serves as a guide to understand the flow of the code. It details the order in which the code was executed, along with a description and comments of each step. Notes are shown in bold text.

Note: that a Singularity file is provided in the folder trinity_singularity to run on high performance computing systems. This would allow any user capable of running Singularity images to recreate the exact computing environment used for these analyses, though it is not required.

script/command file description comments additional_outputs (These are provided in the folder named additional_files)
sh_01_FastQCraw.sh assess quality of raw reads
sh_02_trim_rename_unzip.sh trim, rename and unzip reads
sh_03_FastQCtrimmed.sh assess quality of trimmed reads
The NCBI transcripts file we used as reference for the align and count steps was from: NCBI Paramormyrops kingsleyae Annotation Release 100, based on genome assembly PKINGS_0.1. We downloaded the transcripts file from here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/872/115/GCF_002872115.1_PKINGS_0.1 We used the file called: rna.fna.gz, and removed the sole rRNA transcript present: XR_002837744.1
cmd_generate_gene_to_trans_file.txt generate a gene-to-transcript list from the NCBI transcripts file this list is required by the align and count steps gene-trans-map.txt
sh_04a_RSEMindex.sh Index the NCBI transcripts file calls the singularity container
sh_04a_bash.sh Index the NCBI transcripts file executes commands within the singularity container
sh_04b_RSEMperIndiv.sh Aligns reads to NCBI transcripts file and counts reads per gene calls the singularity container
sh_04b_bash.sh Aligns reads to NCBI transcripts file and counts reads per gene executes commands within the singularity container
sh_04c_matrices.sh Build gene expression matrices calls the singularity container
sh_04c_bash.sh Build gene expression matrices executes commands within the singularity container
At this point the gene expression matrices (RSEM.gene.counts.matrix and RSEM.gene.TMM.counts.matrix ) use gene names and symbols from the NCBI transcriptome. However, EntrezGeneIDs are preferred for downstream analyses. Therefore, I converted their gene names and symbols to Pkings EntrezGeneIDs with the next R code. The converted files were assigned to the original file names. The original files were first renamed to: <orginal name>_ORIG_gene_symbols
translate_gene_IDs.Rmd
  1. Replace gene names and symbols with EntrezGeneIDs in the gene expression matrices
  2. generate a file with the columns Pking EntrezGeneID, gene name, gene symbol and type of gene for each of the predicted 27610 P. kingsleyae genes. This file is named Dic.PkingEntrezGeneID-to-name_symbol_type.txt
This code runs on the renamed files Dic.PkingEntrezGeneID-to-name_symbol_type.txt
sh_05a_DE_analyses.sh
  1. Data exploration - Correlation matrix, PCA
  2. DGE and MA plots - all 10 possible pairwise OTU comparisons
calls the singularity container
sh_05a_bash_DE_genes.sh
  1. Data exploration - Correlation matrix, PCA
  2. DGE and MA plots - all 10 possible pairwise OTU comparisons
  • executes commands within the singularity container
  • We modified 2) to use the function estimateDisp() instead of the functions estimateCommonDisp() and estimateTagwiseDisp()
  • uses the samples.txt file
    Clustering_of_DEG_mean.Rmd
    1. For each phenotype pair, extract the genes that meet the expression filters (Set B groups)
    2. plot expression patterns of the genes in each group from 1)
    generates black & white and colored plots for Set B genes (These plots served informational purposes)
    generate_suppl_files_DEG_comparisons_and_groups.Rmd generate the supplemental files with the details of the
    1. 10 DGE comparisons and
    2. Set B groups
    sh_06_blastp.sh blast P. kingsleyae proteins to D. rerio proteins output is split into 7 files, we merged all to one file afterwards
    Annotation_wrangling.Rmd For each ontology, generate two 'dictionaries':
    1. Pking Entrez Gene IDs to D. rerio GO IDs
    2. D. rerio GO IDs to GO terms
    Files from 2) were not used in later scripts, they served as references
    1. Dic.PkingEntrezGeneID-to-GO.{ontology}.txt
    2. Dic.{ontology}.GOid_to_term.txt
    enrichment_on_Pkings _all_10_DGE_comparisons.Rmd
    1. GO enrichment on all 10 DGE comparisons
    2. Horizontal bar plot significant GO terms
  • Xcel file from 1) is part of the supplementary files
  • This code also produces a file with information on each upregulated gene annotated to enriched GO terms, including how many GO terms the gene was annotated to for a given upregulated list and ontology (frequency). The file served informational purposes
  • enrichment_on_Pkings_clusters.Rmd
    1. GO enrichment on Set B groups
    2. Horizontal bar plot significant GO terms
  • Xcel file from 1) is part of the supplementary files
  • the horizontal bar plot from 2) served informational purposes)
  • This code also produces a file with information on each upregulated gene annotated to enriched GO terms, including how many GO terms the gene was annotated to for a given upregulated list and ontology (frequency). The file served informational purposes
  • set_C.Rmd
    1. Intersect upregulated genes from Sets A' and B (these intersected genes are Set C)
    2. GO enrichment on Set C genes
    3. plot expression patterns of Set C genes
    4. Horizontal bar plot significant GO terms
    The outputs are:
    1. one file per list of upregulated genes
    2. one file per list of enriched GO terms
    3. Xcel file with upregulated genes (consolidation of output 1)
    4. Xcel file with enriched GO terms (consolidation of output 2)
    5. Xcel file with information on each upregulated gene annotated to enriched GO terms, including how many GO terms the gene was annotated to for a given upregulated list and ontology (frequency). The file served informational purposes
    6. Color plots for Set C genes expression patterns
    7. Horizontal bar plot with enriched GO terms
  • Outputs 3) and 4) are part of the supplemental files
  • Outputs 6) and 7) make up Figs. 4-6
  • About

    Analysis scripts and code for Paramormyrops RNA-seq project

    Resources

    Stars

    Watchers

    Forks

    Releases

    No releases published

    Packages

    No packages published

    Languages