-
Notifications
You must be signed in to change notification settings - Fork 186
Miscellaneous plotting
We have included a number of visualization routines in the jcvi
package that might be useful to create customized plots. This is a common visualization task in genomics - to see where the various features are.
Tip
You don't actually need test data for this section. However, you can still download the following data and run script here.
Sometimes we wish to show categories for a number of chromosome regions in a genome. Let's say we want to plot the Arabidopsis genome. First, let's make a file that contains the chromosome sizes:
Chr1 30427671
Chr2 19698289
Chr3 23459830
Chr4 18585056
Chr5 26975502
This file can be created from the genome FASTA file and save it in athaliana.sizes
. Now let's add a number of interesting features:
Chr1 1000000 3000000 type A
Chr4 12000000 15000000 type A
Chr2 5000000 10000000 type B
Chr5 20000000 22000000 type B
Chr2 15000000 18000000 LINE/SINE
Chr5 13000000 15000000 LINE/SINE
Chr3 2000000 3500000 LTR
This is known as the bed
format, which must be tab-separated. Note that we use the functional class as the feature name. Since there are 4 classes, there will be 4 colors, one for each color. The command we use is:
python -m jcvi.graphics.chromosome features.bed \
--size=athaliana.sizes --title="*Arabidopsis* genome features" --gauge
What if we want to include centromeres? Append the following lines to features.bed
:
Chr1 14511721 14538721 centromere
Chr2 3611838 3611883 centromere
Chr3 13589756 13589816 centromere
Chr4 3133663 3133674 centromere
Chr5 11194537 11194848 centromere
Run the above command again, now with features indicating where the centromeres are. We have the following output:
Some users may wish to change the labels or the colors of the features. In that case, use a .idmap
file which is another tab-separated file for example athaliana.idmap
:
LINE/SINE LS #377EB8
LTR LT #4DAF4A
type A TA #984EA3
type B TB #FF7F00
Column 2 is the legend label, while column 3 is the desired color. We then pass this file to the command line:
python -m jcvi.graphics.chromosome features.bed athaliana.idmap \
--size=athaliana.sizes --title="*Arabidopsis* genome features" --gauge
JCVI offers capabilities for genome landscape visualization, which is crucial in genome annotation. This includes the ability to create heatmaps that represent genome features alongside genome size. For instance, a user may want to visualize a genome of interest with specific genomic features using a heatmap.
The following steps look daunting, but are simply data preparations. Just bear with me.
First, download the standard GFF annotations for genes and repeats.
# Download the complete Arabidopsis thaliana genome sequence (TAIR10)
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz
# Uncompress the genome sequence file
gunzip TAIR10_chr_all.fas.gz
# Download the gene annotation file in GFF3 format for TAIR10
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff
# Download the repeat annotations from a custom database, generated by EDTA
wget https://raw.githubusercontent.com/wyim-pgl/test_files/main/EDTA/Athaliana_167.fa.mod.EDTA.TEanno.gff3
# Define variables for file paths
REPEAT=Athaliana_167.fa.mod.EDTA.TEanno.gff3
GENOME=TAIR10_GFF3_genes.gff
We would then convert these annotations to BED
format for easier manipulation and visualization.
# Convert specific repeat types from GFF3 to BED format for easier manipulation and visualization
python -m jcvi.formats.gff bed --type helitron --key=ID $REPEAT -o Helitron.bed
python -m jcvi.formats.gff bed --type hAT_TIR_transposon --key=ID $REPEAT -o hAT.bed
python -m jcvi.formats.gff bed --type Gypsy_LTR_retrotransposon --key=ID $REPEAT -o Gypsy.bed
python -m jcvi.formats.gff bed --type Copia_LTR_retrotransposon --key=ID $REPEAT -o Copia.bed
# Convert all repeat regions to BED format for landscape visualization
python -m jcvi.formats.gff bed --type all $REPEAT -o Repeats.bed
This pulls out different repeat types of interest and place them in separate BED
files. Then we are ready to plot:
python -m jcvi.graphics.landscape stack TAIR10_chr_all.fas \
--stacks=Repeats,Exons \
--window=250000 \
--shift=50000
The window
and shift
control the granularity of the stack plots (larger window means more smooth).
You can also zoom onto a single chromosome, with specified stacks
and heatmaps
. Note that the heatmaps of various genomic features are only available in the per-chromosome mode.
python -m jcvi.graphics.landscape heatmap TAIR10_chr_all.fas Chr2 \
--stacks=Repeats,Exons \
--heatmaps=Copia,Gypsy,Helitron,hAT,Exons
© Haibao Tang, 2010-2024