Skip to content

Miscellaneous plotting

Haibao Tang edited this page Jul 1, 2024 · 6 revisions

We have included a number of visualization routines in the jcvi package that might be useful to create customized plots. This is a common visualization task in genomics - to see where the various features are.

Tip

You don't actually need test data for this section. However, you can still download the following data and run script here.

Chromosome painting

Sometimes we wish to show categories for a number of chromosome regions in a genome. Let's say we want to plot the Arabidopsis genome. First, let's make a file that contains the chromosome sizes:

Chr1	30427671
Chr2	19698289
Chr3	23459830
Chr4	18585056
Chr5	26975502

This file can be created from the genome FASTA file and save it in athaliana.sizes. Now let's add a number of interesting features:

Chr1	1000000	3000000	type A
Chr4	12000000	15000000	type A
Chr2	5000000	10000000	type B
Chr5	20000000	22000000	type B
Chr2	15000000	18000000	LINE/SINE
Chr5	13000000	15000000	LINE/SINE
Chr3	2000000	3500000	LTR

This is known as the bed format, which must be tab-separated. Note that we use the functional class as the feature name. Since there are 4 classes, there will be 4 colors, one for each color. The command we use is:

python -m jcvi.graphics.chromosome features.bed \
    --size=athaliana.sizes --title="*Arabidopsis* genome features" --gauge

features_nocent

What if we want to include centromeres? Append the following lines to features.bed:

Chr1	14511721	14538721	centromere
Chr2	3611838	3611883	centromere
Chr3	13589756	13589816	centromere
Chr4	3133663	3133674	centromere
Chr5	11194537	11194848	centromere

Run the above command again, now with features indicating where the centromeres are. We have the following output:

features

Some users may wish to change the labels or the colors of the features. In that case, use a .idmap file which is another tab-separated file for example athaliana.idmap:

LINE/SINE	LS	#377EB8
LTR	LT	#4DAF4A
type A	TA	#984EA3
type B	TB	#FF7F00

Column 2 is the legend label, while column 3 is the desired color. We then pass this file to the command line:

python -m jcvi.graphics.chromosome features.bed athaliana.idmap \
    --size=athaliana.sizes --title="*Arabidopsis* genome features" --gauge

features_idmap

Genomic landscape

JCVI offers capabilities for genome landscape visualization, which is crucial in genome annotation. This includes the ability to create heatmaps that represent genome features alongside genome size. For instance, a user may want to visualize a genome of interest with specific genomic features using a heatmap.

The following steps look daunting, but are simply data preparations. Just bear with me.

First, download the standard GFF annotations for genes and repeats.

# Download the complete Arabidopsis thaliana genome sequence (TAIR10)
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz

# Uncompress the genome sequence file
gunzip TAIR10_chr_all.fas.gz

# Download the gene annotation file in GFF3 format for TAIR10
wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff

# Download the repeat annotations from a custom database, generated by EDTA
wget https://raw.githubusercontent.com/wyim-pgl/test_files/main/EDTA/Athaliana_167.fa.mod.EDTA.TEanno.gff3

# Define variables for file paths
REPEAT=Athaliana_167.fa.mod.EDTA.TEanno.gff3
GENOME=TAIR10_GFF3_genes.gff

We would then convert these annotations to BED format for easier manipulation and visualization.

# Convert specific repeat types from GFF3 to BED format for easier manipulation and visualization
python -m jcvi.formats.gff bed --type helitron --key=ID $REPEAT -o Helitron.bed
python -m jcvi.formats.gff bed --type hAT_TIR_transposon --key=ID $REPEAT -o hAT.bed
python -m jcvi.formats.gff bed --type Gypsy_LTR_retrotransposon --key=ID $REPEAT -o Gypsy.bed
python -m jcvi.formats.gff bed --type Copia_LTR_retrotransposon --key=ID $REPEAT -o Copia.bed

# Convert all repeat regions to BED format for landscape visualization
python -m jcvi.formats.gff bed --type all $REPEAT -o Repeats.bed

This pulls out different repeat types of interest and place them in separate BED files. Then we are ready to plot:

python -m jcvi.graphics.landscape stack TAIR10_chr_all.fas \
    --stacks=Repeats,Exons \
    --window=250000 \
    --shift=50000

The window and shift control the granularity of the stack plots (larger window means more smooth).

TAIR10_chr_all.png

You can also zoom onto a single chromosome, with specified stacks and heatmaps. Note that the heatmaps of various genomic features are only available in the per-chromosome mode.

python -m jcvi.graphics.landscape heatmap TAIR10_chr_all.fas Chr2 \
    --stacks=Repeats,Exons \
    --heatmaps=Copia,Gypsy,Helitron,hAT,Exons 

Chr2.png