Use assembly-stats to check the quality of some of the assemblies you produced during the workshop. Which of the assemblies you produced for Salmonella (considering Illumina and Nanopore data) has the best N50?
mamba create -y -p envs/assembly-stats assembly-stats
conda activate envs/assembly-stats
# Just an example, for the flye assembly
assembly-stats flye_output/assembly.fasta
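assembly-stats accepts several FASTA files at once, which makes the comparison easy. A minimal sketch, assuming the paths below point at assemblies from your own runs (adjust them as needed); the -t flag for tab-separated output is optional:
# tabulate several assemblies in one call (-t gives tab-separated output)
assembly-stats -t flye_output/assembly.fasta eco-medaka/consensus.fasta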
Use quast to compare different statistics for selected assemblies. For example, choose the assemblies you produced with Illumina and Nanopore as input. Investigate the output and reports. Do you agree with quast regarding what the "best" assembly is?
Check the --help and choose appropriate parameters to run the tool.
mamba create -y -p envs/quast quast
conda activate envs/quast
#quast --help
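A minimal sketch of a quast run comparing two assemblies against a reference; the input paths are taken from the other examples in this section and are placeholders for your own files:
# compare two assemblies against a reference; the report ends up in quast-results/
quast -o quast-results -r ecoli-reference.fna flye_output/assembly.fasta eco-medaka/consensus.fasta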
A neat way to check assemblies for error-induced gene fragmentation is using so-called IDEEL plots, as initially proposed by Mick Watson. Different people have implemented this approach, for example github.com/phiweger/ideel.
Here, we will combine various tasks to get this approach running.
Let's first install the code from this repository and the necessary dependencies via mamba. Then, we generate the protein database index, prepare the input genome files, and run the workflow.
mamba create -y -p envs/ideel snakemake prodigal diamond r-ggplot2 r-readr
conda activate envs/ideel
wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz"
git clone https://github.com/phiweger/ideel.git
cd ideel
# get a reference database of protein sequences
# generate an index with diamond
mv ../uniprot_sprot.fasta.gz .
diamond makedb --in uniprot_sprot.fasta.gz -d database_uniprot
# The output of the workflow will be written to --directory.
mkdir ideel-results
# In there, make a directory called "genomes"
mkdir ideel-results/genomes
# put assemblies in there with .fa file extension
cp ../ecoli-reference.fna ideel-results/genomes/eco-reference.fa
cp ../flye_output/assembly.fasta ideel-results/genomes/eco-consensus-flye.fasta
cp ../eco-consensus-racon.fasta ideel-results/genomes/eco-consensus-racon.fasta
cp ../eco-medaka/consensus.fasta ideel-results/genomes/eco-consensus-medaka.fasta
# The workflow wants the files to have .fa instead of .fasta!
rename 's/\.fasta$/.fa/' ideel-results/genomes/*.fasta
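# Note: the rename call above assumes the Perl "rename" utility; if it is
# not available on your system, a plain shell loop does the same job:
# for f in ideel-results/genomes/*.fasta; do mv "$f" "${f%.fasta}.fa"; done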
# run the workflow
snakemake --configfile config.json --config db=../database_uniprot.dmnd --directory ideel-results/ --cores 4
cd ..
Investigate the final output plots. How fragmented are your ORFs? Run the workflow on assemblies produced with Illumina and with Nanopore data. How does the fragmentation of your ORFs change when you polish your assembly?
When you have both Nanopore and Illumina data available for the same sample, it can be worthwhile to assemble the data "together". A commonly used tool for this task is (was) Unicycler.
Exercise
- make a mamba environment and install Unicycler
- check out the Unicycler code repository
- ... and the --help to learn how to use the tool
- provide short and long reads as input (a minimal sketch follows after this list)
- you can decide to use the raw or length-filtered long reads
- in the section above about Illumina short-read assembly you will find the paths to the Illumina reads
- how does the resulting assembly compare to the short- and long-read-only assemblies?
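A minimal sketch of the install and a hybrid run; illumina_R1.fastq.gz, illumina_R2.fastq.gz and nanopore.fastq.gz are placeholder file names, so adjust them to your data:
mamba create -y -p envs/unicycler unicycler
conda activate envs/unicycler
# hybrid assembly: -1/-2 are the paired Illumina reads, -l the long reads
unicycler -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz -l nanopore.fastq.gz -o unicycler_output -t 4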
Note that Unicycler internally uses SPAdes as a first step, which means that the assembly process starts from the short reads. This can introduce problems early on that cannot be fixed at later steps. Because of that, it can be better to start with an initial long-read assembly (e.g. flye) and then bring in the short-read data later for polishing (e.g. polypolish) instead of using Unicycler.
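A minimal sketch of such a polishing step with polypolish, assuming polypolish >= 0.6 (older versions use a slightly different command line), bwa installed in the same environment, and placeholder Illumina read paths:
# align each Illumina read file separately, keeping all alignments (-a), as polypolish expects
bwa index flye_output/assembly.fasta
bwa mem -t 4 -a flye_output/assembly.fasta illumina_R1.fastq.gz > aln_1.sam
bwa mem -t 4 -a flye_output/assembly.fasta illumina_R2.fastq.gz > aln_2.sam
# filter read pairs with unusual insert sizes, then polish the draft
polypolish filter --in1 aln_1.sam --in2 aln_2.sam --out1 filt_1.sam --out2 filt_2.sam
polypolish polish flye_output/assembly.fasta filt_1.sam filt_2.sam > assembly-polypolish.fasta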
Remember that you can compare different assemblies via quast!
Nowadays, Nanopore sequencing has improved so much that in many cases it no longer makes sense to run short-read-first assemblies. Ryan Wick, the original developer of Unicycler, Polypolish, Trycycler, ..., wrote some very good explanations about this.
Exercise
- use Trycycler to combine the assembly results you have from different runs of the same sample into a consensus long-read assembly (see the sketch below)
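A minimal sketch of the Trycycler steps, assuming your long-read assemblies are collected in a directory called assemblies/ and nanopore.fastq.gz is a placeholder for your reads; see the Trycycler documentation for how to inspect and repair clusters between the steps:
mamba create -y -p envs/trycycler trycycler
conda activate envs/trycycler
# cluster contigs from the different assemblies of the same sample
trycycler cluster --assemblies assemblies/*.fasta --reads nanopore.fastq.gz --out_dir trycycler
# then, per cluster: reconcile the contigs, align them, partition the reads, and build a consensus
trycycler reconcile --reads nanopore.fastq.gz --cluster_dir trycycler/cluster_001
trycycler msa --cluster_dir trycycler/cluster_001
trycycler partition --reads nanopore.fastq.gz --cluster_dirs trycycler/cluster_*
trycycler consensus --cluster_dir trycycler/cluster_001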
We can use sourmash to classify the assembled contigs taxonomically using reference sequences. An alternative would be, for example, kraken2 at the read level.
- compute sequence signatures for the inputs (compute)
- subsample the input k-mers, keeping roughly 1 hash per 10000 k-mers (--scaled 10000)
- select 31 as the k-mer size (-k 31)
mamba create -y -p envs/sourmash sourmash
conda activate envs/sourmash
# download sourmash GTDB index v202 w/ taxonomy
# GTDB genomic representatives (47.8k genomes), LCA, kmer 31 --> https://sourmash.readthedocs.io/en/latest/databases.html
wget --no-check-certificate https://osf.io/ypsjq/download -O gtdb-rs202.genomic-reps.k31.lca.json.gz
DB='gtdb-rs202.genomic-reps.k31.lca.json.gz'
# calculate signatures for your genome
sourmash compute --scaled 10000 -k 31 eco-medaka/consensus.fasta -o sourmash.sig
sourmash lca classify --db $DB --query sourmash.sig -o taxonomic-classification.txt
#check results
cat taxonomic-classification.txt
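Note that sourmash compute is the older interface; in recent sourmash versions (4.x) the equivalent signature would be computed with sketch dna, roughly:
# same signature with the newer sourmash CLI (version 4 and later)
sourmash sketch dna -p k=31,scaled=10000 eco-medaka/consensus.fasta -o sourmash.sig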
Additional information on Sourmash: Code | Publication