Metagenomic Co-Assembly Proof of Concept

By Tomer Altman, Connor Morgan-Lang, and Ryan McLaughlin

Abstract

We set out to demonstrate that co-assembling read sets from many samples that share a common source recovers more complete genomes, including genomes of rare members of the community. To this end, we fetched all 50 runs of PRJNA602689, the "low coverage Infectious Bronchitis Virus in Sus scrofa" study, which contains runs SRR10951656 and SRR10951660. Although we did not assemble all of the benchmark datasets, we focused on this one use case to see whether we could recover additional viral contigs. Below we present our approach, our findings, and instructions for fetching the data resources.

Methods

We trimmed reads using XXXX. We performed the co-assembly with MEGAHIT version XXXX and binned the resulting contigs with MetaBAT 2. We ran CheckV on all bins to check for the presence and quality of recovered viral contigs. To annotate the contigs, we ran NCBI BLAST in blastn mode against a local copy of the nt database, restricting the search to viral entries.
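The co-assembly, binning, and CheckV steps could be chained roughly as sketched below. This is an illustration only, not the exact commands used for this study: the read file names, directory layout, and thread count are hypothetical, the trimming step is left out because the tool is not named above, and only basic MEGAHIT, MetaBAT 2, and CheckV options are shown. The functions build the command lines without running them.

```python
import shlex


def build_commands(run_ids, reads_dir="trimmed", out_dir="coassembly", threads=16):
    """Build (but do not run) the co-assembly and binning command lines."""
    # MEGAHIT accepts comma-separated lists of mate files, so every run is
    # pooled into one co-assembly. The file naming scheme is hypothetical.
    mates_1 = ",".join(f"{reads_dir}/{run}_1.fastq.gz" for run in run_ids)
    mates_2 = ",".join(f"{reads_dir}/{run}_2.fastq.gz" for run in run_ids)
    assembly_dir = f"{out_dir}/megahit"
    megahit = ["megahit", "-1", mates_1, "-2", mates_2,
               "-o", assembly_dir, "-t", str(threads)]
    # MetaBAT 2 bins the assembled contigs and writes <prefix>.<n>.fa files.
    # An abundance/depth file (-a) is optional and omitted here because the
    # text above does not say whether one was used.
    metabat = ["metabat2", "-i", f"{assembly_dir}/final.contigs.fa",
               "-o", f"{out_dir}/bins/bin", "-t", str(threads)]
    return [megahit, metabat]


def checkv_command(bin_fasta, out_dir, threads=16):
    """Build the CheckV command for one bin.

    CheckV also needs its reference database, supplied either with -d or
    through the CHECKVDB environment variable.
    """
    return ["checkv", "end_to_end", bin_fasta, out_dir, "-t", str(threads)]


if __name__ == "__main__":
    # Dry run: print the commands for two of the study's runs.
    for cmd in build_commands(["SRR10951656", "SRR10951660"]):
        print(shlex.join(cmd))
    print(shlex.join(checkv_command("coassembly/bins/bin.1.fa", "checkv_bin_1")))
```

Passing all 50 run accessions to `build_commands` would give the single pooled MEGAHIT invocation; `checkv_command` would then be called once per bin that MetaBAT 2 produces.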

The entire workflow, except for the BLAST annotation, was run on a UBC computing grid, on a machine with XXXX cores and approximately XXXX GB of RAM.

Results

Pipeline Performance

The combined read pairs numbered XXXX and occupied XXXX GB on disk. The co-assembly took XXXX minutes to run, with an approximate RAM utilization of XXXX. Binning took XXXX minutes, and running CheckV on all of the bins took XXXX minutes.

Viral Findings

Infectious Bronchitis Virus

We recovered an IBV bin totaling 27,753 bp across three contigs of 15,999 bp, 8,971 bp, and 2,783 bp. Although this total is in the expected size range for a coronavirus genome, CheckV rated the largest contig as medium-quality and the other two contigs as low-quality.
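The bin size can be checked with a few lines of arithmetic. The sketch below sums the three contig lengths reported above and compares the total with a reference genome length. The 27,600 bp reference value is an assumption standing in for a typical IBV genome, not a figure from this study.

```python
def summarize_bin(contig_lengths, reference_length):
    """Return the total bin length and its ratio to a reference genome length."""
    total = sum(contig_lengths)
    return total, total / reference_length


# Contig lengths (bp) of the recovered IBV bin, as reported in the text.
ibv_contigs = [15_999, 8_971, 2_783]

# Assumed approximate IBV genome length (bp); replace with the length of
# whichever reference sequence is actually used for comparison.
ASSUMED_IBV_LENGTH = 27_600

total_bp, ratio = summarize_bin(ibv_contigs, ASSUMED_IBV_LENGTH)
print(f"total: {total_bp} bp, {ratio:.3f}x the assumed reference length")
```

A ratio near 1 only says that the bin is about the right size. It says nothing about whether the three contigs are contiguous or non-overlapping, which is why the per-contig CheckV ratings still matter.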

High-Quality Bins

CheckV flagged the following bins as being of high quality:

checkv_bin_12/quality_summary.tsv:k119_5808     15205   1.0     6       6       0       High-quality    High-quality    99.74   AAI-based       0.0     No
checkv_bin_17/quality_summary.tsv:k119_6451     6458    1.0     3       3       0       High-quality    High-quality    99.35   AAI-based       0.0     No
checkv_bin_47/quality_summary.tsv:k119_5360     6670    1.0     3       3       0       High-quality    High-quality    100.0   AAI-based       0.0     No
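Listings like the one above come from searching every bin's quality_summary.tsv, and they can be parsed back into records for downstream use. The sketch below is a minimal example. The field positions are read off the listing itself rather than from a CheckV header line, so treating the seventh field as the quality tier and the ninth as the completeness estimate is an assumption.

```python
# The three lines from the listing, with whitespace-separated fields.
LISTING = """\
checkv_bin_12/quality_summary.tsv:k119_5808 15205 1.0 6 6 0 High-quality High-quality 99.74 AAI-based 0.0 No
checkv_bin_17/quality_summary.tsv:k119_6451 6458 1.0 3 3 0 High-quality High-quality 99.35 AAI-based 0.0 No
checkv_bin_47/quality_summary.tsv:k119_5360 6670 1.0 3 3 0 High-quality High-quality 100.0 AAI-based 0.0 No
"""


def parse_listing(text):
    """Parse 'path:row' lines into dictionaries, one per contig."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        path, row = line.split(":", 1)   # file path, then the table row
        fields = row.split()
        records.append({
            "bin": path.split("/")[0],          # e.g. checkv_bin_12
            "contig": fields[0],
            "length": int(fields[1]),
            "quality": fields[6],               # assumed: quality tier
            "completeness": float(fields[8]),   # assumed: completeness (%)
        })
    return records


records = parse_listing(LISTING)
high_quality = [r["contig"] for r in records if r["quality"] == "High-quality"]
print(high_quality)
```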

Looking up their annotations from BLAST:

k119_5808       MH996950.1      11176   Avian avulavirus 1      15205   15147   15138   99.908  99      27237   0.0
k119_6451       KP747574.1      1239567 Mamastrovirus 3 6458    6402    6400    95.828  99      10356   0.0
k119_5360       MK613068.1      1105379 Porcine astrovirus 4    6670    6733    6770    77.947  99      5383    0.0

We thus recovered two high-quality genomes that closely match sequences in the reference database: Avian avulavirus 1 (99.91% identity) and Mamastrovirus 3 (95.83% identity). We also recovered a third high-quality genome, annotated as Porcine astrovirus 4, which has only 77.95% identity to the closest reference sequence in the database.
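The distinction drawn above, between genomes that closely match a reference and one that is divergent, can be made explicit with a simple identity cutoff. The sketch below uses the contig names, top-hit names, and percent identities from the BLAST table. The 90% threshold is a hypothetical value chosen for illustration, not a cutoff used in this study.

```python
# (contig, top BLAST hit, percent identity), copied from the table above.
TOP_HITS = [
    ("k119_5808", "Avian avulavirus 1", 99.908),
    ("k119_6451", "Mamastrovirus 3", 95.828),
    ("k119_5360", "Porcine astrovirus 4", 77.947),
]


def split_by_identity(hits, min_identity=90.0):
    """Split hits into close matches and divergent ones.

    min_identity is a hypothetical cutoff (percent); tune it to the
    question being asked.
    """
    close = [h for h in hits if h[2] >= min_identity]
    divergent = [h for h in hits if h[2] < min_identity]
    return close, divergent


close, divergent = split_by_identity(TOP_HITS)
for contig, name, pident in divergent:
    print(f"{contig}: closest hit {name} at {pident:.2f}% identity")
```

With this cutoff only k119_5360 is reported as divergent, matching the reading given in the text.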

Data Files
