Metagenomic Co-Assembly Proof of Concept
By Tomer Altman, Connor Morgan-Lang, and Ryan McLaughlin
We set out to demonstrate that co-assembling read sets from many samples that share a common source can recover more complete genomes, including genomes of rare members of the community. To this end, we fetched all 50 runs of PRJNA602689, which contains runs SRR10951656 and SRR10951660; this is the "low coverage Infectious Bronchitis Virus in Sus scrofa" study. Although we did not perform assembly for all of the benchmark datasets, we zeroed in on one use case to see whether we could recover additional viral contigs. Below we present our approach, our findings, and instructions on fetching the data resources.
We trimmed reads using XXXX. For the co-assembly we used MEGAHIT version XXXX. We used MetaBAT 2 to bin the contigs, and ran CheckV on all bins to check for the presence and quality of recovered viral contigs. To annotate the contigs, we ran NCBI BLAST in blastn mode against a local copy of the NT database, restricting the search to viral entries.
The entire workflow, except for the BLAST annotation, was run on a UBC computing grid, on a machine with XXXX cores and approximately XXXX GB of RAM.
The combined read sets contained XXXX read pairs and took up XXXX GB on disk. The co-assembly took XXXX minutes to run, with an approximate RAM utilization of XXXX. The binning took XXXX minutes. Running CheckV on all of the bins took XXXX minutes.
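The key idea of co-assembly is that all runs are handed to the assembler in a single invocation rather than assembled one by one. The following is a minimal sketch of how such a MEGAHIT command line could be built, not the exact command we ran. The file naming scheme (`<run>_1.fastq.gz` and `<run>_2.fastq.gz`), the `trimmed` and `coassembly` directory names, and the thread count are all assumptions made for illustration; the sketch only constructs the command and does not execute it.

```python
from pathlib import PurePosixPath


def megahit_coassembly_cmd(run_ids, reads_dir, out_dir, threads=8):
    """Build a single MEGAHIT command that assembles all runs together.

    MEGAHIT accepts comma-separated lists of files for -1 and -2, so every
    run's forward and reverse reads are pooled into one assembly.
    """
    reads_dir = PurePosixPath(reads_dir)
    # Hypothetical naming scheme for the trimmed paired-end reads.
    forward = [str(reads_dir / f"{run}_1.fastq.gz") for run in run_ids]
    reverse = [str(reads_dir / f"{run}_2.fastq.gz") for run in run_ids]
    return [
        "megahit",
        "-1", ",".join(forward),
        "-2", ",".join(reverse),
        "-o", str(out_dir),
        "-t", str(threads),
    ]


# Two of the 50 runs are shown; the full list would be passed the same way.
cmd = megahit_coassembly_cmd(["SRR10951656", "SRR10951660"], "trimmed", "coassembly")
print(" ".join(cmd))
```

Returning the command as a list makes it straightforward to pass to `subprocess.run` once the paths have been checked.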
We recovered an IBV bin totalling 27,753 bp across three contigs of 15,999, 8,971, and 2,783 bp, respectively. Although that total is the right size for a coronavirus genome, CheckV rated the largest contig as medium-quality and the other two contigs as low-quality.
CheckV flagged the following bins as being of high quality:
checkv_bin_12/quality_summary.tsv:k119_5808 15205 1.0 6 6 0 High-quality High-quality 99.74 AAI-based 0.0 No
checkv_bin_17/quality_summary.tsv:k119_6451 6458 1.0 3 3 0 High-quality High-quality 99.35 AAI-based 0.0 No
checkv_bin_47/quality_summary.tsv:k119_5360 6670 1.0 3 3 0 High-quality High-quality 100.0 AAI-based 0.0 No
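Rows like the ones above can be collected programmatically rather than by eye. The sketch below parses the three rows exactly as shown and keeps those rated High-quality. The column positions (length in the second field, quality tier in the seventh, completeness in the ninth) are inferred from the rows above and are an assumption; the layout of `quality_summary.tsv` can differ between CheckV versions, so check the header of your own file first.

```python
# Rows copied from the output above, in the form "<file>:<contig_id> <fields...>".
rows = """\
checkv_bin_12/quality_summary.tsv:k119_5808 15205 1.0 6 6 0 High-quality High-quality 99.74 AAI-based 0.0 No
checkv_bin_17/quality_summary.tsv:k119_6451 6458 1.0 3 3 0 High-quality High-quality 99.35 AAI-based 0.0 No
checkv_bin_47/quality_summary.tsv:k119_5360 6670 1.0 3 3 0 High-quality High-quality 100.0 AAI-based 0.0 No
""".splitlines()


def parse_row(line):
    """Split one grep-style row into the fields used below."""
    source, _, rest = line.partition(":")
    fields = rest.split()
    return {
        "bin": source.split("/")[0],
        "contig_id": fields[0],
        "length_bp": int(fields[1]),
        "checkv_quality": fields[6],        # assumed column position
        "completeness": float(fields[8]),   # assumed column position
    }


high_quality = [r for r in map(parse_row, rows) if r["checkv_quality"] == "High-quality"]
for r in high_quality:
    print(r["bin"], r["contig_id"], r["length_bp"], r["completeness"])
```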
Looking up their annotations in the BLAST results:
k119_5808 MH996950.1 11176 Avian avulavirus 1 15205 15147 15138 99.908 99 27237 0.0
k119_6451 KP747574.1 1239567 Mamastrovirus 3 6458 6402 6400 95.828 99 10356 0.0
k119_5360 MK613068.1 1105379 Porcine astrovirus 4 6670 6733 6770 77.947 99 5383 0.0
So we recovered two high-quality genomes, one for Avian avulavirus 1 (99.91% identity) and one for Mamastrovirus 3 (95.83% identity), both of which closely match sequences in the reference database. We also recovered a third high-quality genome, Porcine astrovirus 4, which has only 77.95% identity to the closest reference sequence in the database.
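The comparison above can be reproduced from the BLAST rows themselves. The sketch below parses the three hits as shown and picks out the one with the lowest identity to its closest reference. The interpretation of the columns is an assumption based on the values: the third field is taken to be a taxonomy ID, and the fourth-from-last field to be percent identity (it matches the 77.95% quoted above). Because the organism name contains spaces, the name is recovered by counting fields from both ends of the row.

```python
# Rows copied from the BLAST annotation above.
blast_rows = """\
k119_5808 MH996950.1 11176 Avian avulavirus 1 15205 15147 15138 99.908 99 27237 0.0
k119_6451 KP747574.1 1239567 Mamastrovirus 3 6458 6402 6400 95.828 99 10356 0.0
k119_5360 MK613068.1 1105379 Porcine astrovirus 4 6670 6733 6770 77.947 99 5383 0.0
""".splitlines()


def parse_hit(line):
    """Parse one hit; column meanings are assumed, not taken from a header."""
    fields = line.split()
    return {
        "contig_id": fields[0],
        "accession": fields[1],
        "taxid": fields[2],                 # assumed to be a taxonomy ID
        # The name is everything between the taxid and the 7 trailing numeric fields.
        "name": " ".join(fields[3:-7]),
        "pident": float(fields[-4]),        # assumed to be percent identity
    }


hits = [parse_hit(row) for row in blast_rows]
most_divergent = min(hits, key=lambda h: h["pident"])

for h in sorted(hits, key=lambda h: h["pident"], reverse=True):
    print(f'{h["contig_id"]}\t{h["name"]}\t{h["pident"]:.2f}%')
```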