
\section{The Metagenomics Benchmarking Codeathon}
\label{sec:psss-codeathon}

The National Institutes of Health (NIH) Office of Data Science Strategy (ODSS), the National Library of Medicine’s (NLM’s) National Center for Biotechnology Information (NCBI), and the Department of Energy’s (DOE’s) Office of Biological and Environmental Research (BER) hosted scientists from around the world in a virtual Petabyte-Scale Sequence Search: Metagenomics Benchmarking Codeathon. The codeathon, held September 27--October 1, 2021, attracted experts from national laboratories including Los Alamos National Laboratory, research institutions including the Joint Genome Institute, and students from universities across the world to develop benchmarking approaches that address the challenges of conducting large-scale analyses of metagenomic data.

To take advantage of this growing collection of biomedical data, efficient methods are needed to search the archive using nucleotide sequences. Just as the introduction of tools like the Basic Local Alignment Search Tool (BLAST) provided a key to unlock the potential of the GenBank archive, similar approaches are needed for the Sequence Read Archive (SRA). To this end, we have developed an interagency Emerging Solutions in Petabyte Scale Sequence Search (ESPSSS) initiative, which hosted its first workshop in June. Explaining the impetus for the workshop, Dr. Susan Gregurick, NIH Associate Director for Data Science and ODSS Director, said: “{\it We all share a common problem and a need to develop, enhance, and implement methods that streamline data access, search or findability, and ultimately data reuse}.” Dr. Todd Anderson, the Biological Systems Science Division Director from the DOE, added that “{\it There is much to be gained from employing big data technology to assist with experimentation in biological sciences}.”

As metagenomic samples comprise more than 30\% of the sequence records in SRA, ESPSSS is initially focusing on metagenomic benchmarking. In the spirit of developing community-driven solutions, ESPSSS hosted the virtual Petabyte-Scale Sequence Search: Metagenomics Benchmarking Codeathon in September to bring together students, researchers, and computing professionals to collaborate on developing sequence search benchmarking approaches.
Collaborative work by codeathon participants, who were split into four teams, generated the following proof-of-concept or early-stage solutions:
\begin{enumerate}
    \item a pipeline for identifying metagenomic samples using user-provided long sequence queries,
    \item a gold-standard dataset and pipeline to benchmark contig containments,
    \item a benchmark harness for read/contig tools, and
    \item a pipeline to combine an experimental SRA sequence index with BLAST.
\end{enumerate}