Merge pull request #7 from amoeba/joss-review

Paper changes for JOSS Review
pierrepeterlongo · Aug 27, 2024 · dcbd0ae · dcbd0ae
2 parents 496a99f + bc91b57
commit dcbd0ae
Showing 1 changed file with 15 additions and 149 deletions.
diff --git a/paper/paper.md b/paper/paper.md
@@ -8,13 +8,13 @@ tags:
 authors:
   - name: Anthony Baire
     affiliation: 1
-  - name: Pierre Marijon 
+  - name: Pierre Marijon
     orcid: 0000-0002-6694-6873
     affiliation: 2
   - name: Francesco Andreace
     orcid: 0009-0008-0566-200X
     affiliation: 3
-  - name: Pierre Peterlongo 
+  - name: Pierre Peterlongo
     orcid: 0000-0003-0776-6407
     affiliation: 1
 affiliations:
@@ -25,7 +25,7 @@ affiliations:
  - name: Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
    index: 3
 date: 10 June 2024
-bibliography: paper.bib 
+bibliography: paper.bib
 
 
 ---
@@ -48,9 +48,7 @@ A vast majority of bioinformatics tools dedicated to the treatment of
   $k$-mers in hundreds of millions of reads in a matter of a few
   minutes.
 
-  Availability: github.com/pierrepeterlongo/back_to_sequences
-
-# Statement of need
+# Statement of Need
 
 In the 2010s, following the emergence of next-generation sequencing
 technology, read assembly strategies based on the
@@ -64,7 +62,7 @@ unknown sequencing strand) were complex to handle with OLC while being
 easy to handle or simply solved with the dBG
 approach [@li2012comparison].
 
-Recall that in the dBG assembly approach, 1. all $k$-mers(words of
+Recall that in the dBG assembly approach, 1. all $k$-mers (words of
 length $k$) from a set of reads are counted; 2. those with an abundance
 lower than a threshold are considered to contain sequencing errors
 and are discarded; 3. the remaining $k$-mers are organized in a dBG; 4.
@@ -86,15 +84,15 @@ discovery [@uricaru2015reference] to cite only a few.
 One of the keys to the success of the use of $k$-mers is its low
 resource needs. Whatever the sequencing coverage, once
 filtered, the number of distinct $k$-mers is at most equal to the
-original genome size. This offers a minimal impact on RAM and/or disk
-needs. However, this comes at the cost of losing the link between each
-$k$-mer and the sequences(s) from which it originates. Storing
-explicitly these links would reintroduce the problem associated with the
+original genome size. This offers a minimal impact on random access memory
+(RAM) and/or disk needs. However, this comes at the cost of losing the link
+between each $k$-mer and the sequence(s) from which it originates. Storing
+these links explicitly would reintroduce the problem associated with the
 abundance of original reads, as the link between each original read and
 each of its $k$-mers would have to be stored. For instance, considering
 $k$-mers from a sequencing experiment of a human genome ($\approx 3$
-billions nucleotides) with a coverage of 50x (each $k$-mer occurs on
-average in 50 distinct reads) would require more than 2Tb of space
+billion nucleotides) with a coverage of 50x (each $k$-mer occurs on
+average in 50 distinct reads) would require more than 2TB of space
 considering 64 bits for storing each link and 64 bits for storing the
 associated read identifier. This is not acceptable.
 
@@ -121,104 +119,9 @@ In this context, we propose `back_to_sequences`, a tool specifically
 dedicated to extracting from $\mathcal{S}$, a set of sequences (eg.
 reads), those that contain some of the $k$-mers from a set $\mathcal{K}$
 given as input. The occurrences of $k$-mers in each sequence of the
-queried set $\mathcal{S}$ can also be output. In addition to this
-fundamental task, additional features are described in the following.
-
-# Results
-
-The tool we introduce, `back_to_sequences`, uses the native `rust` data
-structures to index and query $k$-mers. Its advantages are 1. its
-simplicity (install and run); 2. its low RAM usage; 3. its fast running
-time; 4. its additional features adapted to $k$-mers: control of their
-orientation (see below) and sequence filtration based on the minimal and
-maximal percent of $k$-mers shared with the indexed set. To the best of
-our knowledge, there exists no other tool dedicated to this specific
-task.
-
-In the general case, the DNA sequence orientation is unknown. DNA can be
-sequenced either in one strand (eg $AAGGC$) or in the reverse complement
-strand, read from right to left and commutating $A$ with $T$, and $C$
-with $G$ (eg $GCCTT$). This is why $k$-mers from $\mathcal{S}$ and
-$\mathcal{K}$ can be considered either as original or as *canonical*,
-the smallest alphabetical word between the $k$-mer and its reverse
-complement.
-
-We propose a simple performance benchmark to assess the
-`back_to_sequences` scaling performances on random sequences and on real
-metagenomic data.
-
-## Benchmark on random sequences {#ssec:random_banch}
-
-We generated a random sequence $s$ on the alphabet $\{A,C,G,T\}$ of size
-100 million base pairs. From this sequence $s$, we randomly extracted
-50,000 sub-sequences each of size 50. We consider these sequences as
-containing the set $\mathcal{K}$ of $k$-mers to be searched. As we used
-$k=31$, each subsequence contains $50-31+1 = 20 kmers$. Doing so, we
-consider a set of at most $50000\times 20 = 1,000,000$ $k$-mers.
-
-We also generated six sets:
-$\{\mathcal{S}_{10k}, \mathcal{S}_{100k}, \mathcal{S}_{1M}, \mathcal{S}_{10M}, \mathcal{S}_{100M},  \mathcal{S}_{200M}\}$,
-composed respectively of 10 thousand, 100 thousand, one million, 10
-million, 100 million, and 200 million sequences, each of length 100
-nucleotides. Each sequence is randomly sampled from $s$.
-
-In each sequence of each set $\mathcal{S}_i$, we searched for the
-existence of $k$-mers indexed in $\mathcal{K}$. The performances are
-provided Table [1](#tab:res_bench){reference-type="ref"
-reference="tab:res_bench"}. Presented results were obtained on the
-GenOuest platform on a node with 32 threads Xeon 2.2 GHz, on a
-MacBook, Apple M2 pro, 16 GB RAM with 10 threads, and an AMD Ryzen
-7 4.2 GHz 5800X 64 GB RAM with 16 threads, respectively denoted by "Time GenOuest",
- "Time mac" and "Time AMD" in this table. These results highlight
-the scalability of the `back_to_sequences` tool, able to search a
-million of $k$-mers in hundreds of millions of reads on a laptop
-computer in a matter of dozens of minutes with negligible RAM usage.
-
-::: {#tab:res_bench}
-    Number of reads  Time GenOuest   Time mac   Time AMD   max RAM
-  ----------------- --------------- ---------- ---------- ---------
-             10,000      0.7s          0.5s       0.4s     0.13 GB
-            100,000      0.8s          0.8s       1.2s     0.13 GB
-          1,000,000      2.0s          3.5s       7,1s     0.13 GB
-         10,000,000      7.1s          11s        16s      0.13 GB
-        100,000,000      47s           58s        48s      0.13 GB
-        200,000,000      1m32          1m52       1m44     0.13 GB
-
-  : The `back_to_sequences` performances searching one millions $k$-mers
-  in 10 thousand to 100 million reads. Tested version: 2.6.0.
-:::
-
+queried set $\mathcal{S}$ can also be output.
 
-## Benchmark on *Tara* ocean seawater metagenomic data
-
-The previously proposed benchmark shows the scalability of the proposed
-approach. Although performed on random sequences, there is no objective
-reason why performance should differ on real data, regardless of the
-number of $k$-mers actually detected in the data. We verify this claim
-by applying `back_to_sequences` to real complex metagenomic sequencing
-data from the *Tara* ocean project [@Sunagawa2020improved].
-
-We downloaded one of the *Tara* ocean read sets: station number 11
-corresponding to a surface Mediterranean sample, downloaded from the
-European Nucleotide Archive, identifier ERS488262[^1]. We extracted the
-first 100 million reads, which are all of length 100. Doing so, this
-test is comparable to the result presented
-Table [1](#tab:res_bench){reference-type="ref"
-reference="tab:res_bench"} querying 100 million random reads. Using
-`back_to_sequences` we searched in these reads each of the 69 31-mers
-contained in its first read. On the GenOuest node, `back_to_sequences`
-enabled to retrieve all reads that contain at least one of the indexed
-$k$-mers in 5m17 with negligible RAM usage of 45MB. As expected, these
-scaling results are in line with the results presented
-Table [1](#tab:res_bench){reference-type="ref"
-reference="tab:res_bench"}.
-
-Out of curiosity, we ran `back_to_sequences` on the full read set,
-composed of $\approx 26.3$ billion $k$-mers, and 381 million reads,
-again for searching the 69 $k$-mers contained in its first read. This
-operation took 20m11.
-
-## Possible alternatives
+# Possible Alternatives
 
 To the best of our knowledge, there exists no tool specifically
 dedicated to this task, while indexing the set of $k$-mers$\mathcal{K}$.
@@ -259,45 +162,6 @@ potentially its reverse complement. Finally, these tools do not easily
 provide the number of occurrences or occurrence positions of each of the
 searched patterns when there are many.
 
-# Method and features
-
-`back_to_sequences` is written in `rust`. It uses the native HashMap for storing the searched $k$-mer set,
-with alternative aHash [@zhao2020ahash] hash function.
-Depending on the user choice, the original or the canonical version of
-each $k$-mer from the "reference-$k$-mers" set is indexed. The source code
-is unit-tested and functionally tested using tools from the Rust community.
-
-At query time, given a sequence $s$ from $\mathcal{S}$, all its $k$-mers
-are extracted and queried. Depending on the user's choice, the canonical
-or the original representation of each $k$-mer from the reference set is
-indexed. Again, depending on the user's choice, for each queried
-$k$-mer, the original version, its reverse complement, or its canonical
-representation is queried. If the queried $k$-mer belongs to the index,
-its number of occurrences is increased. The number of $k$-mers extracted
-from the sequence $s$ that have a match with the index is output
-together with the original sequence. As an option, in addition to only
-counting the number of occurrences of each $k$-mer from $\mathcal{K}$ in
-$\mathcal{S}$, `back_to_sequences` enables to also record their
-occurrence positions and orientation and to output this information.
-Sequences are queried in parallel.
-
-A minimal and a maximal threshold can be fixed by the user. A queried
-sequence is output only if its percentage of $k$-mers that belong to the
-searched $k$-mers is strictly higher than the minimal threshold and
-lower or equal to the maximal threshold. While the minimal threshold
-enables to focus on sequences that are similar enough with a set of
-$k$-mers, the maximal threshold offers for instance a way to remove
-contaminated sequences.
-
-The sequences to be queried can be provided as a fasta or fastq file
-(gzipped or not). They can also be read directly from the standard
-input (*stdin*). This offers the may to stream sequences as they arrive,
-for instance when they are obtained during an Oxford Nanopore sequencing
-process.
-
-The output format of queried sequences respects the input format: if the
-input is in fasta (resp. fastq), the output is in fasta (resp. fastq).
-
 # Conclusion
 
 We believe that `back_to_sequences` is a generic and handy tool that
@@ -317,3 +181,5 @@ SeqDigger (ANR-19-CE45-0008), and by the Inria Challenge "OmicFinder"
 (<https://project.inria.fr/omicfinder/>).
 
 [^1]: <http://www.ebi.ac.uk/ena/data/view/ERS488262>
+
+# References