nf-core · jfy133 · Mar 16, 2021
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -20,7 +20,7 @@ jobs:
     strategy:
       matrix:
         # Nextflow versions: check pipeline minimum and current latest
-        nxf_ver: ['20.04.0', '']
+        nxf_ver: ['20.07.1', '']
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@v2
@@ -34,13 +34,13 @@ jobs:
 
       - name: Build new docker image
         if: env.MATCHED_FILES
-        run: docker build --no-cache . -t nfcore/eager:2.3.1
+        run: docker build --no-cache . -t nfcore/eager:2.3.2
 
       - name: Pull docker image
         if: ${{ !env.MATCHED_FILES }}
         run: |
           docker pull nfcore/eager:dev
-          docker tag nfcore/eager:dev nfcore/eager:2.3.1
+          docker tag nfcore/eager:dev nfcore/eager:2.3.2
 
       - name: Install Nextflow
         env:

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,40 @@
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
+## [2.3.2] - 2021-03-16
+
+### `Added`
+
+- [#687](https://github.com/nf-core/eager/pull/687) - Adds Kraken2 unique kmer counting report
+- [#676](https://github.com/nf-core/eager/issues/676) - Refactor help message / summary message formatting to automatic versions using nf-core library
+- [#682](https://github.com/nf-core/eager/issues/682) - Add AdapterRemoval `--qualitymax` flag to allow a maximum FASTQ Phred score above 41
+
+### `Fixed`
+
+- [#666](https://github.com/nf-core/eager/issues/666) - Fixed input file staging for `print_nuclear_contamination`
+- [#631](https://github.com/nf-core/eager/issues/631) - Update minimum Nextflow version to 20.07.1, due to unfortunate bug in Nextflow 20.04.1 causing eager to crash if patch pulled
+- Made MultiQC crash behaviour stricter when dealing with large datasets, as reported by @ashildv
+- [#652](https://github.com/nf-core/eager/issues/652) - Added note to documentation that when using `--skip_collapse` this will use _paired-end_ alignment mode with mappers when using PE data
+- [#626](https://github.com/nf-core/eager/issues/626) - Add additional checks to ensure pipeline will give useful error if cells of a TSV column are empty
+- Added note to documentation that when using `--skip_collapse` this will use _paired-end_ alignment mode with mappers when using PE data
+- [#673](https://github.com/nf-core/eager/pull/673) - Fix Kraken database loading when loading from directory instead of compressed file
+- [#688](https://github.com/nf-core/eager/issues/668) - Allow pipeline to complete, even if Qualimap crashes due to an empty or corrupt BAM file for one sample/library
+- [#683](https://github.com/nf-core/eager/pull/683) - Sets `--igenomes_ignore` to true by default, as it is currently rarely used by users and ignoring it makes resolving configs less complex
+- Added exit code `140` to re-tryable exit code list to account for certain scheduler wall-time limit fails
+- [#672](https://github.com/nf-core/eager/issues/672) - Removed java parameter from picard tools which could cause memory issues
+- [#679](https://github.com/nf-core/eager/issues/679) - Refactor within-process bash conditions to groovy/nextflow, due to incompatibility with some server environments
+- [#690](https://github.com/nf-core/eager/pull/690) - Fixed ANGSD output mode for beagle by setting `-doMajorMinor 1` as default in that case
+- [#693](https://github.com/nf-core/eager/issues/693) - Fixed broken TSV input validation for the Colour Chemistry column
+- [#695](https://github.com/nf-core/eager/issues/695) - Fixed incorrect `-profile` order in tutorials (originally written reversed due to [nextflow bug](https://github.com/nextflow-io/nextflow/issues/1792))
+- [#653](https://github.com/nf-core/eager/issues/653) - Fixed file collision errors with sexdeterrmine for two same-named libraries with different strandedness
+
+### `Dependencies`
+
+- Bumped MultiQC to 1.10 for improved functionality
+- Bumped HOPS to 0.35 for MultiQC 1.10 compatibility
+
+### `Deprecated`
+
 ## [2.3.1] - 2021-01-14
 
 ### `Added`

diff --git a/Dockerfile b/Dockerfile
@@ -7,10 +7,10 @@ COPY environment.yml /
 RUN conda env create --quiet -f /environment.yml && conda clean -a
 
 # Add conda installation dir to PATH (instead of doing 'conda activate')
-ENV PATH /opt/conda/envs/nf-core-eager-2.3.1/bin:$PATH
+ENV PATH /opt/conda/envs/nf-core-eager-2.3.2/bin:$PATH
 
 # Dump the details of the installed packages to a file for posterity
-RUN conda env export --name nf-core-eager-2.3.1 > nf-core-eager-2.3.1.yml
+RUN conda env export --name nf-core-eager-2.3.2 > nf-core-eager-2.3.2.yml
 
 # Instruct R processes to use these empty files instead of clashing with a local version
 RUN touch .Rprofile

diff --git a/README.md b/README.md
@@ -1,10 +1,10 @@
-# ![nf-core/eager](docs/images/nf-core-eager_logo.png)
+# ![nf-core/eager](docs/images/nf-core_eager_logo.png)
 
 **A fully reproducible and state-of-the-art ancient DNA analysis pipeline**.
 
 [![GitHub Actions CI Status](https://github.com/nf-core/eager/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/eager/actions)
 [![GitHub Actions Linting Status](https://github.com/nf-core/eager/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/eager/actions)
-[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.04.0-brightgreen.svg)](https://www.nextflow.io/)
+[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A520.07.1-brightgreen.svg)](https://www.nextflow.io/)
 [![nf-core](https://img.shields.io/badge/nf--core-pipeline-brightgreen.svg)](https://nf-co.re/)
 [![DOI](https://zenodo.org/badge/135918251.svg)](https://zenodo.org/badge/latestdoi/135918251)
 
@@ -158,7 +158,10 @@ of this pipeline:
 
 Those who have provided conceptual guidance, suggestions, bug reports etc.
 
+* [Alexandre Gilardet](https://github.com/alexandregilardet)
 * Arielle Munters
+* [Charles Plessy](https://github.com/charles-plessy)
+* [Åshild Vågene](https://github.com/ashildv)
 * [Hester van Schalkwyk](https://github.com/hesterjvs)
 * [Ido Bar](https://github.com/IdoBar)
 * [Irina Velsko](https://github.com/ivelsko)

diff --git a/bin/kraken_parse.py b/bin/kraken_parse.py
@@ -19,18 +19,24 @@ def _get_args():
         default=50,
         help="Minimum number of hits on clade to report it. Default = 50")
     parser.add_argument(
-        '-o',
-        dest="output",
+        '-or',
+        dest="readout",
         default=None,
-        help="Output file. Default = <basename>.kraken_parsed.csv")
+        help="Read count output file. Default = <basename>.read_kraken_parsed.csv")
+    parser.add_argument(
+        '-ok',
+        dest="kmerout",
+        default=None,
+        help="Kmer output file. Default = <basename>.kmer_kraken_parsed.csv")
 
     args = parser.parse_args()
 
     infile = args.krakenReport
     countlim = int(args.count)
-    outfile = args.output
+    readout = args.readout
+    kmerout = args.kmerout
 
-    return(infile, countlim, outfile)
+    return(infile, countlim, readout, kmerout)
 
 
 def _get_basename(file_name):
@@ -51,14 +57,23 @@ def parse_kraken(infile, countlim):
 
     '''
     with open(infile, 'r') as f:
-        resdict = {}
+        read_dict = {}
+        kmer_dict = {}
         csvreader = csv.reader(f, delimiter='\t')
         for line in csvreader:
             reads = int(line[1])
             if reads >= countlim:
-                taxid = line[4]
-                resdict[taxid] = reads
-        return(resdict)
+                taxid = line[6]
+                kmer = line[3]
+                unique_kmer = line[4]
+                try:
+                    kmer_duplicity = float(kmer)/float(unique_kmer)
+                except ZeroDivisionError:
+                    kmer_duplicity = 0
+                read_dict[taxid] = reads
+                kmer_dict[taxid] = kmer_duplicity
+
+        return(read_dict, kmer_dict)
 
 
 def write_output(resdict, infile, outfile):
@@ -70,10 +85,17 @@ def write_output(resdict, infile, outfile):
 
 
 if __name__ == '__main__':
-    INFILE, COUNTLIM, outfile = _get_args()
+    INFILE, COUNTLIM, readout, kmerout = _get_args()
 
-    if not outfile:
-        outfile = _get_basename(INFILE)+".kraken_parsed.csv"
+    if not readout:
+        read_outfile = _get_basename(INFILE)+".read_kraken_parsed.csv"
+    else:
+        read_outfile = readout
+    if not kmerout:
+        kmer_outfile = _get_basename(INFILE)+".kmer_kraken_parsed.csv"
+    else:
+        kmer_outfile = kmerout
 
-    tmp_dict = parse_kraken(infile=INFILE, countlim=COUNTLIM)
-    write_output(resdict=tmp_dict, infile=INFILE, outfile=outfile)
+    read_dict, kmer_dict = parse_kraken(infile=INFILE, countlim=COUNTLIM)
+    write_output(resdict=read_dict, infile=INFILE, outfile=read_outfile)
+    write_output(resdict=kmer_dict, infile=INFILE, outfile=kmer_outfile)
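
The change to `parse_kraken` above can be illustrated with a small standalone sketch. It assumes the report is tab-separated with clade reads in column 2, kmer and unique kmer counts in columns 4 and 5, and the taxid in column 7, which is how the patched code indexes each line. The function name `parse_report` and the example rows are hypothetical and are not part of the pipeline.

```python
import csv
import io


def parse_report(handle, countlim=50):
    """Return (read_dict, kmer_dict) keyed by taxid, mirroring parse_kraken."""
    read_dict, kmer_dict = {}, {}
    for row in csv.reader(handle, delimiter="\t"):
        reads = int(row[1])
        if reads < countlim:
            # Skip taxa below the minimum read count
            continue
        taxid = row[6]
        kmers, unique_kmers = float(row[3]), float(row[4])
        # Kmer duplication = kmers / unique kmers; fall back to 0 when
        # there are no unique kmers, as the patched script does
        kmer_dict[taxid] = kmers / unique_kmers if unique_kmers else 0
        read_dict[taxid] = reads
    return read_dict, kmer_dict


# Made-up report rows, for illustration only
EXAMPLE = (
    "10.0\t120\t100\t600\t200\tS\t1234\tTaxon A\n"
    "1.0\t60\t60\t90\t0\tS\t5678\tTaxon B\n"
    "0.1\t5\t5\t10\t10\tS\t9999\tTaxon C\n"
)

read_dict, kmer_dict = parse_report(io.StringIO(EXAMPLE))
print(read_dict)
print(kmer_dict)
```

With these rows, the third taxon is dropped because it has fewer than 50 reads, and the second gets a duplication of 0 because it has no unique kmers.
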
diff --git a/bin/merge_kraken_res.py b/bin/merge_kraken_res.py
@@ -15,21 +15,29 @@ def _get_args():
         formatter_class=argparse.RawDescriptionHelpFormatter,
         description='Merging csv count files in one table')
     parser.add_argument(
-        '-o',
-        dest="output",
-        default="kraken_count_table.csv",
-        help="Output file. Default = kraken_count_table.csv")
+        '-or',
+        dest="readout",
+        default="kraken_read_count_table.csv",
+        help="Read count output file. Default = kraken_read_count_table.csv")
+    parser.add_argument(
+        '-ok',
+        dest="kmerout",
+        default="kraken_kmer_unicity_table.csv",
+        help="Kmer unicity output file. Default = kraken_kmer_unicity_table.csv")
 
     args = parser.parse_args()
 
-    outfile = args.output
+    readout = args.readout
+    kmerout = args.kmerout
 
-    return(outfile)
+    return(readout, kmerout)
 
 
 def get_csv():
     tmp = [i for i in os.listdir() if ".csv" in i]
-    return(tmp)
+    kmer = [i for i in tmp if '.kmer_' in i]
+    read = [i for i in tmp if '.read_' in i]
+    return(read, kmer)
 
 
 def _get_basename(file_name):
@@ -54,8 +62,9 @@ def write_csv(pd_dataframe, outfile):
 
 
 if __name__ == "__main__":
-    OUTFILE = _get_args()
-    all_csv = get_csv()
-    resdf = merge_csv(all_csv)
-    write_csv(resdf, OUTFILE)
-    print(resdf)
+    READOUT, KMEROUT = _get_args()
+    reads, kmers = get_csv()
+    read_df = merge_csv(reads)
+    kmer_df = merge_csv(kmers)
+    write_csv(read_df, READOUT)
+    write_csv(kmer_df, KMEROUT)
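
As a rough sketch of how the new `get_csv` partitions its inputs, the snippet below applies the same `.read_` / `.kmer_` substring tests to a list of file names instead of `os.listdir()`. The helper name `split_csv` and the file names are hypothetical. It also uses `endswith(".csv")` rather than the substring test in the script, which is a deliberate variation and not the script's behaviour.

```python
def split_csv(filenames):
    """Split per-sample CSV names into (read count files, kmer files)."""
    # Variation on the script: only accept names that end in .csv
    csvs = [name for name in filenames if name.endswith(".csv")]
    reads = [name for name in csvs if ".read_" in name]
    kmers = [name for name in csvs if ".kmer_" in name]
    return reads, kmers


# Hypothetical working directory contents
FILES = [
    "sampleA.read_kraken_parsed.csv",
    "sampleA.kmer_kraken_parsed.csv",
    "sampleB.read_kraken_parsed.csv",
    "sampleB.kmer_kraken_parsed.csv",
    "kraken_read_count_table.csv",
    "notes.txt",
]

reads, kmers = split_csv(FILES)
print(reads)
print(kmers)
```

Note that a merged table such as `kraken_read_count_table.csv` contains `_read_` rather than `.read_`, so it matches neither filter and is not picked up as an input.
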
diff --git a/conf/base.config b/conf/base.config
@@ -14,7 +14,7 @@ process {
   memory = { check_max( 7.GB * task.attempt, 'memory' ) }
   time = { check_max( 24.h * task.attempt, 'time' ) }
 
-  errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
+  errorStrategy = { task.exitStatus in [143,137,104,134,139, 140] ? 'retry' : 'finish' }
   maxRetries = 3
   maxErrors = '-1'
 
@@ -74,38 +74,34 @@ process {
   }
 
   withName:qualimap{
-    errorStrategy = { task.exitStatus in [1,143,137,104,134,139] ? 'retry' : 'finish' }
+    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : task.exitStatus in [255] ? 'ignore' : 'finish' }
   }
 
   withName:preseq {
     errorStrategy = 'ignore'
   }
 
   withName:damageprofiler {
-    errorStrategy = { task.exitStatus in [1,143,137,104,134,139] ? 'retry' : 'finish' }
+    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' }
   }
 
   // Add 1 retry for certain java tools as not enough heap space java errors gives exit code 1
   withName: dedup {
-    errorStrategy = { task.exitStatus in [1,143,137,104,134,139] ? 'retry' : 'finish' } 
+    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' } 
   }
 
   withName: markduplicates {
-    errorStrategy = { task.exitStatus in [143,137] ? 'retry' : 'finish' } 
+    errorStrategy = { task.exitStatus in [143,137, 140] ? 'retry' : 'finish' } 
   }
 
   // Add 1 retry as not enough heapspace java error gives exit code 1
   withName: malt {
-    errorStrategy = { task.exitStatus in [1,143,137,104,134,139] ? 'retry' : 'finish' } 
+    errorStrategy = { task.exitStatus in [1,143,137,104,134,139, 140] ? 'retry' : 'finish' } 
   }
 
   // other process specific exit statuses
   withName: nuclear_contamination {
-    errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'ignore' : 'retry' }
-  }
-
-  withName: multiqc {
-    errorStrategy = { task.exitStatus in [143,137] ? 'retry' : 'ignore' }
+    errorStrategy = { task.exitStatus in [143,137,104,134,139, 140] ? 'ignore' : 'retry' }
   }
 
 }
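
The `errorStrategy` closures above are Groovy, but the decision logic can be modelled in a few lines of Python to make the effect of the change easier to see. This is only an illustrative model of the default and `qualimap` rules as written in this diff. The function names are hypothetical and nothing here is Nextflow API.

```python
# Exit codes treated as retryable by the default rule after this change
RETRYABLE = {143, 137, 104, 134, 139, 140}


def default_strategy(exit_status):
    """Model of the process-wide rule: retry known codes, else finish."""
    return "retry" if exit_status in RETRYABLE else "finish"


def qualimap_strategy(exit_status):
    """Model of the qualimap rule: also retry on 1, ignore on 255."""
    if exit_status in RETRYABLE | {1}:
        return "retry"
    if exit_status == 255:
        return "ignore"
    return "finish"


for code in (140, 255, 1, 2):
    print(code, default_strategy(code), qualimap_strategy(code))
```

Exit code `140` is now retried under both rules, while `255` lets the pipeline continue past a failed `qualimap` task instead of finishing.
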

diff --git a/docs/images/tutorials/profiles/config_profile_inheritence.png b/docs/images/tutorials/profiles/config_profile_inheritence.png
diff --git a/docs/images/tutorials/profiles/config_profile_inheritence.svg b/docs/images/tutorials/profiles/config_profile_inheritence.svg
diff --git a/docs/output.md b/docs/output.md
@@ -551,6 +551,8 @@ Note that many of the statistics from this module are displayed in the General S
 
 You will receive output for each *sample*. This means you will statistics of deduplicated values of all types of libraries combined in a single value (i.e. non-UDG treated, full-UDG, paired-end, single-end all together).
 
+:warning: If your library has no reads mapping to the reference, this will result in an empty BAM file. Qualimap will therefore not produce any output even if a BAM exists!
+
 #### Coverage Histogram
 
 This plot shows on the Y axis the range of fold coverages that the bases of the reference genome are possibly covered by. The Y axis shows the number of bases that were covered at the given fold coverage depth as indicated on the Y axis.
@@ -598,6 +600,8 @@ Sex.DetERRmine calculates the coverage of your mapped reads on the X and Y chrom
 
 When a bedfile of specific sites is provided, Sex.DetERRmine additionally calculates error bars around each relative coverage estimate. For this estimate to be trustworthy, the sites included in the bedfile should be spaced apart enough that a single sequencing read cannot overlap multiple sites. Hence, when a bedfile has not been provided, this error should be ignored. When a suitable bedfile is provided, each observation of a covered site is independent, and the error around the coverage is equal to the binomial error estimate. This error is then propagated during the calculation of relative coverage for the X and Y chromosomes.
 
+> Note that in nf-core/eager this will be run on single- and double-stranded variants of the same library _separately_. This can also help assess differential contamination between libraries.
+
 #### Relative Coverage
 
 Theoretically, males are expected to cluster around (0.5, 0.5) in the produced scatter plot, while females are expected to cluster around (1.0, 0.0). In practice, when analysing ancient DNA, these relative coverage on both axes is slightly lower than expected, and individuals can cluster around (0.45, 0.45) and (0.85, 0.05). As the number of covered sites for an individual gets smaller, the confidence on the estimate becomes lower, because it is increasingly more likely to be affected by randomness in the preservation and sequencing of ancient DNA.
@@ -667,7 +671,11 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
 - `metagenomic_complexity_filter` - this contains the output from filtering of input reads to metagenomic classification of low-sequence complexity reads as performed by `bbduk`. This will include the filtered FASTQ files (`*_lowcomplexityremoved.fq.gz`) and also the run-time log (`_bbduk.stats`) for each sample. **Note:** there are no sections in the MultiQC report for this module, therefore you must check the `._bbduk.stats` files to get summary statistics of the filtering.
 - `metagenomic_classification/` - this contains the output for a given metagenomic classifier.
   - Running MALT will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additional a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc. This will also include gzip SAM files if requested.
-  - Running kraken will contain the Kraken output and report files, as well as a merged Taxon count table.
+  - Running kraken will contain the Kraken output and report files, as well as a merged Taxon count table. You will also get a Kraken kmer duplication table, in a [KrakenUniq](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0) fashion. This is very useful to check for breadth of coverage and detect read stacking. A small number of aligned reads (low coverage) combined with a kmer duplication >1 is usually a sign of read stacking, which often indicates a false positive hit (e.g. from over-amplified libraries). *Kmer duplication is defined as: number of kmers / number of unique kmers*. Two Kraken report formats are available:
+    - the `*.kreport` which is the old report format, without distinct minimizer count information, used by some tools such as [Pavian](https://github.com/fbreitwieser/pavian)
+    - the `*.kraken2_report` which is the new kraken report format, with the distinct minimizer count information.  
+
+    Finally, the `*.kraken.out` files are the direct output of Kraken2.
 - `maltextract/` - this contains a `results` directory in which contains the output from MaltExtract - typically one folder for each filter type, an error and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA)
 - `consensus_sequence/` - this contains three FASTA files from VCF2Genome of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAS (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) are IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
 - `librarymerged_bams/` - these contain the final BAM files that would go into genotyping (if genotyping is turned on). This means the files will contain all libraries of a given sample (including trimmed non-UDG or half-UDG treated libraries, if BAM trimming turned on)