merfin_hist module added #5300

Closed
29 commits
951032f
merfin_hist module added
rodtheo Mar 20, 2024
383a1fa
Add module seqfu/stats (#5275)
telatin Mar 20, 2024
c2342b4
nf-test bases2fastq (#5272)
kobelavaerts Mar 20, 2024
f42872b
update paths for VEP (#5281)
maxulysse Mar 20, 2024
653218e
4557 new module kaijumergeoutputs + stub Kraken2/Kraken2 (#5249)
Joon-Klaps Mar 20, 2024
89cf292
gatk4_asereadcounter add updated meta and nf-tests (#5164)
Lucpen Mar 20, 2024
60a7dba
Add README to modules build with Docker (#4935)
maxulysse Mar 20, 2024
3afb95b
add cram/index support to bwamem2 (#5248)
matthdsm Mar 20, 2024
8487a44
Bamstats (#4474)
johnoooh Mar 20, 2024
61f2ea5
Add wittyer as module (#5171)
famosab Mar 20, 2024
dcf17cc
Remove unnecessary .view() in subworkflows/nf-core/vcf_phase_shapeit5…
dimple-aspiring-cat Mar 20, 2024
0c39191
Igv reports (#5263)
soulj Mar 20, 2024
53c2b46
Revert "update kallistobustools count output list" (#5307)
fmalmeida Mar 20, 2024
1774f78
Revert "add paths in output directive in cellranger cout module" (#5306)
fmalmeida Mar 20, 2024
9d0f89b
Remove AMRFinderPlus DB update on each invocation (#5232)
oschwengers Mar 20, 2024
9f892b5
Add subworkflow mapAD (#5239)
jch-13 Mar 20, 2024
6920a61
Leviosam2 index (#5316)
lgrochowalski Mar 20, 2024
7b29d1b
Added contrast limited adaptive histogram equalization module (#5268)
kbestak Mar 20, 2024
dd2757c
add cram/index support to dragmap (#5303)
matthdsm Mar 20, 2024
bf021bf
scimap/spatiallda (#5260)
migueLib Mar 20, 2024
4685ac9
New module svtypersso (#5311)
tstoeriko Mar 20, 2024
c331b11
Add and update - picard/gatk4 - addorreplacereadgroups (#5302)
tomiles Mar 20, 2024
2d5ea49
Tcoffee tcs (#5288)
alessiovignoli Mar 20, 2024
fc63cd1
Add module nanofilt (#5290)
lfreitasl Mar 20, 2024
15e2db9
Add president module (#5256)
paulwolk Mar 20, 2024
1332943
merfin_hist module added
rodtheo Mar 20, 2024
f9fdd0b
nf-test with multiple inputs fixed
rodtheo Mar 20, 2024
a83c25d
nf-tests insert file URLs instead of test_data
rodtheo Mar 20, 2024
f67a420
fix merging issues
rodtheo Mar 20, 2024
9 changes: 9 additions & 0 deletions modules/nf-core/merfin/hist/environment.yml
@@ -0,0 +1,9 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
name: "merfin_hist"
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- "bioconda::merfin=1.0"
62 changes: 62 additions & 0 deletions modules/nf-core/merfin/hist/main.nf
@@ -0,0 +1,62 @@
process MERFIN_HIST {
    tag "$meta.id"
    label 'process_medium'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/merfin:1.0--h4ac6f70_2':
        'biocontainers/merfin:1.0--h4ac6f70_2' }"

    input:
    tuple val(meta), path(fasta_assembly)   // Required: input sequence (-sequence); FASTA or FASTQ, uncompressed or gzip-compressed.
    tuple val(meta1), path(meryl_db_reads)  // Required: readmers (meryl db built from raw reads). It comes from another tool, so it may be relevant to maintain its meta.
Contributor

This might be fine, although you could consider using a single input channel with one meta and two paths. This would probably be more convenient in pipelines, especially because meta1 is not output.

Contributor Author

It's an option. I do think that using meta1 can simplify things and keep the code clean, because the second channel (tuple val(meta1), path(meryl_db_reads)) could directly consume the output channel meryl_db from the MERYL_HISTOGRAM module. Let me know if you have another opinion, and thank you very much for your reviews.

Contributor Author

Additional comment: I'll try to fix the nf-test failures in the conda and singularity environments.

Contributor

It's slightly controversial and both approaches have some rationale behind them. As long as you have an idea of how it will be integrated in pipelines, I don't see much issue with having two channels if it makes things simpler.
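
The two designs discussed in this thread can be compared with a small sketch. It assumes two hypothetical upstream channels, `ch_assembly` and `ch_meryl_db`, each emitting `[ meta, path ]` tuples; the sample ids and file names are placeholders, and plain strings stand in for real paths. The two-channel signature consumes both channels as they are, whereas the single-channel variant would first need a `join` on the meta map:

```nextflow
// Sketch only: channel names, sample ids and file names are hypothetical placeholders.
workflow {
    // Assemblies and read k-mer databases may arrive in different orders.
    ch_assembly = Channel.of(
        [ [ id:'sample1' ], 'sample1.fasta' ],
        [ [ id:'sample2' ], 'sample2.fasta' ]
    )
    ch_meryl_db = Channel.of(
        [ [ id:'sample2' ], 'sample2.meryl' ],
        [ [ id:'sample1' ], 'sample1.meryl' ]
    )

    // join() matches tuples on their first element (here the meta map), so each
    // assembly is paired with the database of the same sample.
    // Resulting tuple shape: [ meta, fasta_assembly, meryl_db_reads ]
    ch_single_input = ch_assembly.join(ch_meryl_db)

    ch_single_input.view()
}
```

One point that may matter for the choice: a process reading from two separate queue channels pairs items by arrival order, not by meta. With more than one sample, a `join` (or an equivalent matching step) is therefore likely to be needed upstream in either design. Note also that `join` only matches when both meta maps are identical.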

    path(lookup_table)                      // Optional: input vector of probabilities (obtained by genomescope2 with parameter --fitted_hist).
    path(seqmers)                           // Optional: pre-built sequence meryl db (-seqmers).
    val(peak)                               // Required: value used to hard set copy 1 and to infer copy number from multiplicity.
Contributor

Just to make sure: the last 3 inputs are not sample-specific?

Contributor Author

Yes, they are sample-specific. Do you think I should change the logic? Concerning the last input, it is a value extracted from the results of other tools, therefore my plan was to keep it simple without a meta attached. The other inputs are optional.

Contributor

It depends on the use case. If you believe input synchronization will work well without meta, I think you can leave it as is.
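
One way to keep sample-specific inputs synchronized when some of them carry no meta is to collect everything in one tuple per sample and split it with `multiMap` just before the call. The following is a minimal sketch, not part of this PR. It assumes a hypothetical channel `ch_merfin_in` whose tuples hold `[ meta, fasta, meryl_db, lookup_table, peak ]`. All values are placeholders, and the commented-out call passes an empty list for the unused `seqmers` input, following the common nf-core convention for optional paths:

```nextflow
// Sketch only: ch_merfin_in and every value in it are hypothetical placeholders.
workflow {
    ch_merfin_in = Channel.of(
        [ [ id:'sample1' ], 'sample1.fasta', 'sample1.meryl', 'sample1_lookup_table.txt', 23.5 ],
        [ [ id:'sample2' ], 'sample2.fasta', 'sample2.meryl', 'sample2_lookup_table.txt', 41.0 ]
    )

    // multiMap emits one item per branch for every incoming tuple, so all
    // branches stay in the same order and the process receives matched inputs.
    ch_split = ch_merfin_in.multiMap { meta, fasta, meryl_db, lookup_table, peak ->
        assembly:   [ meta, fasta ]
        readmers:   [ meta, meryl_db ]
        prob:       lookup_table
        copy1_peak: peak
    }

    ch_split.copy1_peak.view { value -> "peak: ${value}" }

    // With the module included, the call could then look like:
    // MERFIN_HIST( ch_split.assembly, ch_split.readmers, ch_split.prob, [], ch_split.copy1_peak )
}
```

Because every branch originates from the same source tuple, the `peak` value cannot drift away from its sample even though it travels without a meta map.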


    output:
    tuple val(meta), path("*.hist")   , emit: hist
    path("*.hist.stderr.log")         , emit: log_stderr
    path "versions.yml"               , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    def optional_lookup_table = lookup_table ? "-prob ${lookup_table}" : ""
    def optional_seqmers = seqmers ? "-seqmers ${seqmers}" : ""
    """
    merfin -hist \\
        -threads $task.cpus \\
        $args \\
        -sequence $fasta_assembly \\
        -readmers $meryl_db_reads \\
        -peak $peak \\
        $optional_lookup_table \\
        $optional_seqmers \\
        -output ${prefix}.hist \\
        2> >( tee ${prefix}.hist.stderr.log >&2 )

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        merfin: \$( merfin --version |& sed 's/merfin //' )
    END_VERSIONS
    """

    stub:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    def optional_lookup_table = lookup_table ? "-prob ${lookup_table}" : ""
    def optional_seqmers = seqmers ? "-seqmers ${seqmers}" : ""
    """
    touch ${prefix}.hist
    touch ${prefix}.hist.stderr.log

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        merfin: \$( merfin --version |& sed 's/merfin //' )
    END_VERSIONS
    """
}
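
The `task.ext.args` and `task.ext.prefix` lookups in the script block are normally set from a pipeline configuration file. A minimal sketch of such a fragment follows, assuming the process is included under its default name `MERFIN_HIST`. The `.merfin` suffix is an arbitrary placeholder, and `ext.args` is left empty rather than guessing at merfin options:

```nextflow
// Sketch of a pipeline config fragment (e.g. conf/modules.config); values are placeholders.
process {
    withName: 'MERFIN_HIST' {
        // Extra command-line options would be passed through here; none are assumed.
        ext.args   = ''
        // A closure, so that meta is resolved per task.
        // With this prefix the outputs become <id>.merfin.hist and <id>.merfin.hist.stderr.log
        ext.prefix = { "${meta.id}.merfin" }
    }
}
```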
80 changes: 80 additions & 0 deletions modules/nf-core/merfin/hist/meta.yml
@@ -0,0 +1,80 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/meta-schema.json
name: "merfin_hist"
description: Compare k-mer frequency in reads and assembly to devise the metrics K* and QV*
keywords:
  - assembly
  - evaluation
  - quality
  - completeness
tools:
  - "merfin":
      description: "Merfin (k-mer based finishing tool) is a suite of subtools for variant filtering, assembly evaluation and polishing via k-mer validation. The subtool -hist estimates the QV (quality value of [Merqury](https://github.com/marbl/merqury)) for each scaffold/contig and genome-wide averages. In addition, Merfin produces a QV* estimate, which also accounts for k-mers that are seen in excess with respect to their expected multiplicity predicted from the reads."
      homepage: "https://github.com/arangrhie/merfin"
      documentation: "https://github.com/arangrhie/merfin/wiki/Best-practices-for-Merfin"
      doi: "10.1038/s41592-022-01445-y"
      licence: ["Apache-2.0"]

input:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. `[ id:'sample1', single_end:false ]`

  - fasta_assembly:
      type: file
      description: Genome assembly in FASTA, uncompressed or gzip-compressed [REQUIRED]
      pattern: "*.{fasta,fasta.gz}"

  - meta1:
      type: map
      description: |
        Groovy Map containing sample read information
        e.g. `[ id:'sample1', single_end:false ]`

  - meryl_db_reads:
      type: file
      description: K-mer database produced from raw reads using Meryl [REQUIRED]
      pattern: "*.{meryl_db}"

  - lookup_table:
      type: file
      description: Input vector of k-mer probabilities (obtained by genomescope2 with parameter --fitted_hist) [OPTIONAL]
      pattern: "lookup_table.txt"

  - seqmers:
      type: file
      description: Input for pre-built sequence meryl db. By default, the sequence meryl db will be generated from the input genome assembly [OPTIONAL]
      pattern: "*.{meryl_db}"

  - peak:
      type: float
      description: Input to hard set copy 1 and infer multiplicity to copy number. Can be calculated using genomescope2 [REQUIRED]

output:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. `[ id:'sample1', single_end:false ]`

  - versions:
      type: file
      description: File containing software versions
      pattern: "versions.yml"

  - hist:
      type: file
      description: The generated 0-centered k* histogram for sequences in <fasta_assembly.fasta>. Positive k* values are expected collapsed copies. Negative k* values are expected expanded copies. Closer to 0 means the expected and found k-mers are well balanced, 1:1.
      pattern: "*.{hist}"

  - log_stderr:
      type: file
      description: Log (stderr) of hist tool execution. The QV and QV* metrics are reported at the end.
      pattern: "*.{hist.stderr.log}"

authors:
  - "@rodtheo"
maintainers:
  - "@rodtheo"