merfin_hist module added #5300
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
--- | ||
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json | ||
name: "merfin_hist" | ||
channels: | ||
- conda-forge | ||
- bioconda | ||
- defaults | ||
dependencies: | ||
- "bioconda::merfin=1.0" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
process MERFIN_HIST { | ||
tag "$meta.id" | ||
label 'process_medium' | ||
|
||
conda "${moduleDir}/environment.yml" | ||
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? | ||
'https://depot.galaxyproject.org/singularity/merfin:1.0--h4ac6f70_2': | ||
'biocontainers/merfin:1.0--h4ac6f70_2' }" | ||
|
||
input: | ||
tuple val(meta), path(fasta_assembly) // Required: -sequence input; FASTA or FASTQ, uncompressed or gzip-compressed. | ||
tuple val(meta1), path(meryl_db_reads) // Required: -readmers input (meryl db built from raw reads). It comes from another tool, so keeping its own meta may be useful. | ||
path(lookup_table) // Optional input vector of probabilities (obtained by genomescope2 with parameter --fitted_hist). | ||
path(seqmers) // Optional input for pre-built sequence meryl db (-seqmers). | ||
val(peak) // Required input to hard set copy 1 and infer multiplicity to copy number. | ||
Just to make sure: the last 3 inputs are not sample-specific?
Yes, they are sample-specific. Do you think I should change the logic? Concerning the last input, it is a value extracted from the results of other tools, so my plan was to keep it simple without a meta attached. The other inputs are optional.
It depends on the use case. If you believe input synchronization will work well without meta, I think you can leave it as is.
||
|
||
output: | ||
tuple val(meta), path("*.hist") , emit: hist | ||
path("*.hist.stderr.log") , emit: log_stderr | ||
path "versions.yml" , emit: versions | ||
|
||
when: | ||
task.ext.when == null || task.ext.when | ||
|
||
script: | ||
def args = task.ext.args ?: '' | ||
def prefix = task.ext.prefix ?: "${meta.id}" | ||
def optional_lookup_table = lookup_table ? "-prob ${lookup_table}" : "" | ||
def optional_seqmers = seqmers ? "-seqmers ${seqmers}" : "" | ||
""" | ||
merfin -hist \\ | ||
-threads $task.cpus \\ | ||
$args \\ | ||
-sequence $fasta_assembly \\ | ||
-readmers $meryl_db_reads \\ | ||
-peak $peak \\ | ||
$optional_lookup_table \\ | ||
$optional_seqmers \\ | ||
-output ${prefix}.hist \\ | ||
2> >( tee ${prefix}.hist.stderr.log >&2 ) | ||
|
||
cat <<-END_VERSIONS > versions.yml | ||
"${task.process}": | ||
merfin: \$( merfin --version |& sed 's/merfin //' ) | ||
END_VERSIONS | ||
""" | ||
|
||
stub: | ||
def args = task.ext.args ?: '' | ||
def prefix = task.ext.prefix ?: "${meta.id}" | ||
def optional_lookup_table = lookup_table ? "-prob ${lookup_table}" : "" | ||
def optional_seqmers = seqmers ? "-seqmers ${seqmers}" : "" | ||
""" | ||
touch ${prefix}.hist | ||
touch ${prefix}.hist.stderr.log | ||
|
||
cat <<-END_VERSIONS > versions.yml | ||
"${task.process}": | ||
merfin: \$( merfin --version |& sed 's/merfin //' ) | ||
END_VERSIONS | ||
""" | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
--- | ||
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/meta-schema.json | ||
name: "merfin_hist" | ||
description: Compare k-mer frequency in reads and assembly to devise the metrics K* and QV* | ||
keywords: | ||
- assembly | ||
- evaluation | ||
- quality | ||
- completeness | ||
tools: | ||
- "merfin": | ||
description: "Merfin (k-mer based finishing tool) is a suite of subtools for variant filtering, assembly evaluation and polishing via k-mer validation. The subtool -hist estimates the QV (quality value of [Merqury](https://github.com/marbl/merqury)) for each scaffold/contig and genome-wide averages. In addition, Merfin produces a QV* estimate, which also accounts for k-mers that are seen in excess with respect to their expected multiplicity predicted from the reads." | ||
homepage: "https://github.com/arangrhie/merfin" | ||
documentation: "https://github.com/arangrhie/merfin/wiki/Best-practices-for-Merfin" | ||
doi: "10.1038/s41592-022-01445-y" | ||
licence: ["Apache-2.0"] | ||
|
||
input: | ||
- meta: | ||
type: map | ||
description: | | ||
Groovy Map containing sample information | ||
e.g. `[ id:'sample1', single_end:false ]` | ||
|
||
- fasta_assembly: | ||
type: file | ||
description: Genome assembly in FASTA, uncompressed or gzip-compressed [REQUIRED] | ||
pattern: "*.{fasta,fasta.gz}" | ||
|
||
- meta1: | ||
type: map | ||
description: | | ||
Groovy Map containing sample read information | ||
e.g. `[ id:'sample1', single_end:false ]` | ||
|
||
- meryl_db_reads: | ||
type: file | ||
description: K-mer database produced from raw reads using Meryl [REQUIRED] | ||
pattern: "*.{meryl_db}" | ||
|
||
- lookup_table: | ||
type: file | ||
description: Input vector of k-mer probabilities (obtained by genomescope2 with parameter --fitted_hist) [OPTIONAL] | ||
pattern: "lookup_table.txt" | ||
|
||
- seqmers: | ||
type: file | ||
description: Input for pre-built sequence meryl db. By default, the sequence meryl db will be generated from the input genome assembly [OPTIONAL] | ||
pattern: "*.{meryl_db}" | ||
|
||
- peak: | ||
type: float | ||
description: Input to hard set copy 1 and infer multiplicity to copy number. Can be calculated using genomescope2 [REQUIRED] | ||
|
||
output: | ||
- meta: | ||
type: map | ||
description: | | ||
Groovy Map containing sample information | ||
e.g. `[ id:'sample1', single_end:false ]` | ||
|
||
- versions: | ||
type: file | ||
description: File containing software versions | ||
pattern: "versions.yml" | ||
|
||
- hist: | ||
type: file | ||
description: The generated 0-centered k* histogram for sequences in <fasta_assembly.fasta>. Positive k* values are expected collapsed copies. Negative k* values are expected expanded copies. Closer to 0 means the expected and found k-mers are well balanced, 1:1. | ||
pattern: "*.{hist}" | ||
|
||
- log_stderr: | ||
type: file | ||
description: Log (stderr) of hist tool execution. The QV and QV* metrics are reported at the end. | ||
pattern: "*.{hist.stderr.log}" | ||
|
||
authors: | ||
- "@rodtheo" | ||
maintainers: | ||
- "@rodtheo" |
This might be fine, although you could consider using a single input channel with one meta and two paths. This would probably be more convenient in pipelines, especially because meta1 is not output.
It's an option. I do think that using meta1 can simplify things and keep the code clean, because in the second channel (`tuple val(meta1), path(meryl_db_reads)`) we could directly use the output channel `meryl_db` from module MERYL_HISTOGRAM. Let me know if you have another opinion, and thank you very much for your reviews.
Additional comment: I'll try to fix the `nf-test` failures in the conda and singularity environments.
It's slightly controversial and both approaches have some rationale behind them. As long as you have an idea of how it will be integrated in pipelines, I don't see much issue with having two channels if it makes things simpler.
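To make the trade-off in this thread concrete, here is a minimal sketch of how a pipeline might feed the two-channel form while still pairing inputs by sample rather than by emission order. It is an illustration, not part of the PR: the include path, file names, and peak value are assumptions, and it presumes that both channels carry an identical meta map so that `join` can match them.

```nextflow
// Sketch only: the include path, file names and peak value are assumptions.
include { MERFIN_HIST } from './modules/nf-core/merfin/hist/main'

workflow {
    // Hypothetical inputs; in a pipeline these would come from upstream modules.
    ch_assembly = Channel.of( [ [ id:'sample1' ], file('sample1.fasta') ] )
    ch_readmers = Channel.of( [ [ id:'sample1' ], file('sample1.meryl') ] )
    peak        = 106.7 // placeholder value, e.g. taken from a genomescope2 run

    // Join on the shared meta map so each assembly is paired with the
    // read database of the same sample, then split the result back into
    // the two tuples that the process declares.
    ch_in = ch_assembly
        .join( ch_readmers )
        .multiMap { meta, fasta, meryl_db ->
            assembly: [ meta, fasta ]
            readmers: [ meta, meryl_db ]
        }

    // Empty lists stand in for the optional lookup_table and seqmers inputs.
    MERFIN_HIST( ch_in.assembly, ch_in.readmers, [], [], peak )
}
```

With a single combined input tuple (`meta`, assembly, read database), the `multiMap` step would be unnecessary and the joined channel could be passed directly. With two separate channels, items are consumed in emission order, so an explicit join like this is one way to keep samples aligned.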